All of lore.kernel.org
 help / color / mirror / Atom feed
* [pull request][net-next 00/15] mlx5 updates 2023-01-18
@ 2023-01-18 18:35 Saeed Mahameed
  2023-01-18 18:35 ` [net-next 01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB Saeed Mahameed
                   ` (14 more replies)
  0 siblings, 15 replies; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan

From: Saeed Mahameed <saeedm@nvidia.com>

This series provides misc updates to mxl5.
For more information please see tag log below.

Please pull and let me know if there is any problem.

Thanks,
Saeed.


The following changes since commit 68e5b6aa2795fd05c6ff58616cb16f2f216e4123:

  xdp: document xdp_do_flush() before napi_complete_done() (2023-01-18 14:33:34 +0000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2023-01-18

for you to fetch changes up to efb4879f76236f248bbe5b9e2bf408d9470d59f3:

  net/mlx5e: Use read lock for eswitch get callbacks (2023-01-18 10:34:09 -0800)

----------------------------------------------------------------
mlx5-updates-2023-01-18

1) From Rahul,
  1.1) extended range for PTP adjtime and adjphase
  1.2) adjphase function to support hardware-only offset control

2) From Roi, code cleanup to the TC module.

3) From Maor, TC support for Geneve and GRE with VF tunnel offload

4) Cleanups and minor updates.

----------------------------------------------------------------
Adham Faris (2):
      net/mlx5e: Fail with messages when params are not valid for XSK
      net/mlx5e: Add warning when log WQE size is smaller than log stride size

Leon Romanovsky (1):
      net/mlx5e: Use read lock for eswitch get callbacks

Maor Dickman (2):
      net/mlx5e: Support Geneve and GRE with VF tunnel offload
      net/mlx5e: Remove redundant allocation of spec in create indirect fwd group

Rahul Rameshbabu (3):
      net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB
      net/mlx5: Add adjphase function to support hardware-only offset control
      net/mlx5: Add hardware extended range support for PTP adjtime and adjphase

Roi Dayan (6):
      net/mlx5: E-switch, Remove redundant comment about meta rules
      net/mlx5e: TC, Pass flow attr to attach/detach mod hdr functions
      net/mlx5e: TC, Add tc prefix to attach/detach hdr functions
      net/mlx5e: TC, Use common function allocating flow mod hdr or encap mod hdr
      net/mlx5e: Warn when destroying mod hdr hash table that is not empty
      net/mlx5: E-Switch, Fix typo for egress

Yishai Hadas (1):
      net/mlx5: Suppress error logging on UCTX creation

 drivers/net/ethernet/mellanox/mlx5/core/cmd.c      |   3 +-
 .../net/ethernet/mellanox/mlx5/core/en/mod_hdr.c   |   1 +
 .../net/ethernet/mellanox/mlx5/core/en/params.c    |  18 +-
 .../net/ethernet/mellanox/mlx5/core/en/tc_priv.h   |   2 -
 .../net/ethernet/mellanox/mlx5/core/en/tc_tun.c    |   2 -
 .../ethernet/mellanox/mlx5/core/en/tc_tun_encap.c  |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.c |  19 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    | 101 ++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.h    |  14 +-
 .../ethernet/mellanox/mlx5/core/esw/indir_table.c  | 213 ++++-----------------
 .../ethernet/mellanox/mlx5/core/esw/indir_table.h  |   4 -
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   6 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |  35 ++--
 .../net/ethernet/mellanox/mlx5/core/lib/clock.c    |  41 +++-
 include/linux/mlx5/mlx5_ifc.h                      |   4 +-
 16 files changed, 177 insertions(+), 293 deletions(-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [net-next 01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:31   ` Jacob Keller
  2023-01-20  4:00   ` patchwork-bot+netdevbpf
  2023-01-18 18:35 ` [net-next 02/15] net/mlx5: Suppress error logging on UCTX creation Saeed Mahameed
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Rahul Rameshbabu,
	kernel test robot

From: Rahul Rameshbabu <rrameshbabu@nvidia.com>

Send WQEBB size is 64 bytes and the max number of WQEBBs for an SQ is 255.
For 16KB pages and greater, there is always sufficient spaces for all
WQEBBs of an SQ. Cast mlx5e_get_max_sq_wqebbs(mdev) to u16. Prevents
-Wtautological-constant-out-of-range-compare warnings from occurring when
PAGE_SIZE >= 16KB.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 853f312cd757..5578f92f7e0f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -445,7 +445,7 @@ mlx5e_set_eseg_swp(struct sk_buff *skb, struct mlx5_wqe_eth_seg *eseg,
 
 static inline u16 mlx5e_stop_room_for_wqe(struct mlx5_core_dev *mdev, u16 wqe_size)
 {
-	WARN_ON_ONCE(PAGE_SIZE / MLX5_SEND_WQE_BB < mlx5e_get_max_sq_wqebbs(mdev));
+	WARN_ON_ONCE(PAGE_SIZE / MLX5_SEND_WQE_BB < (u16)mlx5e_get_max_sq_wqebbs(mdev));
 
 	/* A WQE must not cross the page boundary, hence two conditions:
 	 * 1. Its size must not exceed the page size.
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 02/15] net/mlx5: Suppress error logging on UCTX creation
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
  2023-01-18 18:35 ` [net-next 01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:32   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control Saeed Mahameed
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Yishai Hadas, Maor Gottlieb

From: Yishai Hadas <yishaih@nvidia.com>

Suppress error logging that can be triggered by userspace upon DEVX UCTX
creation.

The reason that it's not suppressed today with the uid check to suppress
DEVX is that MLX5_CMD_OP_CREATE_UCTX command still doesn't have a uid as
it comes to create it..

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index c3c8a7148723..382d02f6619c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -813,7 +813,8 @@ static void cmd_status_print(struct mlx5_core_dev *dev, void *in, void *out)
 	op_mod = MLX5_GET(mbox_in, in, op_mod);
 	uid    = MLX5_GET(mbox_in, in, uid);
 
-	if (!uid && opcode != MLX5_CMD_OP_DESTROY_MKEY)
+	if (!uid && opcode != MLX5_CMD_OP_DESTROY_MKEY &&
+	    opcode != MLX5_CMD_OP_CREATE_UCTX)
 		mlx5_cmd_out_err(dev, opcode, op_mod, out);
 }
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
  2023-01-18 18:35 ` [net-next 01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB Saeed Mahameed
  2023-01-18 18:35 ` [net-next 02/15] net/mlx5: Suppress error logging on UCTX creation Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:33   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 04/15] net/mlx5: Add hardware extended range support for PTP adjtime and adjphase Saeed Mahameed
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Rahul Rameshbabu, Gal Pressman

From: Rahul Rameshbabu <rrameshbabu@nvidia.com>

The adjtime function supports using hardware to set the clock offset when
the delta was supported by the hardware. When the delta is not supported by
the hardware, the driver handles adjusting the clock. The newly-introduced
adjphase function is similar to the adjtime function, except it guarantees
that a provided clock offset will be used directly by the hardware to
adjust the PTP clock. When the range is not acceptable by the hardware, an
error is returned.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
index 69318b143268..ecdff26a22b0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
@@ -326,6 +326,14 @@ static int mlx5_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
 	return 0;
 }
 
+static int mlx5_ptp_adjphase(struct ptp_clock_info *ptp, s32 delta)
+{
+	if (delta < S16_MIN || delta > S16_MAX)
+		return -ERANGE;
+
+	return mlx5_ptp_adjtime(ptp, delta);
+}
+
 static int mlx5_ptp_adjfreq_real_time(struct mlx5_core_dev *mdev, s32 freq)
 {
 	u32 in[MLX5_ST_SZ_DW(mtutc_reg)] = {};
@@ -688,6 +696,7 @@ static const struct ptp_clock_info mlx5_ptp_clock_info = {
 	.n_pins		= 0,
 	.pps		= 0,
 	.adjfine	= mlx5_ptp_adjfine,
+	.adjphase	= mlx5_ptp_adjphase,
 	.adjtime	= mlx5_ptp_adjtime,
 	.gettimex64	= mlx5_ptp_gettimex,
 	.settime64	= mlx5_ptp_settime,
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 04/15] net/mlx5: Add hardware extended range support for PTP adjtime and adjphase
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (2 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:35   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 05/15] net/mlx5: E-switch, Remove redundant comment about meta rules Saeed Mahameed
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Rahul Rameshbabu, Gal Pressman

From: Rahul Rameshbabu <rrameshbabu@nvidia.com>

Capable hardware can use an extended range for offsetting the clock. An
extended range of [-200000,200000] is used instead of [-32768,32767] for
the delta/phase parameter of the adjtime/adjphase ptp_clock_info callbacks.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/lib/clock.c   | 34 +++++++++++++++++--
 include/linux/mlx5/mlx5_ifc.h                 |  4 ++-
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
index ecdff26a22b0..75510a12ab02 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
@@ -69,6 +69,13 @@ enum {
 	MLX5_MTPPS_FS_OUT_PULSE_DURATION_NS     = BIT(0xa),
 };
 
+enum {
+	MLX5_MTUTC_OPERATION_ADJUST_TIME_MIN          = S16_MIN,
+	MLX5_MTUTC_OPERATION_ADJUST_TIME_MAX          = S16_MAX,
+	MLX5_MTUTC_OPERATION_ADJUST_TIME_EXTENDED_MIN = -200000,
+	MLX5_MTUTC_OPERATION_ADJUST_TIME_EXTENDED_MAX = 200000,
+};
+
 static bool mlx5_real_time_mode(struct mlx5_core_dev *mdev)
 {
 	return (mlx5_is_real_time_rq(mdev) || mlx5_is_real_time_sq(mdev));
@@ -86,6 +93,22 @@ static bool mlx5_modify_mtutc_allowed(struct mlx5_core_dev *mdev)
 	return MLX5_CAP_MCAM_FEATURE(mdev, ptpcyc2realtime_modify);
 }
 
+static bool mlx5_is_mtutc_time_adj_cap(struct mlx5_core_dev *mdev, s64 delta)
+{
+	s64 min = MLX5_MTUTC_OPERATION_ADJUST_TIME_MIN;
+	s64 max = MLX5_MTUTC_OPERATION_ADJUST_TIME_MAX;
+
+	if (MLX5_CAP_MCAM_FEATURE(mdev, mtutc_time_adjustment_extended_range)) {
+		min = MLX5_MTUTC_OPERATION_ADJUST_TIME_EXTENDED_MIN;
+		max = MLX5_MTUTC_OPERATION_ADJUST_TIME_EXTENDED_MAX;
+	}
+
+	if (delta < min || delta > max)
+		return false;
+
+	return true;
+}
+
 static int mlx5_set_mtutc(struct mlx5_core_dev *dev, u32 *mtutc, u32 size)
 {
 	u32 out[MLX5_ST_SZ_DW(mtutc_reg)] = {};
@@ -288,8 +311,8 @@ static int mlx5_ptp_adjtime_real_time(struct mlx5_core_dev *mdev, s64 delta)
 	if (!mlx5_modify_mtutc_allowed(mdev))
 		return 0;
 
-	/* HW time adjustment range is s16. If out of range, settime instead */
-	if (delta < S16_MIN || delta > S16_MAX) {
+	/* HW time adjustment range is checked. If out of range, settime instead */
+	if (!mlx5_is_mtutc_time_adj_cap(mdev, delta)) {
 		struct timespec64 ts;
 		s64 ns;
 
@@ -328,7 +351,12 @@ static int mlx5_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
 
 static int mlx5_ptp_adjphase(struct ptp_clock_info *ptp, s32 delta)
 {
-	if (delta < S16_MIN || delta > S16_MAX)
+	struct mlx5_clock *clock = container_of(ptp, struct mlx5_clock, ptp_info);
+	struct mlx5_core_dev *mdev;
+
+	mdev = container_of(clock, struct mlx5_core_dev, clock);
+
+	if (!mlx5_is_mtutc_time_adj_cap(mdev, delta))
 		return -ERANGE;
 
 	return mlx5_ptp_adjtime(ptp, delta);
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index a84bdeeed2c6..0b102c651fe2 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -9941,7 +9941,9 @@ struct mlx5_ifc_pcam_reg_bits {
 };
 
 struct mlx5_ifc_mcam_enhanced_features_bits {
-	u8         reserved_at_0[0x5d];
+	u8         reserved_at_0[0x51];
+	u8         mtutc_time_adjustment_extended_range[0x1];
+	u8         reserved_at_52[0xb];
 	u8         mcia_32dwords[0x1];
 	u8         out_pulse_duration_ns[0x1];
 	u8         npps_period[0x1];
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 05/15] net/mlx5: E-switch, Remove redundant comment about meta rules
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (3 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 04/15] net/mlx5: Add hardware extended range support for PTP adjtime and adjphase Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:40   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 06/15] net/mlx5e: Fail with messages when params are not valid for XSK Saeed Mahameed
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman

From: Roi Dayan <roid@nvidia.com>

Meta rules are created/destroyed per vport and not in eswitch
init/destroy.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index bbb6dab3b21f..05a352d8d13f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1406,9 +1406,7 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
 	mlx5_eswitch_unload_vf_vports(esw, esw->esw_funcs.num_vfs);
 	if (clear_vf)
 		mlx5_eswitch_clear_vf_vports_info(esw);
-	/* If disabling sriov in switchdev mode, free meta rules here
-	 * because it depends on num_vfs.
-	 */
+
 	if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
 		struct devlink *devlink = priv_to_devlink(esw->dev);
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 06/15] net/mlx5e: Fail with messages when params are not valid for XSK
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (4 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 05/15] net/mlx5: E-switch, Remove redundant comment about meta rules Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:41   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 07/15] net/mlx5e: Add warning when log WQE size is smaller than log stride size Saeed Mahameed
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Adham Faris

From: Adham Faris <afaris@nvidia.com>

Current XSK prerequisites validation implementation
(setup.c/mlx5e_validate_xsk_param()) fails silently when xsk
prerequisites are not fulfilled.
Add error messages to the kernel log to help the user understand what
went wrong when params are not valid for XSK.

Signed-off-by: Adham Faris <afaris@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/params.c   |  9 +++++++--
 .../mellanox/mlx5/core/en/xsk/setup.c         | 19 +++++++++++++++++--
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 585bdc8383ee..a17b768b81f1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -581,11 +581,16 @@ int mlx5e_mpwrq_validate_xsk(struct mlx5_core_dev *mdev, struct mlx5e_params *pa
 	bool unaligned = xsk ? xsk->unaligned : false;
 	u16 max_mtu_pkts;
 
-	if (!mlx5e_check_fragmented_striding_rq_cap(mdev, page_shift, umr_mode))
+	if (!mlx5e_check_fragmented_striding_rq_cap(mdev, page_shift, umr_mode)) {
+		mlx5_core_err(mdev, "Striding RQ for XSK can't be activated with page_shift %u and umr_mode %d\n",
+			      page_shift, umr_mode);
 		return -EOPNOTSUPP;
+	}
 
-	if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk))
+	if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk)) {
+		mlx5_core_err(mdev, "Striding RQ linear mode for XSK can't be activated with current params\n");
 		return -EINVAL;
+	}
 
 	/* Current RQ length is too big for the given frame size, the
 	 * needed number of WQEs exceeds the maximum.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
index ff03c43833bb..81a567e17264 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
@@ -7,6 +7,18 @@
 #include "en/health.h"
 #include <net/xdp_sock_drv.h>
 
+static int mlx5e_legacy_rq_validate_xsk(struct mlx5_core_dev *mdev,
+					struct mlx5e_params *params,
+					struct mlx5e_xsk_param *xsk)
+{
+	if (!mlx5e_rx_is_linear_skb(mdev, params, xsk)) {
+		mlx5_core_err(mdev, "Legacy RQ linear mode for XSK can't be activated with current params\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 /* The limitation of 2048 can be altered, but shouldn't go beyond the minimal
  * stride size of striding RQ.
  */
@@ -17,8 +29,11 @@ bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
 			      struct mlx5_core_dev *mdev)
 {
 	/* AF_XDP doesn't support frames larger than PAGE_SIZE. */
-	if (xsk->chunk_size > PAGE_SIZE || xsk->chunk_size < MLX5E_MIN_XSK_CHUNK_SIZE)
+	if (xsk->chunk_size > PAGE_SIZE || xsk->chunk_size < MLX5E_MIN_XSK_CHUNK_SIZE) {
+		mlx5_core_err(mdev, "XSK chunk size %u out of bounds [%u, %lu]\n", xsk->chunk_size,
+			      MLX5E_MIN_XSK_CHUNK_SIZE, PAGE_SIZE);
 		return false;
+	}
 
 	/* frag_sz is different for regular and XSK RQs, so ensure that linear
 	 * SKB mode is possible.
@@ -27,7 +42,7 @@ bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		return !mlx5e_mpwrq_validate_xsk(mdev, params, xsk);
 	default: /* MLX5_WQ_TYPE_CYCLIC */
-		return mlx5e_rx_is_linear_skb(mdev, params, xsk);
+		return !mlx5e_legacy_rq_validate_xsk(mdev, params, xsk);
 	}
 }
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 07/15] net/mlx5e: Add warning when log WQE size is smaller than log stride size
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (5 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 06/15] net/mlx5e: Fail with messages when params are not valid for XSK Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:42   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 08/15] net/mlx5e: TC, Pass flow attr to attach/detach mod hdr functions Saeed Mahameed
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Adham Faris

From: Adham Faris <afaris@nvidia.com>

Add warning macro in the function mlx5e_mpwqe_get_log_num_strides()
when log WQE size is smaller than log stride size. Theoretically this
should not happen.

Signed-off-by: Adham Faris <afaris@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/params.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index a17b768b81f1..53d2979e9457 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -411,9 +411,14 @@ u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
 {
 	enum mlx5e_mpwrq_umr_mode umr_mode = mlx5e_mpwrq_umr_mode(mdev, xsk);
 	u8 page_shift = mlx5e_mpwrq_page_shift(mdev, xsk);
-
-	return mlx5e_mpwrq_log_wqe_sz(mdev, page_shift, umr_mode) -
-		mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk);
+	u8 log_wqe_size, log_stride_size;
+
+	log_wqe_size = mlx5e_mpwrq_log_wqe_sz(mdev, page_shift, umr_mode);
+	log_stride_size = mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk);
+	WARN(log_wqe_size < log_stride_size,
+	     "Log WQE size %u < log stride size %u (page shift %u, umr mode %d, xsk on? %d)\n",
+	     log_wqe_size, log_stride_size, page_shift, umr_mode, !!xsk);
+	return log_wqe_size - log_stride_size;
 }
 
 u8 mlx5e_mpwqe_get_min_wqe_bulk(unsigned int wq_sz)
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 08/15] net/mlx5e: TC, Pass flow attr to attach/detach mod hdr functions
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (6 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 07/15] net/mlx5e: Add warning when log WQE size is smaller than log stride size Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:42   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 09/15] net/mlx5e: TC, Add tc prefix to attach/detach " Saeed Mahameed
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman

From: Roi Dayan <roid@nvidia.com>

In preparation to remove duplicate functions handling mod hdr allocation
and the fact that modify hdr should be per flow attr and not flow
pass flow attr to the attach and detach mod hdr funcs.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/tc_priv.h  |  2 -
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 41 ++++++++++---------
 .../net/ethernet/mellanox/mlx5/core/en_tc.h   |  2 +
 3 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_priv.h b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_priv.h
index 2b7fd1c0e643..f575646d2f50 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_priv.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_priv.h
@@ -95,8 +95,6 @@ struct mlx5e_tc_flow {
 	 */
 	struct encap_flow_item encaps[MLX5_MAX_FLOW_FWD_VPORTS];
 	struct mlx5e_tc_flow *peer_flow;
-	struct mlx5e_mod_hdr_handle *mh; /* attached mod header instance */
-	struct mlx5e_mod_hdr_handle *slow_mh; /* attached mod header instance for slow path */
 	struct mlx5e_hairpin_entry *hpe; /* attached hairpin instance */
 	struct list_head hairpin; /* flows sharing the same hairpin */
 	struct list_head peer;    /* flows with peer flow */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 0c04a5e7c274..455c178f4115 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -648,34 +648,34 @@ get_mod_hdr_table(struct mlx5e_priv *priv, struct mlx5e_tc_flow *flow)
 
 static int mlx5e_attach_mod_hdr(struct mlx5e_priv *priv,
 				struct mlx5e_tc_flow *flow,
-				struct mlx5e_tc_flow_parse_attr *parse_attr)
+				struct mlx5_flow_attr *attr)
 {
-	struct mlx5_modify_hdr *modify_hdr;
 	struct mlx5e_mod_hdr_handle *mh;
 
 	mh = mlx5e_mod_hdr_attach(priv->mdev, get_mod_hdr_table(priv, flow),
 				  mlx5e_get_flow_namespace(flow),
-				  &parse_attr->mod_hdr_acts);
+				  &attr->parse_attr->mod_hdr_acts);
 	if (IS_ERR(mh))
 		return PTR_ERR(mh);
 
-	modify_hdr = mlx5e_mod_hdr_get(mh);
-	flow->attr->modify_hdr = modify_hdr;
-	flow->mh = mh;
+	WARN_ON(attr->modify_hdr);
+	attr->modify_hdr = mlx5e_mod_hdr_get(mh);
+	attr->mh = mh;
 
 	return 0;
 }
 
 static void mlx5e_detach_mod_hdr(struct mlx5e_priv *priv,
-				 struct mlx5e_tc_flow *flow)
+				 struct mlx5e_tc_flow *flow,
+				 struct mlx5_flow_attr *attr)
 {
 	/* flow wasn't fully initialized */
-	if (!flow->mh)
+	if (!attr->mh)
 		return;
 
 	mlx5e_mod_hdr_detach(priv->mdev, get_mod_hdr_table(priv, flow),
-			     flow->mh);
-	flow->mh = NULL;
+			     attr->mh);
+	attr->mh = NULL;
 }
 
 static
@@ -1433,7 +1433,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
 	}
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
-		err = mlx5e_attach_mod_hdr(priv, flow, parse_attr);
+		err = mlx5e_attach_mod_hdr(priv, flow, attr);
 		if (err)
 			return err;
 	}
@@ -1493,7 +1493,7 @@ static void mlx5e_tc_del_nic_flow(struct mlx5e_priv *priv,
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
 		mlx5e_mod_hdr_dealloc(&attr->parse_attr->mod_hdr_acts);
-		mlx5e_detach_mod_hdr(priv, flow);
+		mlx5e_detach_mod_hdr(priv, flow, attr);
 	}
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_COUNT)
@@ -1604,7 +1604,7 @@ mlx5e_tc_offload_to_slow_path(struct mlx5_eswitch *esw,
 		goto err_offload;
 	}
 
-	flow->slow_mh = mh;
+	flow->attr->slow_mh = mh;
 	flow->chain_mapping = chain_mapping;
 	flow_flag_set(flow, SLOW);
 
@@ -1629,6 +1629,7 @@ mlx5e_tc_offload_to_slow_path(struct mlx5_eswitch *esw,
 void mlx5e_tc_unoffload_from_slow_path(struct mlx5_eswitch *esw,
 				       struct mlx5e_tc_flow *flow)
 {
+	struct mlx5e_mod_hdr_handle *slow_mh = flow->attr->slow_mh;
 	struct mlx5_flow_attr *slow_attr;
 
 	slow_attr = mlx5_alloc_flow_attr(MLX5_FLOW_NAMESPACE_FDB);
@@ -1641,16 +1642,16 @@ void mlx5e_tc_unoffload_from_slow_path(struct mlx5_eswitch *esw,
 	slow_attr->action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
 	slow_attr->esw_attr->split_count = 0;
 	slow_attr->flags |= MLX5_ATTR_FLAG_SLOW_PATH;
-	if (flow->slow_mh) {
+	if (slow_mh) {
 		slow_attr->action |= MLX5_FLOW_CONTEXT_ACTION_MOD_HDR;
-		slow_attr->modify_hdr = mlx5e_mod_hdr_get(flow->slow_mh);
+		slow_attr->modify_hdr = mlx5e_mod_hdr_get(slow_mh);
 	}
 	mlx5e_tc_unoffload_fdb_rules(esw, flow, slow_attr);
-	if (flow->slow_mh) {
-		mlx5e_mod_hdr_detach(esw->dev, get_mod_hdr_table(flow->priv, flow), flow->slow_mh);
+	if (slow_mh) {
+		mlx5e_mod_hdr_detach(esw->dev, get_mod_hdr_table(flow->priv, flow), slow_mh);
 		mlx5_chains_put_chain_mapping(esw_chains(esw), flow->chain_mapping);
 		flow->chain_mapping = 0;
-		flow->slow_mh = NULL;
+		flow->attr->slow_mh = NULL;
 	}
 	flow_flag_clear(flow, SLOW);
 	kfree(slow_attr);
@@ -1927,7 +1928,7 @@ post_process_attr(struct mlx5e_tc_flow *flow,
 			if (err)
 				goto err_out;
 		} else {
-			err = mlx5e_attach_mod_hdr(flow->priv, flow, attr->parse_attr);
+			err = mlx5e_attach_mod_hdr(flow->priv, flow, attr);
 			if (err)
 				goto err_out;
 		}
@@ -2144,7 +2145,7 @@ static void mlx5e_tc_del_fdb_flow(struct mlx5e_priv *priv,
 		if (vf_tun && attr->modify_hdr)
 			mlx5_modify_header_dealloc(priv->mdev, attr->modify_hdr);
 		else
-			mlx5e_detach_mod_hdr(priv, flow);
+			mlx5e_detach_mod_hdr(priv, flow, attr);
 	}
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_COUNT)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
index 50af70ef22f3..01d76ff67315 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
@@ -71,6 +71,8 @@ struct mlx5_flow_attr {
 	u32 action;
 	struct mlx5_fc *counter;
 	struct mlx5_modify_hdr *modify_hdr;
+	struct mlx5e_mod_hdr_handle *mh; /* attached mod header instance */
+	struct mlx5e_mod_hdr_handle *slow_mh; /* attached mod header instance for slow path */
 	struct mlx5_ct_attr ct_attr;
 	struct mlx5e_sample_attr sample_attr;
 	struct mlx5e_meter_attr meter_attr;
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 09/15] net/mlx5e: TC, Add tc prefix to attach/detach hdr functions
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (7 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 08/15] net/mlx5e: TC, Pass flow attr to attach/detach mod hdr functions Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:44   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 10/15] net/mlx5e: TC, Use common function allocating flow mod hdr or encap mod hdr Saeed Mahameed
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman

From: Roi Dayan <roid@nvidia.com>

Currently there are confusing names for attach/detach functions.

mlx5e_attach_mod_hdr() vs mlx5e_mod_hdr_attach()
mlx5e_detach_mod_hdr() vs mlx5e_mod_hdr_detach()

Add tc prefix to the functions that are in en_tc.c to separate
from the functions in mod_hdr.c which has the mod_hdr prefix.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 20 +++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 455c178f4115..6a15d3e1741e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -646,9 +646,9 @@ get_mod_hdr_table(struct mlx5e_priv *priv, struct mlx5e_tc_flow *flow)
 		&tc->mod_hdr;
 }
 
-static int mlx5e_attach_mod_hdr(struct mlx5e_priv *priv,
-				struct mlx5e_tc_flow *flow,
-				struct mlx5_flow_attr *attr)
+static int mlx5e_tc_attach_mod_hdr(struct mlx5e_priv *priv,
+				   struct mlx5e_tc_flow *flow,
+				   struct mlx5_flow_attr *attr)
 {
 	struct mlx5e_mod_hdr_handle *mh;
 
@@ -665,9 +665,9 @@ static int mlx5e_attach_mod_hdr(struct mlx5e_priv *priv,
 	return 0;
 }
 
-static void mlx5e_detach_mod_hdr(struct mlx5e_priv *priv,
-				 struct mlx5e_tc_flow *flow,
-				 struct mlx5_flow_attr *attr)
+static void mlx5e_tc_detach_mod_hdr(struct mlx5e_priv *priv,
+				    struct mlx5e_tc_flow *flow,
+				    struct mlx5_flow_attr *attr)
 {
 	/* flow wasn't fully initialized */
 	if (!attr->mh)
@@ -1433,7 +1433,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
 	}
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
-		err = mlx5e_attach_mod_hdr(priv, flow, attr);
+		err = mlx5e_tc_attach_mod_hdr(priv, flow, attr);
 		if (err)
 			return err;
 	}
@@ -1493,7 +1493,7 @@ static void mlx5e_tc_del_nic_flow(struct mlx5e_priv *priv,
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
 		mlx5e_mod_hdr_dealloc(&attr->parse_attr->mod_hdr_acts);
-		mlx5e_detach_mod_hdr(priv, flow, attr);
+		mlx5e_tc_detach_mod_hdr(priv, flow, attr);
 	}
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_COUNT)
@@ -1928,7 +1928,7 @@ post_process_attr(struct mlx5e_tc_flow *flow,
 			if (err)
 				goto err_out;
 		} else {
-			err = mlx5e_attach_mod_hdr(flow->priv, flow, attr);
+			err = mlx5e_tc_attach_mod_hdr(flow->priv, flow, attr);
 			if (err)
 				goto err_out;
 		}
@@ -2145,7 +2145,7 @@ static void mlx5e_tc_del_fdb_flow(struct mlx5e_priv *priv,
 		if (vf_tun && attr->modify_hdr)
 			mlx5_modify_header_dealloc(priv->mdev, attr->modify_hdr);
 		else
-			mlx5e_detach_mod_hdr(priv, flow, attr);
+			mlx5e_tc_detach_mod_hdr(priv, flow, attr);
 	}
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_COUNT)
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 10/15] net/mlx5e: TC, Use common function allocating flow mod hdr or encap mod hdr
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (8 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 09/15] net/mlx5e: TC, Add tc prefix to attach/detach " Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:45   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 11/15] net/mlx5e: Warn when destroying mod hdr hash table that is not empty Saeed Mahameed
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman

From: Roi Dayan <roid@nvidia.com>

Use mlx5e_tc_attach_mod_hdr() when allocating encap mod hdr and
remove mlx5e_tc_add_flow_mod_hdr() which is not being used now.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../mellanox/mlx5/core/en/tc_tun_encap.c      |  5 +-
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 61 +++++--------------
 .../net/ethernet/mellanox/mlx5/core/en_tc.h   | 10 ++-
 3 files changed, 25 insertions(+), 51 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c
index 2aaf8ab857b8..780224fd67a1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c
@@ -1349,7 +1349,8 @@ static void mlx5e_invalidate_encap(struct mlx5e_priv *priv,
 			mlx5e_tc_unoffload_from_slow_path(esw, flow);
 		else
 			mlx5e_tc_unoffload_fdb_rules(esw, flow, flow->attr);
-		mlx5_modify_header_dealloc(priv->mdev, attr->modify_hdr);
+
+		mlx5e_tc_detach_mod_hdr(priv, flow, attr);
 		attr->modify_hdr = NULL;
 
 		esw_attr->dests[flow->tmp_entry_index].flags &=
@@ -1405,7 +1406,7 @@ static void mlx5e_reoffload_encap(struct mlx5e_priv *priv,
 			continue;
 		}
 
-		err = mlx5e_tc_add_flow_mod_hdr(priv, flow, attr);
+		err = mlx5e_tc_attach_mod_hdr(priv, flow, attr);
 		if (err) {
 			mlx5_core_warn(priv->mdev, "Failed to update flow mod_hdr err=%d",
 				       err);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 6a15d3e1741e..c978f4b12131 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -646,9 +646,9 @@ get_mod_hdr_table(struct mlx5e_priv *priv, struct mlx5e_tc_flow *flow)
 		&tc->mod_hdr;
 }
 
-static int mlx5e_tc_attach_mod_hdr(struct mlx5e_priv *priv,
-				   struct mlx5e_tc_flow *flow,
-				   struct mlx5_flow_attr *attr)
+int mlx5e_tc_attach_mod_hdr(struct mlx5e_priv *priv,
+			    struct mlx5e_tc_flow *flow,
+			    struct mlx5_flow_attr *attr)
 {
 	struct mlx5e_mod_hdr_handle *mh;
 
@@ -665,9 +665,9 @@ static int mlx5e_tc_attach_mod_hdr(struct mlx5e_priv *priv,
 	return 0;
 }
 
-static void mlx5e_tc_detach_mod_hdr(struct mlx5e_priv *priv,
-				    struct mlx5e_tc_flow *flow,
-				    struct mlx5_flow_attr *attr)
+void mlx5e_tc_detach_mod_hdr(struct mlx5e_priv *priv,
+			     struct mlx5e_tc_flow *flow,
+			     struct mlx5_flow_attr *attr)
 {
 	/* flow wasn't fully initialized */
 	if (!attr->mh)
@@ -1762,26 +1762,6 @@ int mlx5e_tc_query_route_vport(struct net_device *out_dev, struct net_device *ro
 	return err;
 }
 
-int mlx5e_tc_add_flow_mod_hdr(struct mlx5e_priv *priv,
-			      struct mlx5e_tc_flow *flow,
-			      struct mlx5_flow_attr *attr)
-{
-	struct mlx5e_tc_mod_hdr_acts *mod_hdr_acts = &attr->parse_attr->mod_hdr_acts;
-	struct mlx5_modify_hdr *mod_hdr;
-
-	mod_hdr = mlx5_modify_header_alloc(priv->mdev,
-					   mlx5e_get_flow_namespace(flow),
-					   mod_hdr_acts->num_actions,
-					   mod_hdr_acts->actions);
-	if (IS_ERR(mod_hdr))
-		return PTR_ERR(mod_hdr);
-
-	WARN_ON(attr->modify_hdr);
-	attr->modify_hdr = mod_hdr;
-
-	return 0;
-}
-
 static int
 set_encap_dests(struct mlx5e_priv *priv,
 		struct mlx5e_tc_flow *flow,
@@ -1901,7 +1881,6 @@ verify_attr_actions(u32 actions, struct netlink_ext_ack *extack)
 static int
 post_process_attr(struct mlx5e_tc_flow *flow,
 		  struct mlx5_flow_attr *attr,
-		  bool is_post_act_attr,
 		  struct netlink_ext_ack *extack)
 {
 	struct mlx5_eswitch *esw = flow->priv->mdev->priv.eswitch;
@@ -1923,27 +1902,21 @@ post_process_attr(struct mlx5e_tc_flow *flow,
 	}
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
-		if (vf_tun || is_post_act_attr) {
-			err = mlx5e_tc_add_flow_mod_hdr(flow->priv, flow, attr);
-			if (err)
-				goto err_out;
-		} else {
-			err = mlx5e_tc_attach_mod_hdr(flow->priv, flow, attr);
-			if (err)
-				goto err_out;
-		}
+		err = mlx5e_tc_attach_mod_hdr(flow->priv, flow, attr);
+		if (err)
+			goto err_out;
 	}
 
 	if (attr->branch_true &&
 	    attr->branch_true->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
-		err = mlx5e_tc_add_flow_mod_hdr(flow->priv, flow, attr->branch_true);
+		err = mlx5e_tc_attach_mod_hdr(flow->priv, flow, attr->branch_true);
 		if (err)
 			goto err_out;
 	}
 
 	if (attr->branch_false &&
 	    attr->branch_false->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
-		err = mlx5e_tc_add_flow_mod_hdr(flow->priv, flow, attr->branch_false);
+		err = mlx5e_tc_attach_mod_hdr(flow->priv, flow, attr->branch_false);
 		if (err)
 			goto err_out;
 	}
@@ -2057,7 +2030,7 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
 		esw_attr->int_port = int_port;
 	}
 
-	err = post_process_attr(flow, attr, false, extack);
+	err = post_process_attr(flow, attr, extack);
 	if (err)
 		goto err_out;
 
@@ -2142,10 +2115,7 @@ static void mlx5e_tc_del_fdb_flow(struct mlx5e_priv *priv,
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
 		mlx5e_mod_hdr_dealloc(&attr->parse_attr->mod_hdr_acts);
-		if (vf_tun && attr->modify_hdr)
-			mlx5_modify_header_dealloc(priv->mdev, attr->modify_hdr);
-		else
-			mlx5e_tc_detach_mod_hdr(priv, flow, attr);
+		mlx5e_tc_detach_mod_hdr(priv, flow, attr);
 	}
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_COUNT)
@@ -3964,7 +3934,7 @@ alloc_flow_post_acts(struct mlx5e_tc_flow *flow, struct netlink_ext_ack *extack)
 		if (err)
 			goto out_free;
 
-		err = post_process_attr(flow, attr, true, extack);
+		err = post_process_attr(flow, attr, extack);
 		if (err)
 			goto out_free;
 
@@ -4531,8 +4501,7 @@ mlx5_free_flow_attr(struct mlx5e_tc_flow *flow, struct mlx5_flow_attr *attr)
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
 		mlx5e_mod_hdr_dealloc(&attr->parse_attr->mod_hdr_acts);
-		if (attr->modify_hdr)
-			mlx5_modify_header_dealloc(flow->priv->mdev, attr->modify_hdr);
+		mlx5e_tc_detach_mod_hdr(flow->priv, flow, attr);
 	}
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
index 01d76ff67315..66e8d9d89082 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
@@ -287,9 +287,13 @@ int mlx5e_tc_match_to_reg_set_and_get_id(struct mlx5_core_dev *mdev,
 					 enum mlx5e_tc_attr_to_reg type,
 					 u32 data);
 
-int mlx5e_tc_add_flow_mod_hdr(struct mlx5e_priv *priv,
-			      struct mlx5e_tc_flow *flow,
-			      struct mlx5_flow_attr *attr);
+int mlx5e_tc_attach_mod_hdr(struct mlx5e_priv *priv,
+			    struct mlx5e_tc_flow *flow,
+			    struct mlx5_flow_attr *attr);
+
+void mlx5e_tc_detach_mod_hdr(struct mlx5e_priv *priv,
+			     struct mlx5e_tc_flow *flow,
+			     struct mlx5_flow_attr *attr);
 
 void mlx5e_tc_set_ethertype(struct mlx5_core_dev *mdev,
 			    struct flow_match_basic *match, bool outer,
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 11/15] net/mlx5e: Warn when destroying mod hdr hash table that is not empty
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (9 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 10/15] net/mlx5e: TC, Use common function allocating flow mod hdr or encap mod hdr Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:45   ` Jacob Keller
  2023-01-18 18:35 ` [net-next 12/15] net/mlx5: E-Switch, Fix typo for egress Saeed Mahameed
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman

From: Roi Dayan <roid@nvidia.com>

To avoid memory leaks add a warn when destroying mod hdr hash table
but the hash table is not empty.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c b/drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c
index 17325c5d6516..cf60f0a3ff23 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c
@@ -47,6 +47,7 @@ void mlx5e_mod_hdr_tbl_init(struct mod_hdr_tbl *tbl)
 
 void mlx5e_mod_hdr_tbl_destroy(struct mod_hdr_tbl *tbl)
 {
+	WARN_ON(!hash_empty(tbl->hlist));
 	mutex_destroy(&tbl->lock);
 }
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 12/15] net/mlx5: E-Switch, Fix typo for egress
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (10 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 11/15] net/mlx5e: Warn when destroying mod hdr hash table that is not empty Saeed Mahameed
@ 2023-01-18 18:35 ` Saeed Mahameed
  2023-01-18 21:46   ` Jacob Keller
  2023-01-18 18:36 ` [net-next 13/15] net/mlx5e: Support Geneve and GRE with VF tunnel offload Saeed Mahameed
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:35 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan

From: Roi Dayan <roid@nvidia.com>

Fix engress to egress.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 05a352d8d13f..6f11b46ee79a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1250,7 +1250,7 @@ static int mlx5_esw_acls_ns_init(struct mlx5_eswitch *esw)
 		if (err)
 			return err;
 	} else {
-		esw_warn(dev, "engress ACL is not supported by FW\n");
+		esw_warn(dev, "egress ACL is not supported by FW\n");
 	}
 
 	if (MLX5_CAP_ESW_INGRESS_ACL(dev, ft_support)) {
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 13/15] net/mlx5e: Support Geneve and GRE with VF tunnel offload
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (11 preceding siblings ...)
  2023-01-18 18:35 ` [net-next 12/15] net/mlx5: E-Switch, Fix typo for egress Saeed Mahameed
@ 2023-01-18 18:36 ` Saeed Mahameed
  2023-01-18 21:48   ` Jacob Keller
  2023-01-18 18:36 ` [net-next 14/15] net/mlx5e: Remove redundant allocation of spec in create indirect fwd group Saeed Mahameed
  2023-01-18 18:36 ` [net-next 15/15] net/mlx5e: Use read lock for eswitch get callbacks Saeed Mahameed
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:36 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Maor Dickman, Paul Blakey,
	Roi Dayan

From: Maor Dickman <maord@nvidia.com>

Today VF tunnel offload (tunnel endpoint is on VF) is implemented
by indirect table which use rules that match on VXLAN VNI to
recirculated to root table, this limit the support for only
VXLAN tunnels.

This patch change indirect table to use one single match all rule
to recirculated to root table which is added when any tunnel decap
rule is added with tunnel endpoint is VF. This allow support of
Geneve and GRE with this configuration.

Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/tc_tun.c   |   2 -
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   9 +-
 .../net/ethernet/mellanox/mlx5/core/en_tc.h   |   2 -
 .../mellanox/mlx5/core/esw/indir_table.c      | 203 +++---------------
 .../mellanox/mlx5/core/esw/indir_table.h      |   4 -
 .../mellanox/mlx5/core/eswitch_offloads.c     |  23 +-
 6 files changed, 48 insertions(+), 195 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
index e6f64d890fb3..83bb0811e774 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
@@ -745,8 +745,6 @@ int mlx5e_tc_tun_route_lookup(struct mlx5e_priv *priv,
 		if (err)
 			goto out;
 
-		esw_attr->rx_tun_attr->vni = MLX5_GET(fte_match_param, spec->match_value,
-						      misc_parameters.vxlan_vni);
 		esw_attr->rx_tun_attr->decap_vport = vport_num;
 	} else if (netif_is_ovs_master(attr.route_dev) && mlx5e_tc_int_port_supported(esw)) {
 		int_port = mlx5e_tc_int_port_get(mlx5e_get_int_port_priv(priv),
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index c978f4b12131..c8377b4c8c8e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2595,13 +2595,13 @@ static int parse_tunnel_attr(struct mlx5e_priv *priv,
 		err = mlx5e_tc_set_attr_rx_tun(flow, spec);
 		if (err)
 			return err;
-	} else if (tunnel && tunnel->tunnel_type == MLX5E_TC_TUNNEL_TYPE_VXLAN) {
+	} else if (tunnel) {
 		struct mlx5_flow_spec *tmp_spec;
 
 		tmp_spec = kvzalloc(sizeof(*tmp_spec), GFP_KERNEL);
 		if (!tmp_spec) {
-			NL_SET_ERR_MSG_MOD(extack, "Failed to allocate memory for vxlan tmp spec");
-			netdev_warn(priv->netdev, "Failed to allocate memory for vxlan tmp spec");
+			NL_SET_ERR_MSG_MOD(extack, "Failed to allocate memory for tunnel tmp spec");
+			netdev_warn(priv->netdev, "Failed to allocate memory for tunnel tmp spec");
 			return -ENOMEM;
 		}
 		memcpy(tmp_spec, spec, sizeof(*tmp_spec));
@@ -4623,9 +4623,6 @@ __mlx5e_add_fdb_flow(struct mlx5e_priv *priv,
 	if (err)
 		goto err_free;
 
-	/* always set IP version for indirect table handling */
-	flow->attr->ip_version = mlx5e_tc_get_ip_version(&parse_attr->spec, true);
-
 	err = parse_tc_fdb_actions(priv, &rule->action, flow, extack);
 	if (err)
 		goto err_free;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
index 66e8d9d89082..ce516dc7f3fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
@@ -84,7 +84,6 @@ struct mlx5_flow_attr {
 	struct mlx5_flow_table *dest_ft;
 	u8 inner_match_level;
 	u8 outer_match_level;
-	u8 ip_version;
 	u8 tun_ip_version;
 	int tunnel_id; /* mapped tunnel id */
 	u32 flags;
@@ -136,7 +135,6 @@ struct mlx5_rx_tun_attr {
 		__be32 v4;
 		struct in6_addr v6;
 	} dst_ip; /* Valid if decap_vport is not zero */
-	u32 vni;
 };
 
 #define MLX5E_TC_TABLE_CHAIN_TAG_BITS 16
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
index c9a91158e99c..8a94870c5b43 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
@@ -16,18 +16,12 @@
 #include "lib/fs_chains.h"
 #include "en/mod_hdr.h"
 
-#define MLX5_ESW_INDIR_TABLE_SIZE 128
-#define MLX5_ESW_INDIR_TABLE_RECIRC_IDX_MAX (MLX5_ESW_INDIR_TABLE_SIZE - 2)
+#define MLX5_ESW_INDIR_TABLE_SIZE 2
+#define MLX5_ESW_INDIR_TABLE_RECIRC_IDX (MLX5_ESW_INDIR_TABLE_SIZE - 2)
 #define MLX5_ESW_INDIR_TABLE_FWD_IDX (MLX5_ESW_INDIR_TABLE_SIZE - 1)
 
 struct mlx5_esw_indir_table_rule {
-	struct list_head list;
 	struct mlx5_flow_handle *handle;
-	union {
-		__be32 v4;
-		struct in6_addr v6;
-	} dst_ip;
-	u32 vni;
 	struct mlx5_modify_hdr *mh;
 	refcount_t refcnt;
 };
@@ -38,12 +32,10 @@ struct mlx5_esw_indir_table_entry {
 	struct mlx5_flow_group *recirc_grp;
 	struct mlx5_flow_group *fwd_grp;
 	struct mlx5_flow_handle *fwd_rule;
-	struct list_head recirc_rules;
-	int recirc_cnt;
+	struct mlx5_esw_indir_table_rule *recirc_rule;
 	int fwd_ref;
 
 	u16 vport;
-	u8 ip_version;
 };
 
 struct mlx5_esw_indir_table {
@@ -89,7 +81,6 @@ mlx5_esw_indir_table_needed(struct mlx5_eswitch *esw,
 	return esw_attr->in_rep->vport == MLX5_VPORT_UPLINK &&
 		vf_sf_vport &&
 		esw->dev == dest_mdev &&
-		attr->ip_version &&
 		attr->flags & MLX5_ATTR_FLAG_SRC_REWRITE;
 }
 
@@ -101,27 +92,8 @@ mlx5_esw_indir_table_decap_vport(struct mlx5_flow_attr *attr)
 	return esw_attr->rx_tun_attr ? esw_attr->rx_tun_attr->decap_vport : 0;
 }
 
-static struct mlx5_esw_indir_table_rule *
-mlx5_esw_indir_table_rule_lookup(struct mlx5_esw_indir_table_entry *e,
-				 struct mlx5_esw_flow_attr *attr)
-{
-	struct mlx5_esw_indir_table_rule *rule;
-
-	list_for_each_entry(rule, &e->recirc_rules, list)
-		if (rule->vni == attr->rx_tun_attr->vni &&
-		    !memcmp(&rule->dst_ip, &attr->rx_tun_attr->dst_ip,
-			    sizeof(attr->rx_tun_attr->dst_ip)))
-			goto found;
-	return NULL;
-
-found:
-	refcount_inc(&rule->refcnt);
-	return rule;
-}
-
 static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
 					 struct mlx5_flow_attr *attr,
-					 struct mlx5_flow_spec *spec,
 					 struct mlx5_esw_indir_table_entry *e)
 {
 	struct mlx5_esw_flow_attr *esw_attr = attr->esw_attr;
@@ -130,73 +102,18 @@ static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
 	struct mlx5_flow_destination dest = {};
 	struct mlx5_esw_indir_table_rule *rule;
 	struct mlx5_flow_act flow_act = {};
-	struct mlx5_flow_spec *rule_spec;
 	struct mlx5_flow_handle *handle;
 	int err = 0;
 	u32 data;
 
-	rule = mlx5_esw_indir_table_rule_lookup(e, esw_attr);
-	if (rule)
+	if (e->recirc_rule) {
+		refcount_inc(&e->recirc_rule->refcnt);
 		return 0;
-
-	if (e->recirc_cnt == MLX5_ESW_INDIR_TABLE_RECIRC_IDX_MAX)
-		return -EINVAL;
-
-	rule_spec = kvzalloc(sizeof(*rule_spec), GFP_KERNEL);
-	if (!rule_spec)
-		return -ENOMEM;
-
-	rule = kzalloc(sizeof(*rule), GFP_KERNEL);
-	if (!rule) {
-		err = -ENOMEM;
-		goto out;
 	}
 
-	rule_spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS |
-					   MLX5_MATCH_MISC_PARAMETERS |
-					   MLX5_MATCH_MISC_PARAMETERS_2;
-	if (MLX5_CAP_FLOWTABLE_NIC_RX(esw->dev, ft_field_support.outer_ip_version)) {
-		MLX5_SET(fte_match_param, rule_spec->match_criteria,
-			 outer_headers.ip_version, 0xf);
-		MLX5_SET(fte_match_param, rule_spec->match_value, outer_headers.ip_version,
-			 attr->ip_version);
-	} else if (attr->ip_version) {
-		MLX5_SET_TO_ONES(fte_match_param, rule_spec->match_criteria,
-				 outer_headers.ethertype);
-		MLX5_SET(fte_match_param, rule_spec->match_value, outer_headers.ethertype,
-			 (attr->ip_version == 4 ? ETH_P_IP : ETH_P_IPV6));
-	} else {
-		err = -EOPNOTSUPP;
-		goto err_ethertype;
-	}
-
-	if (attr->ip_version == 4) {
-		MLX5_SET_TO_ONES(fte_match_param, rule_spec->match_criteria,
-				 outer_headers.dst_ipv4_dst_ipv6.ipv4_layout.ipv4);
-		MLX5_SET(fte_match_param, rule_spec->match_value,
-			 outer_headers.dst_ipv4_dst_ipv6.ipv4_layout.ipv4,
-			 ntohl(esw_attr->rx_tun_attr->dst_ip.v4));
-	} else if (attr->ip_version == 6) {
-		int len = sizeof(struct in6_addr);
-
-		memset(MLX5_ADDR_OF(fte_match_param, rule_spec->match_criteria,
-				    outer_headers.dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
-		       0xff, len);
-		memcpy(MLX5_ADDR_OF(fte_match_param, rule_spec->match_value,
-				    outer_headers.dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
-		       &esw_attr->rx_tun_attr->dst_ip.v6, len);
-	}
-
-	MLX5_SET_TO_ONES(fte_match_param, rule_spec->match_criteria,
-			 misc_parameters.vxlan_vni);
-	MLX5_SET(fte_match_param, rule_spec->match_value, misc_parameters.vxlan_vni,
-		 MLX5_GET(fte_match_param, spec->match_value, misc_parameters.vxlan_vni));
-
-	MLX5_SET(fte_match_param, rule_spec->match_criteria,
-		 misc_parameters_2.metadata_reg_c_0, mlx5_eswitch_get_vport_metadata_mask());
-	MLX5_SET(fte_match_param, rule_spec->match_value, misc_parameters_2.metadata_reg_c_0,
-		 mlx5_eswitch_get_vport_metadata_for_match(esw_attr->in_mdev->priv.eswitch,
-							   MLX5_VPORT_UPLINK));
+	rule = kzalloc(sizeof(*rule), GFP_KERNEL);
+	if (!rule)
+		return -ENOMEM;
 
 	/* Modify flow source to recirculate packet */
 	data = mlx5_eswitch_get_vport_metadata_for_set(esw, esw_attr->rx_tun_attr->decap_vport);
@@ -219,13 +136,14 @@ static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
 
 	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST | MLX5_FLOW_CONTEXT_ACTION_MOD_HDR;
 	flow_act.flags = FLOW_ACT_IGNORE_FLOW_LEVEL | FLOW_ACT_NO_APPEND;
+	flow_act.fg = e->recirc_grp;
 	dest.type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
 	dest.ft = mlx5_chains_get_table(chains, 0, 1, 0);
 	if (IS_ERR(dest.ft)) {
 		err = PTR_ERR(dest.ft);
 		goto err_table;
 	}
-	handle = mlx5_add_flow_rules(e->ft, rule_spec, &flow_act, &dest, 1);
+	handle = mlx5_add_flow_rules(e->ft, NULL, &flow_act, &dest, 1);
 	if (IS_ERR(handle)) {
 		err = PTR_ERR(handle);
 		goto err_handle;
@@ -233,14 +151,10 @@ static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
 
 	mlx5e_mod_hdr_dealloc(&mod_acts);
 	rule->handle = handle;
-	rule->vni = esw_attr->rx_tun_attr->vni;
 	rule->mh = flow_act.modify_hdr;
-	memcpy(&rule->dst_ip, &esw_attr->rx_tun_attr->dst_ip,
-	       sizeof(esw_attr->rx_tun_attr->dst_ip));
 	refcount_set(&rule->refcnt, 1);
-	list_add(&rule->list, &e->recirc_rules);
-	e->recirc_cnt++;
-	goto out;
+	e->recirc_rule = rule;
+	return 0;
 
 err_handle:
 	mlx5_chains_put_table(chains, 0, 1, 0);
@@ -250,89 +164,44 @@ static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
 err_mod_hdr_regc1:
 	mlx5e_mod_hdr_dealloc(&mod_acts);
 err_mod_hdr_regc0:
-err_ethertype:
 	kfree(rule);
-out:
-	kvfree(rule_spec);
 	return err;
 }
 
 static void mlx5_esw_indir_table_rule_put(struct mlx5_eswitch *esw,
-					  struct mlx5_flow_attr *attr,
 					  struct mlx5_esw_indir_table_entry *e)
 {
-	struct mlx5_esw_flow_attr *esw_attr = attr->esw_attr;
+	struct mlx5_esw_indir_table_rule *rule = e->recirc_rule;
 	struct mlx5_fs_chains *chains = esw_chains(esw);
-	struct mlx5_esw_indir_table_rule *rule;
 
-	list_for_each_entry(rule, &e->recirc_rules, list)
-		if (rule->vni == esw_attr->rx_tun_attr->vni &&
-		    !memcmp(&rule->dst_ip, &esw_attr->rx_tun_attr->dst_ip,
-			    sizeof(esw_attr->rx_tun_attr->dst_ip)))
-			goto found;
-
-	return;
+	if (!rule)
+		return;
 
-found:
 	if (!refcount_dec_and_test(&rule->refcnt))
 		return;
 
 	mlx5_del_flow_rules(rule->handle);
 	mlx5_chains_put_table(chains, 0, 1, 0);
 	mlx5_modify_header_dealloc(esw->dev, rule->mh);
-	list_del(&rule->list);
 	kfree(rule);
-	e->recirc_cnt--;
+	e->recirc_rule = NULL;
 }
 
-static int mlx5_create_indir_recirc_group(struct mlx5_eswitch *esw,
-					  struct mlx5_flow_attr *attr,
-					  struct mlx5_flow_spec *spec,
-					  struct mlx5_esw_indir_table_entry *e)
+static int mlx5_create_indir_recirc_group(struct mlx5_esw_indir_table_entry *e)
 {
 	int err = 0, inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
-	u32 *in, *match;
+	u32 *in;
 
 	in = kvzalloc(inlen, GFP_KERNEL);
 	if (!in)
 		return -ENOMEM;
 
-	MLX5_SET(create_flow_group_in, in, match_criteria_enable, MLX5_MATCH_OUTER_HEADERS |
-		 MLX5_MATCH_MISC_PARAMETERS | MLX5_MATCH_MISC_PARAMETERS_2);
-	match = MLX5_ADDR_OF(create_flow_group_in, in, match_criteria);
-
-	if (MLX5_CAP_FLOWTABLE_NIC_RX(esw->dev, ft_field_support.outer_ip_version))
-		MLX5_SET(fte_match_param, match, outer_headers.ip_version, 0xf);
-	else
-		MLX5_SET_TO_ONES(fte_match_param, match, outer_headers.ethertype);
-
-	if (attr->ip_version == 4) {
-		MLX5_SET_TO_ONES(fte_match_param, match,
-				 outer_headers.dst_ipv4_dst_ipv6.ipv4_layout.ipv4);
-	} else if (attr->ip_version == 6) {
-		memset(MLX5_ADDR_OF(fte_match_param, match,
-				    outer_headers.dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
-		       0xff, sizeof(struct in6_addr));
-	} else {
-		err = -EOPNOTSUPP;
-		goto out;
-	}
-
-	MLX5_SET_TO_ONES(fte_match_param, match, misc_parameters.vxlan_vni);
-	MLX5_SET(fte_match_param, match, misc_parameters_2.metadata_reg_c_0,
-		 mlx5_eswitch_get_vport_metadata_mask());
 	MLX5_SET(create_flow_group_in, in, start_flow_index, 0);
-	MLX5_SET(create_flow_group_in, in, end_flow_index, MLX5_ESW_INDIR_TABLE_RECIRC_IDX_MAX);
+	MLX5_SET(create_flow_group_in, in, end_flow_index, MLX5_ESW_INDIR_TABLE_RECIRC_IDX);
 	e->recirc_grp = mlx5_create_flow_group(e->ft, in);
-	if (IS_ERR(e->recirc_grp)) {
+	if (IS_ERR(e->recirc_grp))
 		err = PTR_ERR(e->recirc_grp);
-		goto out;
-	}
 
-	INIT_LIST_HEAD(&e->recirc_rules);
-	e->recirc_cnt = 0;
-
-out:
 	kvfree(in);
 	return err;
 }
@@ -366,6 +235,7 @@ static int mlx5_create_indir_fwd_group(struct mlx5_eswitch *esw,
 	}
 
 	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
+	flow_act.fg = e->fwd_grp;
 	dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
 	dest.vport.num = e->vport;
 	dest.vport.vhca_id = MLX5_CAP_GEN(esw->dev, vhca_id);
@@ -384,7 +254,7 @@ static int mlx5_create_indir_fwd_group(struct mlx5_eswitch *esw,
 
 static struct mlx5_esw_indir_table_entry *
 mlx5_esw_indir_table_entry_create(struct mlx5_eswitch *esw, struct mlx5_flow_attr *attr,
-				  struct mlx5_flow_spec *spec, u16 vport, bool decap)
+				  u16 vport, bool decap)
 {
 	struct mlx5_flow_table_attr ft_attr = {};
 	struct mlx5_flow_namespace *root_ns;
@@ -412,15 +282,14 @@ mlx5_esw_indir_table_entry_create(struct mlx5_eswitch *esw, struct mlx5_flow_att
 	}
 	e->ft = ft;
 	e->vport = vport;
-	e->ip_version = attr->ip_version;
 	e->fwd_ref = !decap;
 
-	err = mlx5_create_indir_recirc_group(esw, attr, spec, e);
+	err = mlx5_create_indir_recirc_group(e);
 	if (err)
 		goto recirc_grp_err;
 
 	if (decap) {
-		err = mlx5_esw_indir_table_rule_get(esw, attr, spec, e);
+		err = mlx5_esw_indir_table_rule_get(esw, attr, e);
 		if (err)
 			goto recirc_rule_err;
 	}
@@ -430,13 +299,13 @@ mlx5_esw_indir_table_entry_create(struct mlx5_eswitch *esw, struct mlx5_flow_att
 		goto fwd_grp_err;
 
 	hash_add(esw->fdb_table.offloads.indir->table, &e->hlist,
-		 vport << 16 | attr->ip_version);
+		 vport << 16);
 
 	return e;
 
 fwd_grp_err:
 	if (decap)
-		mlx5_esw_indir_table_rule_put(esw, attr, e);
+		mlx5_esw_indir_table_rule_put(esw, e);
 recirc_rule_err:
 	mlx5_destroy_flow_group(e->recirc_grp);
 recirc_grp_err:
@@ -447,13 +316,13 @@ mlx5_esw_indir_table_entry_create(struct mlx5_eswitch *esw, struct mlx5_flow_att
 }
 
 static struct mlx5_esw_indir_table_entry *
-mlx5_esw_indir_table_entry_lookup(struct mlx5_eswitch *esw, u16 vport, u8 ip_version)
+mlx5_esw_indir_table_entry_lookup(struct mlx5_eswitch *esw, u16 vport)
 {
 	struct mlx5_esw_indir_table_entry *e;
-	u32 key = vport << 16 | ip_version;
+	u32 key = vport << 16;
 
 	hash_for_each_possible(esw->fdb_table.offloads.indir->table, e, hlist, key)
-		if (e->vport == vport && e->ip_version == ip_version)
+		if (e->vport == vport)
 			return e;
 
 	return NULL;
@@ -461,24 +330,23 @@ mlx5_esw_indir_table_entry_lookup(struct mlx5_eswitch *esw, u16 vport, u8 ip_ver
 
 struct mlx5_flow_table *mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
 						 struct mlx5_flow_attr *attr,
-						 struct mlx5_flow_spec *spec,
 						 u16 vport, bool decap)
 {
 	struct mlx5_esw_indir_table_entry *e;
 	int err;
 
 	mutex_lock(&esw->fdb_table.offloads.indir->lock);
-	e = mlx5_esw_indir_table_entry_lookup(esw, vport, attr->ip_version);
+	e = mlx5_esw_indir_table_entry_lookup(esw, vport);
 	if (e) {
 		if (!decap) {
 			e->fwd_ref++;
 		} else {
-			err = mlx5_esw_indir_table_rule_get(esw, attr, spec, e);
+			err = mlx5_esw_indir_table_rule_get(esw, attr, e);
 			if (err)
 				goto out_err;
 		}
 	} else {
-		e = mlx5_esw_indir_table_entry_create(esw, attr, spec, vport, decap);
+		e = mlx5_esw_indir_table_entry_create(esw, attr, vport, decap);
 		if (IS_ERR(e)) {
 			err = PTR_ERR(e);
 			esw_warn(esw->dev, "Failed to create indirection table, err %d.\n", err);
@@ -494,22 +362,21 @@ struct mlx5_flow_table *mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
 }
 
 void mlx5_esw_indir_table_put(struct mlx5_eswitch *esw,
-			      struct mlx5_flow_attr *attr,
 			      u16 vport, bool decap)
 {
 	struct mlx5_esw_indir_table_entry *e;
 
 	mutex_lock(&esw->fdb_table.offloads.indir->lock);
-	e = mlx5_esw_indir_table_entry_lookup(esw, vport, attr->ip_version);
+	e = mlx5_esw_indir_table_entry_lookup(esw, vport);
 	if (!e)
 		goto out;
 
 	if (!decap)
 		e->fwd_ref--;
 	else
-		mlx5_esw_indir_table_rule_put(esw, attr, e);
+		mlx5_esw_indir_table_rule_put(esw, e);
 
-	if (e->fwd_ref || e->recirc_cnt)
+	if (e->fwd_ref || e->recirc_rule)
 		goto out;
 
 	hash_del(&e->hlist);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.h b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.h
index 21d56b49d14b..036f5b3a341b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.h
@@ -13,10 +13,8 @@ mlx5_esw_indir_table_destroy(struct mlx5_esw_indir_table *indir);
 
 struct mlx5_flow_table *mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
 						 struct mlx5_flow_attr *attr,
-						 struct mlx5_flow_spec *spec,
 						 u16 vport, bool decap);
 void mlx5_esw_indir_table_put(struct mlx5_eswitch *esw,
-			      struct mlx5_flow_attr *attr,
 			      u16 vport, bool decap);
 
 bool
@@ -44,7 +42,6 @@ mlx5_esw_indir_table_destroy(struct mlx5_esw_indir_table *indir)
 static inline struct mlx5_flow_table *
 mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
 			 struct mlx5_flow_attr *attr,
-			 struct mlx5_flow_spec *spec,
 			 u16 vport, bool decap)
 {
 	return ERR_PTR(-EOPNOTSUPP);
@@ -52,7 +49,6 @@ mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
 
 static inline void
 mlx5_esw_indir_table_put(struct mlx5_eswitch *esw,
-			 struct mlx5_flow_attr *attr,
 			 u16 vport, bool decap)
 {
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index c981fa77f439..4eeb2dcbfc0f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -179,15 +179,14 @@ mlx5_eswitch_set_rule_source_port(struct mlx5_eswitch *esw,
 
 static int
 esw_setup_decap_indir(struct mlx5_eswitch *esw,
-		      struct mlx5_flow_attr *attr,
-		      struct mlx5_flow_spec *spec)
+		      struct mlx5_flow_attr *attr)
 {
 	struct mlx5_flow_table *ft;
 
 	if (!(attr->flags & MLX5_ATTR_FLAG_SRC_REWRITE))
 		return -EOPNOTSUPP;
 
-	ft = mlx5_esw_indir_table_get(esw, attr, spec,
+	ft = mlx5_esw_indir_table_get(esw, attr,
 				      mlx5_esw_indir_table_decap_vport(attr), true);
 	return PTR_ERR_OR_ZERO(ft);
 }
@@ -197,7 +196,7 @@ esw_cleanup_decap_indir(struct mlx5_eswitch *esw,
 			struct mlx5_flow_attr *attr)
 {
 	if (mlx5_esw_indir_table_decap_vport(attr))
-		mlx5_esw_indir_table_put(esw, attr,
+		mlx5_esw_indir_table_put(esw,
 					 mlx5_esw_indir_table_decap_vport(attr),
 					 true);
 }
@@ -235,7 +234,6 @@ esw_setup_ft_dest(struct mlx5_flow_destination *dest,
 		  struct mlx5_flow_act *flow_act,
 		  struct mlx5_eswitch *esw,
 		  struct mlx5_flow_attr *attr,
-		  struct mlx5_flow_spec *spec,
 		  int i)
 {
 	flow_act->flags |= FLOW_ACT_IGNORE_FLOW_LEVEL;
@@ -243,7 +241,7 @@ esw_setup_ft_dest(struct mlx5_flow_destination *dest,
 	dest[i].ft = attr->dest_ft;
 
 	if (mlx5_esw_indir_table_decap_vport(attr))
-		return esw_setup_decap_indir(esw, attr, spec);
+		return esw_setup_decap_indir(esw, attr);
 	return 0;
 }
 
@@ -298,7 +296,7 @@ static void esw_put_dest_tables_loop(struct mlx5_eswitch *esw, struct mlx5_flow_
 			mlx5_chains_put_table(chains, 0, 1, 0);
 		else if (mlx5_esw_indir_table_needed(esw, attr, esw_attr->dests[i].rep->vport,
 						     esw_attr->dests[i].mdev))
-			mlx5_esw_indir_table_put(esw, attr, esw_attr->dests[i].rep->vport,
+			mlx5_esw_indir_table_put(esw, esw_attr->dests[i].rep->vport,
 						 false);
 }
 
@@ -384,7 +382,6 @@ esw_setup_indir_table(struct mlx5_flow_destination *dest,
 		      struct mlx5_flow_act *flow_act,
 		      struct mlx5_eswitch *esw,
 		      struct mlx5_flow_attr *attr,
-		      struct mlx5_flow_spec *spec,
 		      bool ignore_flow_lvl,
 		      int *i)
 {
@@ -399,7 +396,7 @@ esw_setup_indir_table(struct mlx5_flow_destination *dest,
 			flow_act->flags |= FLOW_ACT_IGNORE_FLOW_LEVEL;
 		dest[*i].type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
 
-		dest[*i].ft = mlx5_esw_indir_table_get(esw, attr, spec,
+		dest[*i].ft = mlx5_esw_indir_table_get(esw, attr,
 						       esw_attr->dests[j].rep->vport, false);
 		if (IS_ERR(dest[*i].ft)) {
 			err = PTR_ERR(dest[*i].ft);
@@ -408,7 +405,7 @@ esw_setup_indir_table(struct mlx5_flow_destination *dest,
 	}
 
 	if (mlx5_esw_indir_table_decap_vport(attr)) {
-		err = esw_setup_decap_indir(esw, attr, spec);
+		err = esw_setup_decap_indir(esw, attr);
 		if (err)
 			goto err_indir_tbl_get;
 	}
@@ -511,14 +508,14 @@ esw_setup_dests(struct mlx5_flow_destination *dest,
 		err = esw_setup_mtu_dest(dest, &attr->meter_attr, *i);
 		(*i)++;
 	} else if (esw_is_indir_table(esw, attr)) {
-		err = esw_setup_indir_table(dest, flow_act, esw, attr, spec, true, i);
+		err = esw_setup_indir_table(dest, flow_act, esw, attr, true, i);
 	} else if (esw_is_chain_src_port_rewrite(esw, esw_attr)) {
 		err = esw_setup_chain_src_port_rewrite(dest, flow_act, esw, chains, attr, i);
 	} else {
 		*i = esw_setup_vport_dests(dest, flow_act, esw, esw_attr, *i);
 
 		if (attr->dest_ft) {
-			err = esw_setup_ft_dest(dest, flow_act, esw, attr, spec, *i);
+			err = esw_setup_ft_dest(dest, flow_act, esw, attr, *i);
 			(*i)++;
 		} else if (attr->dest_chain) {
 			err = esw_setup_chain_dest(dest, flow_act, chains, attr->dest_chain,
@@ -727,7 +724,7 @@ mlx5_eswitch_add_fwd_rule(struct mlx5_eswitch *esw,
 	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
 	for (i = 0; i < esw_attr->split_count; i++) {
 		if (esw_is_indir_table(esw, attr))
-			err = esw_setup_indir_table(dest, &flow_act, esw, attr, spec, false, &i);
+			err = esw_setup_indir_table(dest, &flow_act, esw, attr, false, &i);
 		else if (esw_is_chain_src_port_rewrite(esw, esw_attr))
 			err = esw_setup_chain_src_port_rewrite(dest, &flow_act, esw, chains, attr,
 							       &i);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 14/15] net/mlx5e: Remove redundant allocation of spec in create indirect fwd group
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (12 preceding siblings ...)
  2023-01-18 18:36 ` [net-next 13/15] net/mlx5e: Support Geneve and GRE with VF tunnel offload Saeed Mahameed
@ 2023-01-18 18:36 ` Saeed Mahameed
  2023-01-18 21:50   ` Jacob Keller
  2023-01-18 18:36 ` [net-next 15/15] net/mlx5e: Use read lock for eswitch get callbacks Saeed Mahameed
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:36 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Maor Dickman, Roi Dayan

From: Maor Dickman <maord@nvidia.com>

mlx5_add_flow_rules supports creating rules without any matches by passing NULL
pointer instead of spec, if NULL is passed it will use a static empty spec.
This make allocation of spec in mlx5_create_indir_fwd_group unnecessary.

Remove the redundant allocation.

Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/esw/indir_table.c  | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
index 8a94870c5b43..9959e9fd15a1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
@@ -212,19 +212,12 @@ static int mlx5_create_indir_fwd_group(struct mlx5_eswitch *esw,
 	int err = 0, inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
 	struct mlx5_flow_destination dest = {};
 	struct mlx5_flow_act flow_act = {};
-	struct mlx5_flow_spec *spec;
 	u32 *in;
 
 	in = kvzalloc(inlen, GFP_KERNEL);
 	if (!in)
 		return -ENOMEM;
 
-	spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
-	if (!spec) {
-		kvfree(in);
-		return -ENOMEM;
-	}
-
 	/* Hold one entry */
 	MLX5_SET(create_flow_group_in, in, start_flow_index, MLX5_ESW_INDIR_TABLE_FWD_IDX);
 	MLX5_SET(create_flow_group_in, in, end_flow_index, MLX5_ESW_INDIR_TABLE_FWD_IDX);
@@ -240,14 +233,13 @@ static int mlx5_create_indir_fwd_group(struct mlx5_eswitch *esw,
 	dest.vport.num = e->vport;
 	dest.vport.vhca_id = MLX5_CAP_GEN(esw->dev, vhca_id);
 	dest.vport.flags = MLX5_FLOW_DEST_VPORT_VHCA_ID;
-	e->fwd_rule = mlx5_add_flow_rules(e->ft, spec, &flow_act, &dest, 1);
+	e->fwd_rule = mlx5_add_flow_rules(e->ft, NULL, &flow_act, &dest, 1);
 	if (IS_ERR(e->fwd_rule)) {
 		mlx5_destroy_flow_group(e->fwd_grp);
 		err = PTR_ERR(e->fwd_rule);
 	}
 
 err_out:
-	kvfree(spec);
 	kvfree(in);
 	return err;
 }
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [net-next 15/15] net/mlx5e: Use read lock for eswitch get callbacks
  2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
                   ` (13 preceding siblings ...)
  2023-01-18 18:36 ` [net-next 14/15] net/mlx5e: Remove redundant allocation of spec in create indirect fwd group Saeed Mahameed
@ 2023-01-18 18:36 ` Saeed Mahameed
  2023-01-18 21:51   ` Jacob Keller
  14 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2023-01-18 18:36 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Leon Romanovsky

From: Leon Romanovsky <leonro@nvidia.com>

In commit 367dfa121205 ("net/mlx5: Remove devl_unlock from
mlx5_eswtich_mode_callback_enter") all functions were converted
to use write lock without relation to their actual purpose.

Change the devlink eswitch getters to use read and not write locks.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c   | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 4eeb2dcbfc0f..5fb9d5e99734 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -3572,9 +3572,9 @@ int mlx5_devlink_eswitch_mode_get(struct devlink *devlink, u16 *mode)
 	if (IS_ERR(esw))
 		return PTR_ERR(esw);
 
-	down_write(&esw->mode_lock);
+	down_read(&esw->mode_lock);
 	err = esw_mode_to_devlink(esw->mode, mode);
-	up_write(&esw->mode_lock);
+	up_read(&esw->mode_lock);
 	return err;
 }
 
@@ -3672,9 +3672,9 @@ int mlx5_devlink_eswitch_inline_mode_get(struct devlink *devlink, u8 *mode)
 	if (IS_ERR(esw))
 		return PTR_ERR(esw);
 
-	down_write(&esw->mode_lock);
+	down_read(&esw->mode_lock);
 	err = esw_inline_mode_to_devlink(esw->offloads.inline_mode, mode);
-	up_write(&esw->mode_lock);
+	up_read(&esw->mode_lock);
 	return err;
 }
 
@@ -3746,9 +3746,9 @@ int mlx5_devlink_eswitch_encap_mode_get(struct devlink *devlink,
 	if (IS_ERR(esw))
 		return PTR_ERR(esw);
 
-	down_write(&esw->mode_lock);
+	down_read(&esw->mode_lock);
 	*encap = esw->offloads.encap;
-	up_write(&esw->mode_lock);
+	up_read(&esw->mode_lock);
 	return 0;
 }
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [net-next 01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB
  2023-01-18 18:35 ` [net-next 01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB Saeed Mahameed
@ 2023-01-18 21:31   ` Jacob Keller
  2023-01-20  4:00   ` patchwork-bot+netdevbpf
  1 sibling, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:31 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Rahul Rameshbabu,
	kernel test robot



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> 
> Send WQEBB size is 64 bytes and the max number of WQEBBs for an SQ is 255.
> For 16KB pages and greater, there is always sufficient spaces for all
> WQEBBs of an SQ. Cast mlx5e_get_max_sq_wqebbs(mdev) to u16. Prevents
> -Wtautological-constant-out-of-range-compare warnings from occurring when
> PAGE_SIZE >= 16KB.
> 
> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> Reported-by: kernel test robot <lkp@intel.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> index 853f312cd757..5578f92f7e0f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> @@ -445,7 +445,7 @@ mlx5e_set_eseg_swp(struct sk_buff *skb, struct mlx5_wqe_eth_seg *eseg,
>  
>  static inline u16 mlx5e_stop_room_for_wqe(struct mlx5_core_dev *mdev, u16 wqe_size)
>  {
> -	WARN_ON_ONCE(PAGE_SIZE / MLX5_SEND_WQE_BB < mlx5e_get_max_sq_wqebbs(mdev));
> +	WARN_ON_ONCE(PAGE_SIZE / MLX5_SEND_WQE_BB < (u16)mlx5e_get_max_sq_wqebbs(mdev));
>  
>  	/* A WQE must not cross the page boundary, hence two conditions:
>  	 * 1. Its size must not exceed the page size.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 02/15] net/mlx5: Suppress error logging on UCTX creation
  2023-01-18 18:35 ` [net-next 02/15] net/mlx5: Suppress error logging on UCTX creation Saeed Mahameed
@ 2023-01-18 21:32   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:32 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Yishai Hadas, Maor Gottlieb



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Yishai Hadas <yishaih@nvidia.com>
> 
> Suppress error logging that can be triggered by userspace upon DEVX UCTX
> creation.
> 
> The reason that it's not suppressed today with the uid check to suppress
> DEVX is that MLX5_CMD_OP_CREATE_UCTX command still doesn't have a uid as
> it comes to create it..
> 

uid isn't added until later in the create flow. ok.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> index c3c8a7148723..382d02f6619c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> @@ -813,7 +813,8 @@ static void cmd_status_print(struct mlx5_core_dev *dev, void *in, void *out)
>  	op_mod = MLX5_GET(mbox_in, in, op_mod);
>  	uid    = MLX5_GET(mbox_in, in, uid);
>  
> -	if (!uid && opcode != MLX5_CMD_OP_DESTROY_MKEY)
> +	if (!uid && opcode != MLX5_CMD_OP_DESTROY_MKEY &&
> +	    opcode != MLX5_CMD_OP_CREATE_UCTX)
>  		mlx5_cmd_out_err(dev, opcode, op_mod, out);
>  }
>  

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-18 18:35 ` [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control Saeed Mahameed
@ 2023-01-18 21:33   ` Jacob Keller
  2023-01-20  3:46     ` Jakub Kicinski
  0 siblings, 1 reply; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:33 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Rahul Rameshbabu, Gal Pressman



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> 
> The adjtime function supports using hardware to set the clock offset when
> the delta was supported by the hardware. When the delta is not supported by
> the hardware, the driver handles adjusting the clock. The newly-introduced
> adjphase function is similar to the adjtime function, except it guarantees
> that a provided clock offset will be used directly by the hardware to
> adjust the PTP clock. When the range is not acceptable by the hardware, an
> error is returned.
> 

Makes sense. Once you've verified that the delta is within the accepted
range you can just re-use the existing adjtime function.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> Reviewed-by: Gal Pressman <gal@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> index 69318b143268..ecdff26a22b0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> @@ -326,6 +326,14 @@ static int mlx5_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
>  	return 0;
>  }
>  
> +static int mlx5_ptp_adjphase(struct ptp_clock_info *ptp, s32 delta)
> +{
> +	if (delta < S16_MIN || delta > S16_MAX)
> +		return -ERANGE;
> +
> +	return mlx5_ptp_adjtime(ptp, delta);
> +}
> +
>  static int mlx5_ptp_adjfreq_real_time(struct mlx5_core_dev *mdev, s32 freq)
>  {
>  	u32 in[MLX5_ST_SZ_DW(mtutc_reg)] = {};
> @@ -688,6 +696,7 @@ static const struct ptp_clock_info mlx5_ptp_clock_info = {
>  	.n_pins		= 0,
>  	.pps		= 0,
>  	.adjfine	= mlx5_ptp_adjfine,
> +	.adjphase	= mlx5_ptp_adjphase,
>  	.adjtime	= mlx5_ptp_adjtime,
>  	.gettimex64	= mlx5_ptp_gettimex,
>  	.settime64	= mlx5_ptp_settime,

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 04/15] net/mlx5: Add hardware extended range support for PTP adjtime and adjphase
  2023-01-18 18:35 ` [net-next 04/15] net/mlx5: Add hardware extended range support for PTP adjtime and adjphase Saeed Mahameed
@ 2023-01-18 21:35   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:35 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Rahul Rameshbabu, Gal Pressman



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> 
> Capable hardware can use an extended range for offsetting the clock. An
> extended range of [-200000,200000] is used instead of [-32768,32767] for
> the delta/phase parameter of the adjtime/adjphase ptp_clock_info callbacks.
> 
> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> Reviewed-by: Gal Pressman <gal@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> ---
>  .../ethernet/mellanox/mlx5/core/lib/clock.c   | 34 +++++++++++++++++--
>  include/linux/mlx5/mlx5_ifc.h                 |  4 ++-
>  2 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> index ecdff26a22b0..75510a12ab02 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> @@ -69,6 +69,13 @@ enum {
>  	MLX5_MTPPS_FS_OUT_PULSE_DURATION_NS     = BIT(0xa),
>  };
>  
> +enum {
> +	MLX5_MTUTC_OPERATION_ADJUST_TIME_MIN          = S16_MIN,
> +	MLX5_MTUTC_OPERATION_ADJUST_TIME_MAX          = S16_MAX,
> +	MLX5_MTUTC_OPERATION_ADJUST_TIME_EXTENDED_MIN = -200000,
> +	MLX5_MTUTC_OPERATION_ADJUST_TIME_EXTENDED_MAX = 200000,
> +};
> +
>  static bool mlx5_real_time_mode(struct mlx5_core_dev *mdev)
>  {
>  	return (mlx5_is_real_time_rq(mdev) || mlx5_is_real_time_sq(mdev));
> @@ -86,6 +93,22 @@ static bool mlx5_modify_mtutc_allowed(struct mlx5_core_dev *mdev)
>  	return MLX5_CAP_MCAM_FEATURE(mdev, ptpcyc2realtime_modify);
>  }
>  
> +static bool mlx5_is_mtutc_time_adj_cap(struct mlx5_core_dev *mdev, s64 delta)
> +{
> +	s64 min = MLX5_MTUTC_OPERATION_ADJUST_TIME_MIN;
> +	s64 max = MLX5_MTUTC_OPERATION_ADJUST_TIME_MAX;
> +
> +	if (MLX5_CAP_MCAM_FEATURE(mdev, mtutc_time_adjustment_extended_range)) {
> +		min = MLX5_MTUTC_OPERATION_ADJUST_TIME_EXTENDED_MIN;
> +		max = MLX5_MTUTC_OPERATION_ADJUST_TIME_EXTENDED_MAX;
> +	}
> +
> +	if (delta < min || delta > max)
> +		return false;
> +
> +	return true;
> +}
> +
>  static int mlx5_set_mtutc(struct mlx5_core_dev *dev, u32 *mtutc, u32 size)
>  {
>  	u32 out[MLX5_ST_SZ_DW(mtutc_reg)] = {};
> @@ -288,8 +311,8 @@ static int mlx5_ptp_adjtime_real_time(struct mlx5_core_dev *mdev, s64 delta)
>  	if (!mlx5_modify_mtutc_allowed(mdev))
>  		return 0;
>  
> -	/* HW time adjustment range is s16. If out of range, settime instead */
> -	if (delta < S16_MIN || delta > S16_MAX) {
> +	/* HW time adjustment range is checked. If out of range, settime instead */
> +	if (!mlx5_is_mtutc_time_adj_cap(mdev, delta)) {
>  		struct timespec64 ts;
>  		s64 ns;
>  
> @@ -328,7 +351,12 @@ static int mlx5_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
>  
>  static int mlx5_ptp_adjphase(struct ptp_clock_info *ptp, s32 delta)
>  {
> -	if (delta < S16_MIN || delta > S16_MAX)
> +	struct mlx5_clock *clock = container_of(ptp, struct mlx5_clock, ptp_info);
> +	struct mlx5_core_dev *mdev;
> +
> +	mdev = container_of(clock, struct mlx5_core_dev, clock);
> +
> +	if (!mlx5_is_mtutc_time_adj_cap(mdev, delta))
>  		return -ERANGE;
>  
>  	return mlx5_ptp_adjtime(ptp, delta);
> diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
> index a84bdeeed2c6..0b102c651fe2 100644
> --- a/include/linux/mlx5/mlx5_ifc.h
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -9941,7 +9941,9 @@ struct mlx5_ifc_pcam_reg_bits {
>  };
>  
>  struct mlx5_ifc_mcam_enhanced_features_bits {
> -	u8         reserved_at_0[0x5d];
> +	u8         reserved_at_0[0x51];
> +	u8         mtutc_time_adjustment_extended_range[0x1];
> +	u8         reserved_at_52[0xb];
>  	u8         mcia_32dwords[0x1];
>  	u8         out_pulse_duration_ns[0x1];
>  	u8         npps_period[0x1];

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 05/15] net/mlx5: E-switch, Remove redundant comment about meta rules
  2023-01-18 18:35 ` [net-next 05/15] net/mlx5: E-switch, Remove redundant comment about meta rules Saeed Mahameed
@ 2023-01-18 21:40   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:40 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Roi Dayan <roid@nvidia.com>
> 
> Meta rules are created/destroyed per vport and not in eswitch
> init/destroy.
> 
> Signed-off-by: Roi Dayan <roid@nvidia.com>
> Reviewed-by: Maor Dickman <maord@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index bbb6dab3b21f..05a352d8d13f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> @@ -1406,9 +1406,7 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
>  	mlx5_eswitch_unload_vf_vports(esw, esw->esw_funcs.num_vfs);
>  	if (clear_vf)
>  		mlx5_eswitch_clear_vf_vports_info(esw);
> -	/* If disabling sriov in switchdev mode, free meta rules here
> -	 * because it depends on num_vfs.
> -	 */
> +

The comment was added in f019679ea5f2 ("net/mlx5: E-switch, Remove
dependency between sriov and eswitch mode") which included the line
esw_offloads_del_send_to_vport_meta_rules(esw);

That was removed in 430e2d5e2a98 ("net/mlx5: E-Switch, Move send to
vport meta rule creation")

This could be viewed as:

Fixes: 430e2d5e2a98 ("net/mlx5: E-Switch, Move send to vport meta rule
creation")

Makes sense.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

>  	if (esw->mode == MLX5_ESWITCH_OFFzLOADS) {
>  		struct devlink *devlink = priv_to_devlink(esw->dev);
>  

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 06/15] net/mlx5e: Fail with messages when params are not valid for XSK
  2023-01-18 18:35 ` [net-next 06/15] net/mlx5e: Fail with messages when params are not valid for XSK Saeed Mahameed
@ 2023-01-18 21:41   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:41 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Adham Faris



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Adham Faris <afaris@nvidia.com>
> 
> Current XSK prerequisites validation implementation
> (setup.c/mlx5e_validate_xsk_param()) fails silently when xsk
> prerequisites are not fulfilled.
> Add error messages to the kernel log to help the user understand what
> went wrong when params are not valid for XSK.
> 
> Signed-off-by: Adham Faris <afaris@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

Makes sense. I'm not familiar with the XSK configuration but I assume we
don't have netlink_ext_ack or similar for error message reporting here.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 07/15] net/mlx5e: Add warning when log WQE size is smaller than log stride size
  2023-01-18 18:35 ` [net-next 07/15] net/mlx5e: Add warning when log WQE size is smaller than log stride size Saeed Mahameed
@ 2023-01-18 21:42   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:42 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Adham Faris



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Adham Faris <afaris@nvidia.com>
> 
> Add warning macro in the function mlx5e_mpwqe_get_log_num_strides()
> when log WQE size is smaller than log stride size. Theoretically this
> should not happen.
> 
> Signed-off-by: Adham Faris <afaris@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en/params.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> index a17b768b81f1..53d2979e9457 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> @@ -411,9 +411,14 @@ u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
>  {
>  	enum mlx5e_mpwrq_umr_mode umr_mode = mlx5e_mpwrq_umr_mode(mdev, xsk);
>  	u8 page_shift = mlx5e_mpwrq_page_shift(mdev, xsk);
> -
> -	return mlx5e_mpwrq_log_wqe_sz(mdev, page_shift, umr_mode) -
> -		mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk);
> +	u8 log_wqe_size, log_stride_size;
> +
> +	log_wqe_size = mlx5e_mpwrq_log_wqe_sz(mdev, page_shift, umr_mode);
> +	log_stride_size = mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk);
> +	WARN(log_wqe_size < log_stride_size,
> +	     "Log WQE size %u < log stride size %u (page shift %u, umr mode %d, xsk on? %d)\n",
> +	     log_wqe_size, log_stride_size, page_shift, umr_mode, !!xsk);
> +	return log_wqe_size - log_stride_size;
>  }
>  
>  u8 mlx5e_mpwqe_get_min_wqe_bulk(unsigned int wq_sz)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 08/15] net/mlx5e: TC, Pass flow attr to attach/detach mod hdr functions
  2023-01-18 18:35 ` [net-next 08/15] net/mlx5e: TC, Pass flow attr to attach/detach mod hdr functions Saeed Mahameed
@ 2023-01-18 21:42   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:42 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Roi Dayan <roid@nvidia.com>
> 
> In preparation to remove duplicate functions handling mod hdr allocation
> and the fact that modify hdr should be per flow attr and not flow
> pass flow attr to the attach and detach mod hdr funcs.
> 
> Signed-off-by: Roi Dayan <roid@nvidia.com>
> Reviewed-by: Maor Dickman <maord@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> ---

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 09/15] net/mlx5e: TC, Add tc prefix to attach/detach hdr functions
  2023-01-18 18:35 ` [net-next 09/15] net/mlx5e: TC, Add tc prefix to attach/detach " Saeed Mahameed
@ 2023-01-18 21:44   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:44 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Roi Dayan <roid@nvidia.com>
> 
> Currently there are confusing names for attach/detach functions.
> 
> mlx5e_attach_mod_hdr() vs mlx5e_mod_hdr_attach()
> mlx5e_detach_mod_hdr() vs mlx5e_mod_hdr_detach()
> 
> Add tc prefix to the functions that are in en_tc.c to separate
> from the functions in mod_hdr.c which has the mod_hdr prefix.
> 

Makes it a bit easier to distinguish the pairs.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 10/15] net/mlx5e: TC, Use common function allocating flow mod hdr or encap mod hdr
  2023-01-18 18:35 ` [net-next 10/15] net/mlx5e: TC, Use common function allocating flow mod hdr or encap mod hdr Saeed Mahameed
@ 2023-01-18 21:45   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:45 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Roi Dayan <roid@nvidia.com>
> 
> Use mlx5e_tc_attach_mod_hdr() when allocating encap mod hdr and
> remove mlx5e_tc_add_flow_mod_hdr() which is not being used now.
> 
> Signed-off-by: Roi Dayan <roid@nvidia.com>
> Reviewed-by: Maor Dickman <maord@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 11/15] net/mlx5e: Warn when destroying mod hdr hash table that is not empty
  2023-01-18 18:35 ` [net-next 11/15] net/mlx5e: Warn when destroying mod hdr hash table that is not empty Saeed Mahameed
@ 2023-01-18 21:45   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:45 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan, Maor Dickman



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Roi Dayan <roid@nvidia.com>
> 
> To avoid memory leaks add a warn when destroying mod hdr hash table
> but the hash table is not empty.
> 

Strictly this is to help catch a memory leak, as the WARN_ON itself
doesn't prevent any leaking.. It does make it much easier to spot though.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> Signed-off-by: Roi Dayan <roid@nvidia.com>
> Reviewed-by: Maor Dickman <maord@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c b/drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c
> index 17325c5d6516..cf60f0a3ff23 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/mod_hdr.c
> @@ -47,6 +47,7 @@ void mlx5e_mod_hdr_tbl_init(struct mod_hdr_tbl *tbl)
>  
>  void mlx5e_mod_hdr_tbl_destroy(struct mod_hdr_tbl *tbl)
>  {
> +	WARN_ON(!hash_empty(tbl->hlist));
>  	mutex_destroy(&tbl->lock);
>  }
>  

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 12/15] net/mlx5: E-Switch, Fix typo for egress
  2023-01-18 18:35 ` [net-next 12/15] net/mlx5: E-Switch, Fix typo for egress Saeed Mahameed
@ 2023-01-18 21:46   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:46 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Roi Dayan



On 1/18/2023 10:35 AM, Saeed Mahameed wrote:
> From: Roi Dayan <roid@nvidia.com>
> 
> Fix engress to egress.
> 
> Signed-off-by: Roi Dayan <roid@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 13/15] net/mlx5e: Support Geneve and GRE with VF tunnel offload
  2023-01-18 18:36 ` [net-next 13/15] net/mlx5e: Support Geneve and GRE with VF tunnel offload Saeed Mahameed
@ 2023-01-18 21:48   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:48 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Maor Dickman, Paul Blakey,
	Roi Dayan



On 1/18/2023 10:36 AM, Saeed Mahameed wrote:
> From: Maor Dickman <maord@nvidia.com>
> 
> Today VF tunnel offload (tunnel endpoint is on VF) is implemented
> by indirect table which use rules that match on VXLAN VNI to
> recirculated to root table, this limit the support for only
> VXLAN tunnels.
> 
> This patch change indirect table to use one single match all rule
> to recirculated to root table which is added when any tunnel decap
> rule is added with tunnel endpoint is VF. This allow support of
> Geneve and GRE with this configuration.
> 
> Signed-off-by: Maor Dickman <maord@nvidia.com>
> Reviewed-by: Paul Blakey <paulb@nvidia.com>
> Reviewed-by: Roi Dayan <roid@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> ---
>  .../ethernet/mellanox/mlx5/core/en/tc_tun.c   |   2 -
>  .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   9 +-
>  .../net/ethernet/mellanox/mlx5/core/en_tc.h   |   2 -
>  .../mellanox/mlx5/core/esw/indir_table.c      | 203 +++---------------
>  .../mellanox/mlx5/core/esw/indir_table.h      |   4 -
>  .../mellanox/mlx5/core/eswitch_offloads.c     |  23 +-
>  6 files changed, 48 insertions(+), 195 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
> index e6f64d890fb3..83bb0811e774 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
> @@ -745,8 +745,6 @@ int mlx5e_tc_tun_route_lookup(struct mlx5e_priv *priv,
>  		if (err)
>  			goto out;
>  
> -		esw_attr->rx_tun_attr->vni = MLX5_GET(fte_match_param, spec->match_value,
> -						      misc_parameters.vxlan_vni);
>  		esw_attr->rx_tun_attr->decap_vport = vport_num;
>  	} else if (netif_is_ovs_master(attr.route_dev) && mlx5e_tc_int_port_supported(esw)) {
>  		int_port = mlx5e_tc_int_port_get(mlx5e_get_int_port_priv(priv),
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> index c978f4b12131..c8377b4c8c8e 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> @@ -2595,13 +2595,13 @@ static int parse_tunnel_attr(struct mlx5e_priv *priv,
>  		err = mlx5e_tc_set_attr_rx_tun(flow, spec);
>  		if (err)
>  			return err;
> -	} else if (tunnel && tunnel->tunnel_type == MLX5E_TC_TUNNEL_TYPE_VXLAN) {
> +	} else if (tunnel) {
>  		struct mlx5_flow_spec *tmp_spec;
>  
>  		tmp_spec = kvzalloc(sizeof(*tmp_spec), GFP_KERNEL);
>  		if (!tmp_spec) {
> -			NL_SET_ERR_MSG_MOD(extack, "Failed to allocate memory for vxlan tmp spec");
> -			netdev_warn(priv->netdev, "Failed to allocate memory for vxlan tmp spec");
> +			NL_SET_ERR_MSG_MOD(extack, "Failed to allocate memory for tunnel tmp spec");
> +			netdev_warn(priv->netdev, "Failed to allocate memory for tunnel tmp spec");
>  			return -ENOMEM;
>  		}
>  		memcpy(tmp_spec, spec, sizeof(*tmp_spec));
> @@ -4623,9 +4623,6 @@ __mlx5e_add_fdb_flow(struct mlx5e_priv *priv,
>  	if (err)
>  		goto err_free;
>  
> -	/* always set IP version for indirect table handling */
> -	flow->attr->ip_version = mlx5e_tc_get_ip_version(&parse_attr->spec, true);
> -
>  	err = parse_tc_fdb_actions(priv, &rule->action, flow, extack);
>  	if (err)
>  		goto err_free;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
> index 66e8d9d89082..ce516dc7f3fd 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
> @@ -84,7 +84,6 @@ struct mlx5_flow_attr {
>  	struct mlx5_flow_table *dest_ft;
>  	u8 inner_match_level;
>  	u8 outer_match_level;
> -	u8 ip_version;
>  	u8 tun_ip_version;
>  	int tunnel_id; /* mapped tunnel id */
>  	u32 flags;
> @@ -136,7 +135,6 @@ struct mlx5_rx_tun_attr {
>  		__be32 v4;
>  		struct in6_addr v6;
>  	} dst_ip; /* Valid if decap_vport is not zero */
> -	u32 vni;
>  };
>  
>  #define MLX5E_TC_TABLE_CHAIN_TAG_BITS 16
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
> index c9a91158e99c..8a94870c5b43 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
> @@ -16,18 +16,12 @@
>  #include "lib/fs_chains.h"
>  #include "en/mod_hdr.h"
>  
> -#define MLX5_ESW_INDIR_TABLE_SIZE 128
> -#define MLX5_ESW_INDIR_TABLE_RECIRC_IDX_MAX (MLX5_ESW_INDIR_TABLE_SIZE - 2)
> +#define MLX5_ESW_INDIR_TABLE_SIZE 2
> +#define MLX5_ESW_INDIR_TABLE_RECIRC_IDX (MLX5_ESW_INDIR_TABLE_SIZE - 2)
>  #define MLX5_ESW_INDIR_TABLE_FWD_IDX (MLX5_ESW_INDIR_TABLE_SIZE - 1)
>  
>  struct mlx5_esw_indir_table_rule {
> -	struct list_head list;
>  	struct mlx5_flow_handle *handle;
> -	union {
> -		__be32 v4;
> -		struct in6_addr v6;
> -	} dst_ip;
> -	u32 vni;
>  	struct mlx5_modify_hdr *mh;
>  	refcount_t refcnt;
>  };
> @@ -38,12 +32,10 @@ struct mlx5_esw_indir_table_entry {
>  	struct mlx5_flow_group *recirc_grp;
>  	struct mlx5_flow_group *fwd_grp;
>  	struct mlx5_flow_handle *fwd_rule;
> -	struct list_head recirc_rules;
> -	int recirc_cnt;
> +	struct mlx5_esw_indir_table_rule *recirc_rule;
>  	int fwd_ref;
>  
>  	u16 vport;
> -	u8 ip_version;
>  };
>  
>  struct mlx5_esw_indir_table {
> @@ -89,7 +81,6 @@ mlx5_esw_indir_table_needed(struct mlx5_eswitch *esw,
>  	return esw_attr->in_rep->vport == MLX5_VPORT_UPLINK &&
>  		vf_sf_vport &&
>  		esw->dev == dest_mdev &&
> -		attr->ip_version &&
>  		attr->flags & MLX5_ATTR_FLAG_SRC_REWRITE;
>  }
>  
> @@ -101,27 +92,8 @@ mlx5_esw_indir_table_decap_vport(struct mlx5_flow_attr *attr)
>  	return esw_attr->rx_tun_attr ? esw_attr->rx_tun_attr->decap_vport : 0;
>  }
>  
> -static struct mlx5_esw_indir_table_rule *
> -mlx5_esw_indir_table_rule_lookup(struct mlx5_esw_indir_table_entry *e,
> -				 struct mlx5_esw_flow_attr *attr)
> -{
> -	struct mlx5_esw_indir_table_rule *rule;
> -
> -	list_for_each_entry(rule, &e->recirc_rules, list)
> -		if (rule->vni == attr->rx_tun_attr->vni &&
> -		    !memcmp(&rule->dst_ip, &attr->rx_tun_attr->dst_ip,
> -			    sizeof(attr->rx_tun_attr->dst_ip)))
> -			goto found;
> -	return NULL;
> -
> -found:
> -	refcount_inc(&rule->refcnt);
> -	return rule;
> -}
> -
>  static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
>  					 struct mlx5_flow_attr *attr,
> -					 struct mlx5_flow_spec *spec,
>  					 struct mlx5_esw_indir_table_entry *e)
>  {
>  	struct mlx5_esw_flow_attr *esw_attr = attr->esw_attr;
> @@ -130,73 +102,18 @@ static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
>  	struct mlx5_flow_destination dest = {};
>  	struct mlx5_esw_indir_table_rule *rule;
>  	struct mlx5_flow_act flow_act = {};
> -	struct mlx5_flow_spec *rule_spec;
>  	struct mlx5_flow_handle *handle;
>  	int err = 0;
>  	u32 data;
>  
> -	rule = mlx5_esw_indir_table_rule_lookup(e, esw_attr);
> -	if (rule)
> +	if (e->recirc_rule) {
> +		refcount_inc(&e->recirc_rule->refcnt);
>  		return 0;
> -
> -	if (e->recirc_cnt == MLX5_ESW_INDIR_TABLE_RECIRC_IDX_MAX)
> -		return -EINVAL;
> -
> -	rule_spec = kvzalloc(sizeof(*rule_spec), GFP_KERNEL);
> -	if (!rule_spec)
> -		return -ENOMEM;
> -
> -	rule = kzalloc(sizeof(*rule), GFP_KERNEL);
> -	if (!rule) {
> -		err = -ENOMEM;
> -		goto out;
>  	}
>  
> -	rule_spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS |
> -					   MLX5_MATCH_MISC_PARAMETERS |
> -					   MLX5_MATCH_MISC_PARAMETERS_2;
> -	if (MLX5_CAP_FLOWTABLE_NIC_RX(esw->dev, ft_field_support.outer_ip_version)) {
> -		MLX5_SET(fte_match_param, rule_spec->match_criteria,
> -			 outer_headers.ip_version, 0xf);
> -		MLX5_SET(fte_match_param, rule_spec->match_value, outer_headers.ip_version,
> -			 attr->ip_version);
> -	} else if (attr->ip_version) {
> -		MLX5_SET_TO_ONES(fte_match_param, rule_spec->match_criteria,
> -				 outer_headers.ethertype);
> -		MLX5_SET(fte_match_param, rule_spec->match_value, outer_headers.ethertype,
> -			 (attr->ip_version == 4 ? ETH_P_IP : ETH_P_IPV6));
> -	} else {
> -		err = -EOPNOTSUPP;
> -		goto err_ethertype;
> -	}
> -
> -	if (attr->ip_version == 4) {
> -		MLX5_SET_TO_ONES(fte_match_param, rule_spec->match_criteria,
> -				 outer_headers.dst_ipv4_dst_ipv6.ipv4_layout.ipv4);
> -		MLX5_SET(fte_match_param, rule_spec->match_value,
> -			 outer_headers.dst_ipv4_dst_ipv6.ipv4_layout.ipv4,
> -			 ntohl(esw_attr->rx_tun_attr->dst_ip.v4));
> -	} else if (attr->ip_version == 6) {
> -		int len = sizeof(struct in6_addr);
> -
> -		memset(MLX5_ADDR_OF(fte_match_param, rule_spec->match_criteria,
> -				    outer_headers.dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
> -		       0xff, len);
> -		memcpy(MLX5_ADDR_OF(fte_match_param, rule_spec->match_value,
> -				    outer_headers.dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
> -		       &esw_attr->rx_tun_attr->dst_ip.v6, len);
> -	}
> -
> -	MLX5_SET_TO_ONES(fte_match_param, rule_spec->match_criteria,
> -			 misc_parameters.vxlan_vni);
> -	MLX5_SET(fte_match_param, rule_spec->match_value, misc_parameters.vxlan_vni,
> -		 MLX5_GET(fte_match_param, spec->match_value, misc_parameters.vxlan_vni));
> -
> -	MLX5_SET(fte_match_param, rule_spec->match_criteria,
> -		 misc_parameters_2.metadata_reg_c_0, mlx5_eswitch_get_vport_metadata_mask());
> -	MLX5_SET(fte_match_param, rule_spec->match_value, misc_parameters_2.metadata_reg_c_0,
> -		 mlx5_eswitch_get_vport_metadata_for_match(esw_attr->in_mdev->priv.eswitch,
> -							   MLX5_VPORT_UPLINK));
> +	rule = kzalloc(sizeof(*rule), GFP_KERNEL);
> +	if (!rule)
> +		return -ENOMEM;
>  
>  	/* Modify flow source to recirculate packet */
>  	data = mlx5_eswitch_get_vport_metadata_for_set(esw, esw_attr->rx_tun_attr->decap_vport);
> @@ -219,13 +136,14 @@ static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
>  
>  	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST | MLX5_FLOW_CONTEXT_ACTION_MOD_HDR;
>  	flow_act.flags = FLOW_ACT_IGNORE_FLOW_LEVEL | FLOW_ACT_NO_APPEND;
> +	flow_act.fg = e->recirc_grp;
>  	dest.type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
>  	dest.ft = mlx5_chains_get_table(chains, 0, 1, 0);
>  	if (IS_ERR(dest.ft)) {
>  		err = PTR_ERR(dest.ft);
>  		goto err_table;
>  	}
> -	handle = mlx5_add_flow_rules(e->ft, rule_spec, &flow_act, &dest, 1);
> +	handle = mlx5_add_flow_rules(e->ft, NULL, &flow_act, &dest, 1);
>  	if (IS_ERR(handle)) {
>  		err = PTR_ERR(handle);
>  		goto err_handle;
> @@ -233,14 +151,10 @@ static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
>  
>  	mlx5e_mod_hdr_dealloc(&mod_acts);
>  	rule->handle = handle;
> -	rule->vni = esw_attr->rx_tun_attr->vni;
>  	rule->mh = flow_act.modify_hdr;
> -	memcpy(&rule->dst_ip, &esw_attr->rx_tun_attr->dst_ip,
> -	       sizeof(esw_attr->rx_tun_attr->dst_ip));
>  	refcount_set(&rule->refcnt, 1);
> -	list_add(&rule->list, &e->recirc_rules);
> -	e->recirc_cnt++;
> -	goto out;
> +	e->recirc_rule = rule;
> +	return 0;
>  
>  err_handle:
>  	mlx5_chains_put_table(chains, 0, 1, 0);
> @@ -250,89 +164,44 @@ static int mlx5_esw_indir_table_rule_get(struct mlx5_eswitch *esw,
>  err_mod_hdr_regc1:
>  	mlx5e_mod_hdr_dealloc(&mod_acts);
>  err_mod_hdr_regc0:
> -err_ethertype:
>  	kfree(rule);
> -out:
> -	kvfree(rule_spec);
>  	return err;
>  }
>  
>  static void mlx5_esw_indir_table_rule_put(struct mlx5_eswitch *esw,
> -					  struct mlx5_flow_attr *attr,
>  					  struct mlx5_esw_indir_table_entry *e)
>  {
> -	struct mlx5_esw_flow_attr *esw_attr = attr->esw_attr;
> +	struct mlx5_esw_indir_table_rule *rule = e->recirc_rule;
>  	struct mlx5_fs_chains *chains = esw_chains(esw);
> -	struct mlx5_esw_indir_table_rule *rule;
>  
> -	list_for_each_entry(rule, &e->recirc_rules, list)
> -		if (rule->vni == esw_attr->rx_tun_attr->vni &&
> -		    !memcmp(&rule->dst_ip, &esw_attr->rx_tun_attr->dst_ip,
> -			    sizeof(esw_attr->rx_tun_attr->dst_ip)))
> -			goto found;
> -
> -	return;
> +	if (!rule)
> +		return;
>  
> -found:
>  	if (!refcount_dec_and_test(&rule->refcnt))
>  		return;
>  
>  	mlx5_del_flow_rules(rule->handle);
>  	mlx5_chains_put_table(chains, 0, 1, 0);
>  	mlx5_modify_header_dealloc(esw->dev, rule->mh);
> -	list_del(&rule->list);
>  	kfree(rule);
> -	e->recirc_cnt--;
> +	e->recirc_rule = NULL;
>  }
>  
> -static int mlx5_create_indir_recirc_group(struct mlx5_eswitch *esw,
> -					  struct mlx5_flow_attr *attr,
> -					  struct mlx5_flow_spec *spec,
> -					  struct mlx5_esw_indir_table_entry *e)
> +static int mlx5_create_indir_recirc_group(struct mlx5_esw_indir_table_entry *e)
>  {
>  	int err = 0, inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
> -	u32 *in, *match;
> +	u32 *in;
>  
>  	in = kvzalloc(inlen, GFP_KERNEL);
>  	if (!in)
>  		return -ENOMEM;
>  
> -	MLX5_SET(create_flow_group_in, in, match_criteria_enable, MLX5_MATCH_OUTER_HEADERS |
> -		 MLX5_MATCH_MISC_PARAMETERS | MLX5_MATCH_MISC_PARAMETERS_2);
> -	match = MLX5_ADDR_OF(create_flow_group_in, in, match_criteria);
> -
> -	if (MLX5_CAP_FLOWTABLE_NIC_RX(esw->dev, ft_field_support.outer_ip_version))
> -		MLX5_SET(fte_match_param, match, outer_headers.ip_version, 0xf);
> -	else
> -		MLX5_SET_TO_ONES(fte_match_param, match, outer_headers.ethertype);
> -
> -	if (attr->ip_version == 4) {
> -		MLX5_SET_TO_ONES(fte_match_param, match,
> -				 outer_headers.dst_ipv4_dst_ipv6.ipv4_layout.ipv4);
> -	} else if (attr->ip_version == 6) {
> -		memset(MLX5_ADDR_OF(fte_match_param, match,
> -				    outer_headers.dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
> -		       0xff, sizeof(struct in6_addr));
> -	} else {
> -		err = -EOPNOTSUPP;
> -		goto out;
> -	}
> -
> -	MLX5_SET_TO_ONES(fte_match_param, match, misc_parameters.vxlan_vni);
> -	MLX5_SET(fte_match_param, match, misc_parameters_2.metadata_reg_c_0,
> -		 mlx5_eswitch_get_vport_metadata_mask());
>  	MLX5_SET(create_flow_group_in, in, start_flow_index, 0);
> -	MLX5_SET(create_flow_group_in, in, end_flow_index, MLX5_ESW_INDIR_TABLE_RECIRC_IDX_MAX);
> +	MLX5_SET(create_flow_group_in, in, end_flow_index, MLX5_ESW_INDIR_TABLE_RECIRC_IDX);
>  	e->recirc_grp = mlx5_create_flow_group(e->ft, in);
> -	if (IS_ERR(e->recirc_grp)) {
> +	if (IS_ERR(e->recirc_grp))
>  		err = PTR_ERR(e->recirc_grp);
> -		goto out;
> -	}
>  
> -	INIT_LIST_HEAD(&e->recirc_rules);
> -	e->recirc_cnt = 0;
> -
> -out:
>  	kvfree(in);
>  	return err;
>  }
> @@ -366,6 +235,7 @@ static int mlx5_create_indir_fwd_group(struct mlx5_eswitch *esw,
>  	}
>  
>  	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
> +	flow_act.fg = e->fwd_grp;
>  	dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
>  	dest.vport.num = e->vport;
>  	dest.vport.vhca_id = MLX5_CAP_GEN(esw->dev, vhca_id);
> @@ -384,7 +254,7 @@ static int mlx5_create_indir_fwd_group(struct mlx5_eswitch *esw,
>  
>  static struct mlx5_esw_indir_table_entry *
>  mlx5_esw_indir_table_entry_create(struct mlx5_eswitch *esw, struct mlx5_flow_attr *attr,
> -				  struct mlx5_flow_spec *spec, u16 vport, bool decap)
> +				  u16 vport, bool decap)
>  {
>  	struct mlx5_flow_table_attr ft_attr = {};
>  	struct mlx5_flow_namespace *root_ns;
> @@ -412,15 +282,14 @@ mlx5_esw_indir_table_entry_create(struct mlx5_eswitch *esw, struct mlx5_flow_att
>  	}
>  	e->ft = ft;
>  	e->vport = vport;
> -	e->ip_version = attr->ip_version;
>  	e->fwd_ref = !decap;
>  
> -	err = mlx5_create_indir_recirc_group(esw, attr, spec, e);
> +	err = mlx5_create_indir_recirc_group(e);
>  	if (err)
>  		goto recirc_grp_err;
>  
>  	if (decap) {
> -		err = mlx5_esw_indir_table_rule_get(esw, attr, spec, e);
> +		err = mlx5_esw_indir_table_rule_get(esw, attr, e);
>  		if (err)
>  			goto recirc_rule_err;
>  	}
> @@ -430,13 +299,13 @@ mlx5_esw_indir_table_entry_create(struct mlx5_eswitch *esw, struct mlx5_flow_att
>  		goto fwd_grp_err;
>  
>  	hash_add(esw->fdb_table.offloads.indir->table, &e->hlist,
> -		 vport << 16 | attr->ip_version);
> +		 vport << 16);
>  
>  	return e;
>  
>  fwd_grp_err:
>  	if (decap)
> -		mlx5_esw_indir_table_rule_put(esw, attr, e);
> +		mlx5_esw_indir_table_rule_put(esw, e);
>  recirc_rule_err:
>  	mlx5_destroy_flow_group(e->recirc_grp);
>  recirc_grp_err:
> @@ -447,13 +316,13 @@ mlx5_esw_indir_table_entry_create(struct mlx5_eswitch *esw, struct mlx5_flow_att
>  }
>  
>  static struct mlx5_esw_indir_table_entry *
> -mlx5_esw_indir_table_entry_lookup(struct mlx5_eswitch *esw, u16 vport, u8 ip_version)
> +mlx5_esw_indir_table_entry_lookup(struct mlx5_eswitch *esw, u16 vport)
>  {
>  	struct mlx5_esw_indir_table_entry *e;
> -	u32 key = vport << 16 | ip_version;
> +	u32 key = vport << 16;
>  
>  	hash_for_each_possible(esw->fdb_table.offloads.indir->table, e, hlist, key)
> -		if (e->vport == vport && e->ip_version == ip_version)
> +		if (e->vport == vport)
>  			return e;
>  
>  	return NULL;
> @@ -461,24 +330,23 @@ mlx5_esw_indir_table_entry_lookup(struct mlx5_eswitch *esw, u16 vport, u8 ip_ver
>  
>  struct mlx5_flow_table *mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
>  						 struct mlx5_flow_attr *attr,
> -						 struct mlx5_flow_spec *spec,
>  						 u16 vport, bool decap)
>  {
>  	struct mlx5_esw_indir_table_entry *e;
>  	int err;
>  
>  	mutex_lock(&esw->fdb_table.offloads.indir->lock);
> -	e = mlx5_esw_indir_table_entry_lookup(esw, vport, attr->ip_version);
> +	e = mlx5_esw_indir_table_entry_lookup(esw, vport);
>  	if (e) {
>  		if (!decap) {
>  			e->fwd_ref++;
>  		} else {
> -			err = mlx5_esw_indir_table_rule_get(esw, attr, spec, e);
> +			err = mlx5_esw_indir_table_rule_get(esw, attr, e);
>  			if (err)
>  				goto out_err;
>  		}
>  	} else {
> -		e = mlx5_esw_indir_table_entry_create(esw, attr, spec, vport, decap);
> +		e = mlx5_esw_indir_table_entry_create(esw, attr, vport, decap);
>  		if (IS_ERR(e)) {
>  			err = PTR_ERR(e);
>  			esw_warn(esw->dev, "Failed to create indirection table, err %d.\n", err);
> @@ -494,22 +362,21 @@ struct mlx5_flow_table *mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
>  }
>  
>  void mlx5_esw_indir_table_put(struct mlx5_eswitch *esw,
> -			      struct mlx5_flow_attr *attr,
>  			      u16 vport, bool decap)
>  {
>  	struct mlx5_esw_indir_table_entry *e;
>  
>  	mutex_lock(&esw->fdb_table.offloads.indir->lock);
> -	e = mlx5_esw_indir_table_entry_lookup(esw, vport, attr->ip_version);
> +	e = mlx5_esw_indir_table_entry_lookup(esw, vport);
>  	if (!e)
>  		goto out;
>  
>  	if (!decap)
>  		e->fwd_ref--;
>  	else
> -		mlx5_esw_indir_table_rule_put(esw, attr, e);
> +		mlx5_esw_indir_table_rule_put(esw, e);
>  
> -	if (e->fwd_ref || e->recirc_cnt)
> +	if (e->fwd_ref || e->recirc_rule)
>  		goto out;
>  
>  	hash_del(&e->hlist);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.h b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.h
> index 21d56b49d14b..036f5b3a341b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.h
> @@ -13,10 +13,8 @@ mlx5_esw_indir_table_destroy(struct mlx5_esw_indir_table *indir);
>  
>  struct mlx5_flow_table *mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
>  						 struct mlx5_flow_attr *attr,
> -						 struct mlx5_flow_spec *spec,
>  						 u16 vport, bool decap);
>  void mlx5_esw_indir_table_put(struct mlx5_eswitch *esw,
> -			      struct mlx5_flow_attr *attr,
>  			      u16 vport, bool decap);
>  
>  bool
> @@ -44,7 +42,6 @@ mlx5_esw_indir_table_destroy(struct mlx5_esw_indir_table *indir)
>  static inline struct mlx5_flow_table *
>  mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
>  			 struct mlx5_flow_attr *attr,
> -			 struct mlx5_flow_spec *spec,
>  			 u16 vport, bool decap)
>  {
>  	return ERR_PTR(-EOPNOTSUPP);
> @@ -52,7 +49,6 @@ mlx5_esw_indir_table_get(struct mlx5_eswitch *esw,
>  
>  static inline void
>  mlx5_esw_indir_table_put(struct mlx5_eswitch *esw,
> -			 struct mlx5_flow_attr *attr,
>  			 u16 vport, bool decap)
>  {
>  }
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index c981fa77f439..4eeb2dcbfc0f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -179,15 +179,14 @@ mlx5_eswitch_set_rule_source_port(struct mlx5_eswitch *esw,
>  
>  static int
>  esw_setup_decap_indir(struct mlx5_eswitch *esw,
> -		      struct mlx5_flow_attr *attr,
> -		      struct mlx5_flow_spec *spec)
> +		      struct mlx5_flow_attr *attr)
>  {
>  	struct mlx5_flow_table *ft;
>  
>  	if (!(attr->flags & MLX5_ATTR_FLAG_SRC_REWRITE))
>  		return -EOPNOTSUPP;
>  
> -	ft = mlx5_esw_indir_table_get(esw, attr, spec,
> +	ft = mlx5_esw_indir_table_get(esw, attr,
>  				      mlx5_esw_indir_table_decap_vport(attr), true);
>  	return PTR_ERR_OR_ZERO(ft);
>  }
> @@ -197,7 +196,7 @@ esw_cleanup_decap_indir(struct mlx5_eswitch *esw,
>  			struct mlx5_flow_attr *attr)
>  {
>  	if (mlx5_esw_indir_table_decap_vport(attr))
> -		mlx5_esw_indir_table_put(esw, attr,
> +		mlx5_esw_indir_table_put(esw,
>  					 mlx5_esw_indir_table_decap_vport(attr),
>  					 true);
>  }
> @@ -235,7 +234,6 @@ esw_setup_ft_dest(struct mlx5_flow_destination *dest,
>  		  struct mlx5_flow_act *flow_act,
>  		  struct mlx5_eswitch *esw,
>  		  struct mlx5_flow_attr *attr,
> -		  struct mlx5_flow_spec *spec,
>  		  int i)
>  {
>  	flow_act->flags |= FLOW_ACT_IGNORE_FLOW_LEVEL;
> @@ -243,7 +241,7 @@ esw_setup_ft_dest(struct mlx5_flow_destination *dest,
>  	dest[i].ft = attr->dest_ft;
>  
>  	if (mlx5_esw_indir_table_decap_vport(attr))
> -		return esw_setup_decap_indir(esw, attr, spec);
> +		return esw_setup_decap_indir(esw, attr);
>  	return 0;
>  }
>  
> @@ -298,7 +296,7 @@ static void esw_put_dest_tables_loop(struct mlx5_eswitch *esw, struct mlx5_flow_
>  			mlx5_chains_put_table(chains, 0, 1, 0);
>  		else if (mlx5_esw_indir_table_needed(esw, attr, esw_attr->dests[i].rep->vport,
>  						     esw_attr->dests[i].mdev))
> -			mlx5_esw_indir_table_put(esw, attr, esw_attr->dests[i].rep->vport,
> +			mlx5_esw_indir_table_put(esw, esw_attr->dests[i].rep->vport,
>  						 false);
>  }
>  
> @@ -384,7 +382,6 @@ esw_setup_indir_table(struct mlx5_flow_destination *dest,
>  		      struct mlx5_flow_act *flow_act,
>  		      struct mlx5_eswitch *esw,
>  		      struct mlx5_flow_attr *attr,
> -		      struct mlx5_flow_spec *spec,
>  		      bool ignore_flow_lvl,
>  		      int *i)
>  {
> @@ -399,7 +396,7 @@ esw_setup_indir_table(struct mlx5_flow_destination *dest,
>  			flow_act->flags |= FLOW_ACT_IGNORE_FLOW_LEVEL;
>  		dest[*i].type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
>  
> -		dest[*i].ft = mlx5_esw_indir_table_get(esw, attr, spec,
> +		dest[*i].ft = mlx5_esw_indir_table_get(esw, attr,
>  						       esw_attr->dests[j].rep->vport, false);
>  		if (IS_ERR(dest[*i].ft)) {
>  			err = PTR_ERR(dest[*i].ft);
> @@ -408,7 +405,7 @@ esw_setup_indir_table(struct mlx5_flow_destination *dest,
>  	}
>  
>  	if (mlx5_esw_indir_table_decap_vport(attr)) {
> -		err = esw_setup_decap_indir(esw, attr, spec);
> +		err = esw_setup_decap_indir(esw, attr);
>  		if (err)
>  			goto err_indir_tbl_get;
>  	}
> @@ -511,14 +508,14 @@ esw_setup_dests(struct mlx5_flow_destination *dest,
>  		err = esw_setup_mtu_dest(dest, &attr->meter_attr, *i);
>  		(*i)++;
>  	} else if (esw_is_indir_table(esw, attr)) {
> -		err = esw_setup_indir_table(dest, flow_act, esw, attr, spec, true, i);
> +		err = esw_setup_indir_table(dest, flow_act, esw, attr, true, i);
>  	} else if (esw_is_chain_src_port_rewrite(esw, esw_attr)) {
>  		err = esw_setup_chain_src_port_rewrite(dest, flow_act, esw, chains, attr, i);
>  	} else {
>  		*i = esw_setup_vport_dests(dest, flow_act, esw, esw_attr, *i);
>  
>  		if (attr->dest_ft) {
> -			err = esw_setup_ft_dest(dest, flow_act, esw, attr, spec, *i);
> +			err = esw_setup_ft_dest(dest, flow_act, esw, attr, *i);
>  			(*i)++;
>  		} else if (attr->dest_chain) {
>  			err = esw_setup_chain_dest(dest, flow_act, chains, attr->dest_chain,
> @@ -727,7 +724,7 @@ mlx5_eswitch_add_fwd_rule(struct mlx5_eswitch *esw,
>  	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
>  	for (i = 0; i < esw_attr->split_count; i++) {
>  		if (esw_is_indir_table(esw, attr))
> -			err = esw_setup_indir_table(dest, &flow_act, esw, attr, spec, false, &i);
> +			err = esw_setup_indir_table(dest, &flow_act, esw, attr, false, &i);
>  		else if (esw_is_chain_src_port_rewrite(esw, esw_attr))
>  			err = esw_setup_chain_src_port_rewrite(dest, &flow_act, esw, chains, attr,
>  							       &i);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 14/15] net/mlx5e: Remove redundant allocation of spec in create indirect fwd group
  2023-01-18 18:36 ` [net-next 14/15] net/mlx5e: Remove redundant allocation of spec in create indirect fwd group Saeed Mahameed
@ 2023-01-18 21:50   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:50 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Maor Dickman, Roi Dayan



On 1/18/2023 10:36 AM, Saeed Mahameed wrote:
> From: Maor Dickman <maord@nvidia.com>
> 
> mlx5_add_flow_rules supports creating rules without any matches by passing NULL
> pointer instead of spec, if NULL is passed it will use a static empty spec.
> This make allocation of spec in mlx5_create_indir_fwd_group unnecessary.
> 
> Remove the redundant allocation.
> 

Yep. The code to handle this was added in 5c2aa8ae3a2c ("net/mlx5:
Accept flow rules without match").

Makes sense, saves an allocation and the zero_spec struct used was
already/always on the stack so no change there.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> Signed-off-by: Maor Dickman <maord@nvidia.com>
> Reviewed-by: Roi Dayan <roid@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/esw/indir_table.c  | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
> index 8a94870c5b43..9959e9fd15a1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/indir_table.c
> @@ -212,19 +212,12 @@ static int mlx5_create_indir_fwd_group(struct mlx5_eswitch *esw,
>  	int err = 0, inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
>  	struct mlx5_flow_destination dest = {};
>  	struct mlx5_flow_act flow_act = {};
> -	struct mlx5_flow_spec *spec;
>  	u32 *in;
>  
>  	in = kvzalloc(inlen, GFP_KERNEL);
>  	if (!in)
>  		return -ENOMEM;
>  
> -	spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
> -	if (!spec) {
> -		kvfree(in);
> -		return -ENOMEM;
> -	}
> -
>  	/* Hold one entry */
>  	MLX5_SET(create_flow_group_in, in, start_flow_index, MLX5_ESW_INDIR_TABLE_FWD_IDX);
>  	MLX5_SET(create_flow_group_in, in, end_flow_index, MLX5_ESW_INDIR_TABLE_FWD_IDX);
> @@ -240,14 +233,13 @@ static int mlx5_create_indir_fwd_group(struct mlx5_eswitch *esw,
>  	dest.vport.num = e->vport;
>  	dest.vport.vhca_id = MLX5_CAP_GEN(esw->dev, vhca_id);
>  	dest.vport.flags = MLX5_FLOW_DEST_VPORT_VHCA_ID;
> -	e->fwd_rule = mlx5_add_flow_rules(e->ft, spec, &flow_act, &dest, 1);
> +	e->fwd_rule = mlx5_add_flow_rules(e->ft, NULL, &flow_act, &dest, 1);
>  	if (IS_ERR(e->fwd_rule)) {
>  		mlx5_destroy_flow_group(e->fwd_grp);
>  		err = PTR_ERR(e->fwd_rule);
>  	}
>  
>  err_out:
> -	kvfree(spec);
>  	kvfree(in);
>  	return err;
>  }

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 15/15] net/mlx5e: Use read lock for eswitch get callbacks
  2023-01-18 18:36 ` [net-next 15/15] net/mlx5e: Use read lock for eswitch get callbacks Saeed Mahameed
@ 2023-01-18 21:51   ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-18 21:51 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Leon Romanovsky



On 1/18/2023 10:36 AM, Saeed Mahameed wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> In commit 367dfa121205 ("net/mlx5: Remove devl_unlock from
> mlx5_eswtich_mode_callback_enter") all functions were converted
> to use write lock without relation to their actual purpose.
> 
> Change the devlink eswitch getters to use read and not write locks.
> 
Makes sense.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-18 21:33   ` Jacob Keller
@ 2023-01-20  3:46     ` Jakub Kicinski
  2023-01-20  3:56       ` Rahul Rameshbabu
  0 siblings, 1 reply; 59+ messages in thread
From: Jakub Kicinski @ 2023-01-20  3:46 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Rahul Rameshbabu,
	Gal Pressman, Richard Cochran

On Wed, 18 Jan 2023 13:33:56 -0800 Jacob Keller wrote:
> > The adjtime function supports using hardware to set the clock offset when
> > the delta was supported by the hardware. When the delta is not supported by
> > the hardware, the driver handles adjusting the clock. The newly-introduced
> > adjphase function is similar to the adjtime function, except it guarantees
> > that a provided clock offset will be used directly by the hardware to
> > adjust the PTP clock. When the range is not acceptable by the hardware, an
> > error is returned.
> 
> Makes sense. Once you've verified that the delta is within the accepted
> range you can just re-use the existing adjtime function.

Seems like we should add a "max_time_adj" to struct ptp_clock_info
and let the core call adjphase if the offset is small enough to fit.
Instead of having all drivers redirect the calls internally.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20  3:46     ` Jakub Kicinski
@ 2023-01-20  3:56       ` Rahul Rameshbabu
  2023-01-20  4:03         ` Jakub Kicinski
  0 siblings, 1 reply; 59+ messages in thread
From: Rahul Rameshbabu @ 2023-01-20  3:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jacob Keller, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran

On Thu, 19 Jan, 2023 19:46:31 -0800 Jakub Kicinski <kuba@kernel.org> wrote:
> On Wed, 18 Jan 2023 13:33:56 -0800 Jacob Keller wrote:
>> > The adjtime function supports using hardware to set the clock offset when
>> > the delta was supported by the hardware. When the delta is not supported by
>> > the hardware, the driver handles adjusting the clock. The newly-introduced
>> > adjphase function is similar to the adjtime function, except it guarantees
>> > that a provided clock offset will be used directly by the hardware to
>> > adjust the PTP clock. When the range is not acceptable by the hardware, an
>> > error is returned.
>> 
>> Makes sense. Once you've verified that the delta is within the accepted
>> range you can just re-use the existing adjtime function.
>
> Seems like we should add a "max_time_adj" to struct ptp_clock_info
> and let the core call adjphase if the offset is small enough to fit.
> Instead of having all drivers redirect the calls internally.

With guidance from Saeed on this topic, I have a patch in the works for
advertising the max phase adjustment supported by a driver through the
use of the PTP_CLOCK_GETCAPS ioctl. This is how the ptp stack handles
advertising the max frequency supported by a driver today. In linuxptp,
this ioctl is wrapped in a function call for getting the max frequency
adjustment supported by a device before ptp is actually run. I believe a
similar logic should occur for phase (time) adjustments. This patch
would introduce a "max_phase_adj" in ptp_clock_info that would be
handled in ptp_clock_adjtime in the ptp core stack.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB
  2023-01-18 18:35 ` [net-next 01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB Saeed Mahameed
  2023-01-18 21:31   ` Jacob Keller
@ 2023-01-20  4:00   ` patchwork-bot+netdevbpf
  1 sibling, 0 replies; 59+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-01-20  4:00 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: davem, kuba, pabeni, edumazet, saeedm, netdev, tariqt, rrameshbabu, lkp

Hello:

This series was applied to netdev/net-next.git (master)
by Saeed Mahameed <saeedm@nvidia.com>:

On Wed, 18 Jan 2023 10:35:48 -0800 you wrote:
> From: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> 
> Send WQEBB size is 64 bytes and the max number of WQEBBs for an SQ is 255.
> For 16KB pages and greater, there is always sufficient spaces for all
> WQEBBs of an SQ. Cast mlx5e_get_max_sq_wqebbs(mdev) to u16. Prevents
> -Wtautological-constant-out-of-range-compare warnings from occurring when
> PAGE_SIZE >= 16KB.
> 
> [...]

Here is the summary with links:
  - [net-next,01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB
    https://git.kernel.org/netdev/net-next/c/022dbea0ea8e
  - [net-next,02/15] net/mlx5: Suppress error logging on UCTX creation
    https://git.kernel.org/netdev/net-next/c/d0f332dc9689
  - [net-next,03/15] net/mlx5: Add adjphase function to support hardware-only offset control
    https://git.kernel.org/netdev/net-next/c/8e11a68e2e8a
  - [net-next,04/15] net/mlx5: Add hardware extended range support for PTP adjtime and adjphase
    https://git.kernel.org/netdev/net-next/c/d3c8a33a5cad
  - [net-next,05/15] net/mlx5: E-switch, Remove redundant comment about meta rules
    https://git.kernel.org/netdev/net-next/c/1158b7d1c640
  - [net-next,06/15] net/mlx5e: Fail with messages when params are not valid for XSK
    https://git.kernel.org/netdev/net-next/c/130b12079f37
  - [net-next,07/15] net/mlx5e: Add warning when log WQE size is smaller than log stride size
    https://git.kernel.org/netdev/net-next/c/b80ae281277f
  - [net-next,08/15] net/mlx5e: TC, Pass flow attr to attach/detach mod hdr functions
    https://git.kernel.org/netdev/net-next/c/82b564802661
  - [net-next,09/15] net/mlx5e: TC, Add tc prefix to attach/detach hdr functions
    https://git.kernel.org/netdev/net-next/c/c43182e6db32
  - [net-next,10/15] net/mlx5e: TC, Use common function allocating flow mod hdr or encap mod hdr
    https://git.kernel.org/netdev/net-next/c/ef78b8d5d6f1
  - [net-next,11/15] net/mlx5e: Warn when destroying mod hdr hash table that is not empty
    https://git.kernel.org/netdev/net-next/c/2a1f4fed392b
  - [net-next,12/15] net/mlx5: E-Switch, Fix typo for egress
    https://git.kernel.org/netdev/net-next/c/55b458481d68
  - [net-next,13/15] net/mlx5e: Support Geneve and GRE with VF tunnel offload
    https://git.kernel.org/netdev/net-next/c/521933cdc4aa
  - [net-next,14/15] net/mlx5e: Remove redundant allocation of spec in create indirect fwd group
    https://git.kernel.org/netdev/net-next/c/42cd20044e85
  - [net-next,15/15] net/mlx5e: Use read lock for eswitch get callbacks
    https://git.kernel.org/netdev/net-next/c/efb4879f7623

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20  3:56       ` Rahul Rameshbabu
@ 2023-01-20  4:03         ` Jakub Kicinski
  2023-01-20  4:26           ` Rahul Rameshbabu
  0 siblings, 1 reply; 59+ messages in thread
From: Jakub Kicinski @ 2023-01-20  4:03 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: Jacob Keller, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran

On Thu, 19 Jan 2023 19:56:24 -0800 Rahul Rameshbabu wrote:
> >> Makes sense. Once you've verified that the delta is within the accepted
> >> range you can just re-use the existing adjtime function.  
> >
> > Seems like we should add a "max_time_adj" to struct ptp_clock_info
> > and let the core call adjphase if the offset is small enough to fit.
> > Instead of having all drivers redirect the calls internally.  
> 
> With guidance from Saeed on this topic, I have a patch in the works for
> advertising the max phase adjustment supported by a driver through the
> use of the PTP_CLOCK_GETCAPS ioctl. This is how the ptp stack handles
> advertising the max frequency supported by a driver today. In linuxptp,
> this ioctl is wrapped in a function call for getting the max frequency
> adjustment supported by a device before ptp is actually run. I believe a
> similar logic should occur for phase (time) adjustments. This patch
> would introduce a "max_phase_adj" in ptp_clock_info that would be
> handled in ptp_clock_adjtime in the ptp core stack.

Nice, can we make the core also call ->adjtime automatically if driver
doesn't define ->adjphase and the abs(delta) < .max_phase_adj ?

The other question is about the exact semantics of ->adjphase
- do all timecounter based clock implementations support it
by definition?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20  4:03         ` Jakub Kicinski
@ 2023-01-20  4:26           ` Rahul Rameshbabu
  2023-01-20  5:08             ` Jakub Kicinski
  2023-01-20 17:21             ` Jacob Keller
  0 siblings, 2 replies; 59+ messages in thread
From: Rahul Rameshbabu @ 2023-01-20  4:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jacob Keller, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng

On Thu, 19 Jan, 2023 20:03:43 -0800 Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 19 Jan 2023 19:56:24 -0800 Rahul Rameshbabu wrote:
>> >> Makes sense. Once you've verified that the delta is within the accepted
>> >> range you can just re-use the existing adjtime function.  
>> >
>> > Seems like we should add a "max_time_adj" to struct ptp_clock_info
>> > and let the core call adjphase if the offset is small enough to fit.
>> > Instead of having all drivers redirect the calls internally.  
>> 
>> With guidance from Saeed on this topic, I have a patch in the works for
>> advertising the max phase adjustment supported by a driver through the
>> use of the PTP_CLOCK_GETCAPS ioctl. This is how the ptp stack handles
>> advertising the max frequency supported by a driver today. In linuxptp,
>> this ioctl is wrapped in a function call for getting the max frequency
>> adjustment supported by a device before ptp is actually run. I believe a
>> similar logic should occur for phase (time) adjustments. This patch
>> would introduce a "max_phase_adj" in ptp_clock_info that would be
>> handled in ptp_clock_adjtime in the ptp core stack.
>
> Nice, can we make the core also call ->adjtime automatically if driver
> doesn't define ->adjphase and the abs(delta) < .max_phase_adj ?

One of my concerns with doing this is breaking userspace expectations.
In linuxptp, there is a configuration setting "write_phase_mode" and an
expectation that when adjphase is called, there will not be a fallback
to adjtime. This because adjphase is used in situations where small fine
tuning is explicitly needed, so the errors would indicate a logical or
situational error.

Quoting Vincent Cheng, the author of the adjphase functionality in the
ptp core stack.

-----BEGIN QUOTE-----
  adjtime modifies HW counter with a value to move the 1 PPS abruptly to new location.
  adjphase modifies the frequency to quickly nudge the 1 PPS to new location and also includes a HW filter to smooth out the adjustments and fine tune frequency.

  Continuous small offset adjustments using adjtime, likley see sudden shifts of the 1 PPS.  The 1 PPS probably disappears and re-appears.
  Continuous small offset adjustments using adjphase, should see continuous 1 PPS.

  adjtime is good for large offset corrections
  adjphase is good for small offset corrections to allow HW filter to control the frequency instead of relying on SW filter.
-----END QUOTE-----

https://lore.kernel.org/netdev/20220804132902.GA25315@renesas.com/

Using the mlx5 implementation as a reference, the mlx5 stack provides
the adjphase offset to the device. The device is then able to make the
sudden shift small offsets internally using the device's internal
functionality (offload). For the larger values that are handled using
adjtime, our devices perform just fine when we get the time from the
device, offset that time in the driver stack, and write back the new
time to the device.

>
> The other question is about the exact semantics of ->adjphase
> - do all timecounter based clock implementations support it
> by definition?

My understanding is no (though anyone is free to jump in to correct me
on this). Only implementations with support for precisely handling small
PPS corrections can support adjphase (being able to adjust small offsets
without causing same or worse drift).

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20  4:26           ` Rahul Rameshbabu
@ 2023-01-20  5:08             ` Jakub Kicinski
  2023-01-20  5:22               ` Rahul Rameshbabu
  2023-01-20 17:21             ` Jacob Keller
  1 sibling, 1 reply; 59+ messages in thread
From: Jakub Kicinski @ 2023-01-20  5:08 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: Jacob Keller, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng

On Thu, 19 Jan 2023 20:26:04 -0800 Rahul Rameshbabu wrote:
> One of my concerns with doing this is breaking userspace expectations.
> In linuxptp, there is a configuration setting "write_phase_mode" and an
> expectation that when adjphase is called, there will not be a fallback
> to adjtime. This because adjphase is used in situations where small fine
> tuning is explicitly needed, so the errors would indicate a logical or
> situational error.

I don't mean fallback - just do what you do in mlx5 directly in 
the core. The driver already does:

if delta < MAX
	use precise method
else
	use coarse method

> Quoting Vincent Cheng, the author of the adjphase functionality in the
> ptp core stack.
> 
> -----BEGIN QUOTE-----
>   adjtime modifies HW counter with a value to move the 1 PPS abruptly to new location.
>   adjphase modifies the frequency to quickly nudge the 1 PPS to new location and also includes a HW filter to smooth out the adjustments and fine tune frequency.
> 
>   Continuous small offset adjustments using adjtime, likley see sudden shifts of the 1 PPS.  The 1 PPS probably disappears and re-appears.
>   Continuous small offset adjustments using adjphase, should see continuous 1 PPS.
> 
>   adjtime is good for large offset corrections
>   adjphase is good for small offset corrections to allow HW filter to control the frequency instead of relying on SW filter.

Hm, so are you saying that:

adjtime(delta):
	clock += delta

but:

adjfreq(delta):
	on clock tick & while delta > 0:
		clock += small_value
		delta -= small_value

because from looking at mlx5 driver code its unclear whether the
implementation does a precise but one shot adjustment or gradual
adjustments.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20  5:08             ` Jakub Kicinski
@ 2023-01-20  5:22               ` Rahul Rameshbabu
  2023-01-20  5:45                 ` Jakub Kicinski
  0 siblings, 1 reply; 59+ messages in thread
From: Rahul Rameshbabu @ 2023-01-20  5:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jacob Keller, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng

On Thu, 19 Jan, 2023 21:08:42 -0800 Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 19 Jan 2023 20:26:04 -0800 Rahul Rameshbabu wrote:
>> One of my concerns with doing this is breaking userspace expectations.
>> In linuxptp, there is a configuration setting "write_phase_mode" and an
>> expectation that when adjphase is called, there will not be a fallback
>> to adjtime. This because adjphase is used in situations where small fine
>> tuning is explicitly needed, so the errors would indicate a logical or
>> situational error.
>
> I don't mean fallback - just do what you do in mlx5 directly in 
> the core. The driver already does:
>
> if delta < MAX
> 	use precise method
> else
> 	use coarse method

Oh, I see. I was thinking we were discussing ADJ_OFFSET in the
ptp_clock_adjtime function in the ptp core stack. The suggestion you are
proposing would be for the ADJ_SETOFFSET operation, and I agree with
this. Thanks for the clarification.

>
>> Quoting Vincent Cheng, the author of the adjphase functionality in the
>> ptp core stack.
>> 
>> -----BEGIN QUOTE-----
>>   adjtime modifies HW counter with a value to move the 1 PPS abruptly to new location.
>>   adjphase modifies the frequency to quickly nudge the 1 PPS to new location
>> and also includes a HW filter to smooth out the adjustments and fine tune
>> frequency.
>> 
>>   Continuous small offset adjustments using adjtime, likley see sudden shifts
>> of the 1 PPS. The 1 PPS probably disappears and re-appears.
>>   Continuous small offset adjustments using adjphase, should see continuous 1 PPS.
>> 
>>   adjtime is good for large offset corrections
>>   adjphase is good for small offset corrections to allow HW filter to control
>> the frequency instead of relying on SW filter.
>
> Hm, so are you saying that:
>
> adjtime(delta):
> 	clock += delta
>
> but:
>
> adjfreq(delta):

Did you mean adjphase here?

> 	on clock tick & while delta > 0:
> 		clock += small_value
> 		delta -= small_value
>
> because from looking at mlx5 driver code its unclear whether the
> implementation does a precise but one shot adjustment or gradual
> adjustments.

The pseudo code your drafted is accurate otherwise. The lack of clarity
in our driver comes from leaving the responsibility of that smooth
gradual transition (to keep in sync with the clock frequency while
running) up to the device.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20  5:22               ` Rahul Rameshbabu
@ 2023-01-20  5:45                 ` Jakub Kicinski
  2023-01-20  6:02                   ` Rahul Rameshbabu
  0 siblings, 1 reply; 59+ messages in thread
From: Jakub Kicinski @ 2023-01-20  5:45 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: Jacob Keller, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng

On Thu, 19 Jan 2023 21:22:00 -0800 Rahul Rameshbabu wrote:
> > Hm, so are you saying that:
> >
> > adjtime(delta):
> > 	clock += delta
> >
> > but:
> >
> > adjfreq(delta):  
> 
> Did you mean adjphase here?

Yes, sorry

> > 	on clock tick & while delta > 0:
> > 		clock += small_value
> > 		delta -= small_value
> >
> > because from looking at mlx5 driver code its unclear whether the
> > implementation does a precise but one shot adjustment or gradual
> > adjustments.  
> 
> The pseudo code your drafted is accurate otherwise. The lack of clarity
> in our driver comes from leaving the responsibility of that smooth
> gradual transition (to keep in sync with the clock frequency while
> running) up to the device.

Ah, I see. That makes sense. This is how system clocks react too 
or is it a local PTP invention? I think it may be worth documenting 
in ptp_clock_kernel.h, most driver authors may not understand 
the difference between adjtime and adjphase :(

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20  5:45                 ` Jakub Kicinski
@ 2023-01-20  6:02                   ` Rahul Rameshbabu
  0 siblings, 0 replies; 59+ messages in thread
From: Rahul Rameshbabu @ 2023-01-20  6:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jacob Keller, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng

On Thu, 19 Jan, 2023 21:45:16 -0800 Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 19 Jan 2023 21:22:00 -0800 Rahul Rameshbabu wrote:
>> > Hm, so are you saying that:
>> >
>> > adjtime(delta):
>> > 	clock += delta
>> >
>> > but:
>> >
>> > adjfreq(delta):  
>> 
>> Did you mean adjphase here?
>
> Yes, sorry
>
>> > 	on clock tick & while delta > 0:
>> > 		clock += small_value
>> > 		delta -= small_value
>> >
>> > because from looking at mlx5 driver code its unclear whether the
>> > implementation does a precise but one shot adjustment or gradual
>> > adjustments.  
>> 
>> The pseudo code your drafted is accurate otherwise. The lack of clarity
>> in our driver comes from leaving the responsibility of that smooth
>> gradual transition (to keep in sync with the clock frequency while
>> running) up to the device.
>
> Ah, I see. That makes sense. This is how system clocks react too 
> or is it a local PTP invention? I think it may be worth documenting 
> in ptp_clock_kernel.h, most driver authors may not understand 
> the difference between adjtime and adjphase :(

I think this concept was uniquely proposed for ptp core stack. I do not
think I have seen this explicit differenciation before (but maybe
internal implementations do take a similar approach).

I agree with improving the documentation of adjphase vs adjtime in
ptp_clock_kernel.h

 * @adjphase:  Adjusts the phase offset of the hardware clock.
 *             parameter delta: Desired change in nanoseconds.
 *
 * @adjtime:  Shifts the time of the hardware clock.
 *            parameter delta: Desired change in nanoseconds.

This doesn't really help much in terms of understanding the difference
of adjusting the phase vs shifting the time. The quote from Vincent I
provided in my previous email was in response to a question raised by
Aya Levin who was working on timing related features in the mlx5 driver.

https://lore.kernel.org/netdev/228ceba4-47a8-49ef-994a-fe898cdc7fc1@nvidia.com/

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20  4:26           ` Rahul Rameshbabu
  2023-01-20  5:08             ` Jakub Kicinski
@ 2023-01-20 17:21             ` Jacob Keller
  2023-01-20 18:00               ` Rahul Rameshbabu
  1 sibling, 1 reply; 59+ messages in thread
From: Jacob Keller @ 2023-01-20 17:21 UTC (permalink / raw)
  To: Rahul Rameshbabu, Jakub Kicinski
  Cc: Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng



On 1/19/2023 8:26 PM, Rahul Rameshbabu wrote:
> On Thu, 19 Jan, 2023 20:03:43 -0800 Jakub Kicinski <kuba@kernel.org> wrote:
>> The other question is about the exact semantics of ->adjphase
>> - do all timecounter based clock implementations support it
>> by definition?
> 
> My understanding is no (though anyone is free to jump in to correct me
> on this). Only implementations with support for precisely handling small
> PPS corrections can support adjphase (being able to adjust small offsets
> without causing same or worse drift).

I guess I'm missing something here? timecounters allow adjusting time in
an atomic way. They don't lose any time when making an adjustment
because its a change to the wrapping around a fixed cycle counter.

How does that not comply with adjphase? and if it doesn't, then whats
the difference between adjphase and just correcting offset using adjfine
for frequency adjustment?

I guess adjusting phase will do the small corrections in hardware
(perhaps by temporarily adjusting the nominal frequency of the clock)
but will then return to the normal frequency once complete?

So adjphase is more than just being atomic...?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20 17:21             ` Jacob Keller
@ 2023-01-20 18:00               ` Rahul Rameshbabu
  2023-01-20 23:58                 ` Jacob Keller
  0 siblings, 1 reply; 59+ messages in thread
From: Rahul Rameshbabu @ 2023-01-20 18:00 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Jakub Kicinski, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng

On Fri, 20 Jan, 2023 09:21:09 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 1/19/2023 8:26 PM, Rahul Rameshbabu wrote:
>> On Thu, 19 Jan, 2023 20:03:43 -0800 Jakub Kicinski <kuba@kernel.org> wrote:
>>> The other question is about the exact semantics of ->adjphase
>>> - do all timecounter based clock implementations support it
>>> by definition?
>> 
>> My understanding is no (though anyone is free to jump in to correct me
>> on this). Only implementations with support for precisely handling small
>> PPS corrections can support adjphase (being able to adjust small offsets
>> without causing same or worse drift).
>
> I guess I'm missing something here? timecounters allow adjusting time in
> an atomic way. They don't lose any time when making an adjustment
> because its a change to the wrapping around a fixed cycle counter.

If that's the case, wouldn't adjphase be implementable by every PTP
hardware clock device. I agree with the description here but not all
devices provide this atomic (phase offset control) functionality from my
understanding. More details in next section.

>
> How does that not comply with adjphase? and if it doesn't, then whats
> the difference between adjphase and just correcting offset using adjfine
> for frequency adjustment?

  Some PTP hardware clocks have a write phase mode that has a built-in
  hardware filtering capability. The write phase mode utilizes a phase
  offset control word instead of a frequency offset control word.

The above is from the original patch description for adjphase. My
understanding is that some PTP hardware clocks do not implement the
phase control word. Hence why I responded with no in my original post as
in no, not all PTP hardware devices are expected to support this
capability.

>
> I guess adjusting phase will do the small corrections in hardware
> (perhaps by temporarily adjusting the nominal frequency of the clock)
> but will then return to the normal frequency once complete?
>
> So adjphase is more than just being atomic...?

I assumed the phase control word is more than just a simple one-shot
(atomic) time offset done in the hardware (at least internally what the
hardware does when implementing this control word).

  adjtime modifies HW counter with a value to move the 1 PPS abruptly to new location.
  adjphase modifies the frequency to quickly nudge the 1 PPS to new location and also includes a HW filter to smooth out the adjustments and fine tune frequency.

  Continuous small offset adjustments using adjtime, likley see sudden shifts of the 1 PPS.  The 1 PPS probably disappears and re-appears.
  Continuous small offset adjustments using adjphase, should see continuous 1 PPS.

https://lore.kernel.org/lkml/20220804132902.GA25315@renesas.com/

Looking at the mlx5 driver implementation, it does look like a simple
atomic operation. That said, I wouldn't consider myself a ptp architect
and am mostly going off based on what I read from the patch series that
introduced the capability in the ptp core stack in the kernel. I do
think this needs to get updated in ptp_clock_kernel.h.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20 18:00               ` Rahul Rameshbabu
@ 2023-01-20 23:58                 ` Jacob Keller
  2023-01-21  0:06                   ` Jakub Kicinski
  2023-01-23  2:15                   ` Richard Cochran
  0 siblings, 2 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-20 23:58 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: Jakub Kicinski, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng



On 1/20/2023 10:00 AM, Rahul Rameshbabu wrote:
> I assumed the phase control word is more than just a simple one-shot
> (atomic) time offset done in the hardware (at least internally what the
> hardware does when implementing this control word).
> 
>   adjtime modifies HW counter with a value to move the 1 PPS abruptly to new location.
>   adjphase modifies the frequency to quickly nudge the 1 PPS to new location and also includes a HW filter to smooth out the adjustments and fine tune frequency.
> 
>   Continuous small offset adjustments using adjtime, likley see sudden shifts of the 1 PPS.  The 1 PPS probably disappears and re-appears.
>   Continuous small offset adjustments using adjphase, should see continuous 1 PPS.
> 

Sure. I guess what I don't understand is why or when one would want to
use adjphase instead of just performing frequency adjustment via the
.adjfine operation...

Especially since "gradual adjustment over time" means it will still be
slow to converge (just like adjusting frequency is today).

We should definitely improve the doc to explain the diff between them
and make sure that its more clear to driver implementations.

It also makes it harder to justify mapping small .adjtime to .adjphase,
as it seems like .adjphase isn't required to adjust the offset
immediately. Perhaps the adjustment size is small enough that isn't a
big problem?

Thanks,
Jake

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20 23:58                 ` Jacob Keller
@ 2023-01-21  0:06                   ` Jakub Kicinski
  2023-01-22 21:11                     ` Rahul Rameshbabu
  2023-01-23  2:15                   ` Richard Cochran
  1 sibling, 1 reply; 59+ messages in thread
From: Jakub Kicinski @ 2023-01-21  0:06 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Rahul Rameshbabu, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng

On Fri, 20 Jan 2023 15:58:25 -0800 Jacob Keller wrote:
> Sure. I guess what I don't understand is why or when one would want to
> use adjphase instead of just performing frequency adjustment via the
> .adjfine operation...
> 
> Especially since "gradual adjustment over time" means it will still be
> slow to converge (just like adjusting frequency is today).
> 
> We should definitely improve the doc to explain the diff between them
> and make sure that its more clear to driver implementations.

Fair point, I assumed that the adjustment is precise (as in the device
is programmed with both the extra addend and number of cycles over
which to use it). Yielding an exact adjustment.

But then again, I also thought that .adjtime is supposed to be a precise
single-shot nudge, while most drivers just do:

time = read_time()
time += delta
write_time(time)

:S  So yeah, let's document..

> It also makes it harder to justify mapping small .adjtime to .adjphase,
> as it seems like .adjphase isn't required to adjust the offset
> immediately. Perhaps the adjustment size is small enough that isn't a
> big problem?


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-21  0:06                   ` Jakub Kicinski
@ 2023-01-22 21:11                     ` Rahul Rameshbabu
  2023-01-23  2:22                       ` Richard Cochran
  0 siblings, 1 reply; 59+ messages in thread
From: Rahul Rameshbabu @ 2023-01-22 21:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jacob Keller, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Richard Cochran, Vincent Cheng

Held off on replying any further till I verified my understanding with a
PTP architect on our side. Looks like my understanding was correct.

On Fri, 20 Jan, 2023 16:06:09 -0800 Jakub Kicinski <kuba@kernel.org> wrote:
> On Fri, 20 Jan 2023 15:58:25 -0800 Jacob Keller wrote:
>> Sure. I guess what I don't understand is why or when one would want to
>> use adjphase instead of just performing frequency adjustment via the
>> .adjfine operation...
>> 
>> Especially since "gradual adjustment over time" means it will still be
>> slow to converge (just like adjusting frequency is today).
>> 
>> We should definitely improve the doc to explain the diff between them
>> and make sure that its more clear to driver implementations.

Our PTP architect also was subtly confused by the differences of adjtime
and adjphase when the adjphase patch was first introduced. Vincent's
follow up helped with better understanding the difference.

The way NVIDIA devices internally handle adjphase is to adjust the
frequency for a short period of time to make the small time offset
adjustments smooth (using some internal calculations) where the time
offset "nudge" is applied but frequency is also adjusted to prevent
immediate drift after that time adjustment. However, we aren't sure if
this is the only approach possible to achieve accurate corrections for
small offset adjustments with adjphase, so I would suggest that the
documentation be updated to state something discussing that adjphase is
expected to support small jumps in offset precisely without necessarily
bringing up frequency manipulation potentially done to achieve this.

>
> Fair point, I assumed that the adjustment is precise (as in the device
> is programmed with both the extra addend and number of cycles over
> which to use it). Yielding an exact adjustment.
>
> But then again, I also thought that .adjtime is supposed to be a precise
> single-shot nudge, while most drivers just do:
>
> time = read_time()
> time += delta
> write_time(time)
>
> :S  So yeah, let's document..
>

Our architect used the following to compare adjtime and adjphase.

  time adjustment is a 'jump' in time while phase adjustment is a smooth correction.

I think the thing though is how that smooth correction for small offset
values is achieved is not very well defined/something the implementer
can define. Will update the comments for adjphase in my patch series
based on this.

>> It also makes it harder to justify mapping small .adjtime to .adjphase,
>> as it seems like .adjphase isn't required to adjust the offset
>> immediately. Perhaps the adjustment size is small enough that isn't a
>> big problem?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-20 23:58                 ` Jacob Keller
  2023-01-21  0:06                   ` Jakub Kicinski
@ 2023-01-23  2:15                   ` Richard Cochran
  2023-01-23 18:14                     ` Jacob Keller
  1 sibling, 1 reply; 59+ messages in thread
From: Richard Cochran @ 2023-01-23  2:15 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Rahul Rameshbabu, Jakub Kicinski, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Vincent Cheng

Wow, lots of confusion in this thread!

On Fri, Jan 20, 2023 at 03:58:25PM -0800, Jacob Keller wrote:

> Sure. I guess what I don't understand is why or when one would want to
> use adjphase instead of just performing frequency adjustment via the
> .adjfine operation...

The difference is in where the frequency adjust is calculated.  There
are two possibilities:

1. It may be done in user space.
   This is NTP timex's ADJ_FREQUENCY.
   PHC implements .adjfine

2. It may be left to kernel space.
   This is NTP timex's ADJ_OFFSET.
   PHC implements .adjphase

In case of #2, the hardware implements some kind of clock servo.  The
inputs to the servo are the reported phase offsets.  The output of the
servo is a frequency adjustment.

> Especially since "gradual adjustment over time" means it will still be
> slow to converge (just like adjusting frequency is today).

It depends on the servo parameters.
 
> We should definitely improve the doc to explain the diff between them
> and make sure that its more clear to driver implementations.

Sure.

> It also makes it harder to justify mapping small .adjtime to .adjphase,

adjtime is totally different.  Don't mix those two up:

- adjtime  sets the time (phase), plain and simple.

- adjphase feeds the phase offsets into a servo algorithm in hardware.

FWIW, the PHC subsystem handles the semantics of timex ADJ_FREQUENCY
and ADJ_OFFSET in exactly the same way as the NTP subsystem.


Thanks,
Richard

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-22 21:11                     ` Rahul Rameshbabu
@ 2023-01-23  2:22                       ` Richard Cochran
  2023-01-23  2:48                         ` Rahul Rameshbabu
  0 siblings, 1 reply; 59+ messages in thread
From: Richard Cochran @ 2023-01-23  2:22 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: Jakub Kicinski, Jacob Keller, Saeed Mahameed, David S. Miller,
	Paolo Abeni, Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan,
	Gal Pressman, Vincent Cheng

On Sun, Jan 22, 2023 at 01:11:57PM -0800, Rahul Rameshbabu wrote:

> The way NVIDIA devices internally handle adjphase is to adjust the
> frequency for a short period of time to make the small time offset
> adjustments smooth (using some internal calculations) where the time
> offset "nudge" is applied but frequency is also adjusted to prevent
> immediate drift after that time adjustment.

Whatever floats your boat.

> However, we aren't sure if
> this is the only approach possible to achieve accurate corrections for
> small offset adjustments with adjphase,

Typically one would implement a PI controller.

> so I would suggest that the
> documentation be updated to state something discussing that adjphase is
> expected to support small jumps in offset precisely without necessarily
> bringing up frequency manipulation potentially done to achieve this.

Sorry if the docs aren't clear.  (However, the use of NTP timex
ADJ_FREQUENCY and ADJ_OFFSET really should be clear to everyone
familiar with NTP.)

Bottom line: PHC that offers adjphase should implement a clock servo.

Thanks,
Richard






^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-23  2:22                       ` Richard Cochran
@ 2023-01-23  2:48                         ` Rahul Rameshbabu
  2023-01-23  2:58                           ` Richard Cochran
  0 siblings, 1 reply; 59+ messages in thread
From: Rahul Rameshbabu @ 2023-01-23  2:48 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Jakub Kicinski, Jacob Keller, Saeed Mahameed, David S. Miller,
	Paolo Abeni, Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan,
	Gal Pressman, Vincent Cheng

On Sun, 22 Jan, 2023 18:22:58 -0800 Richard Cochran <richardcochran@gmail.com> wrote:
> On Sun, Jan 22, 2023 at 01:11:57PM -0800, Rahul Rameshbabu wrote:
>
>> The way NVIDIA devices internally handle adjphase is to adjust the
>> frequency for a short period of time to make the small time offset
>> adjustments smooth (using some internal calculations) where the time
>> offset "nudge" is applied but frequency is also adjusted to prevent
>> immediate drift after that time adjustment.
>
> Whatever floats your boat.
>
>> However, we aren't sure if
>> this is the only approach possible to achieve accurate corrections for
>> small offset adjustments with adjphase,
>
> Typically one would implement a PI controller.
>
>> so I would suggest that the
>> documentation be updated to state something discussing that adjphase is
>> expected to support small jumps in offset precisely without necessarily
>> bringing up frequency manipulation potentially done to achieve this.
>
> Sorry if the docs aren't clear.  (However, the use of NTP timex
> ADJ_FREQUENCY and ADJ_OFFSET really should be clear to everyone
> familiar with NTP.)
>
> Bottom line: PHC that offers adjphase should implement a clock servo.

adjphase/ADJ_OFFSET is a servo implemented in the kernel space/offloaded
to the device. That makes sense.

Another question then is can adjtime/ADJ_SETOFFSET make use of this
servo implementation on the device or is there a strict expectation that
adjtime/ADJ_SETOFFSET is purely a function that adds the offset to the
current current time? If adjphase is implemented by a driver, should
ADJ_SETOFFSET try to make use of it in the ptp stack if the offset
provided is supported by the adjphase implementation?

This question was asked earlier in the mailing thread.

>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-23  2:48                         ` Rahul Rameshbabu
@ 2023-01-23  2:58                           ` Richard Cochran
  2023-01-23 18:44                             ` Rahul Rameshbabu
  0 siblings, 1 reply; 59+ messages in thread
From: Richard Cochran @ 2023-01-23  2:58 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: Jakub Kicinski, Jacob Keller, Saeed Mahameed, David S. Miller,
	Paolo Abeni, Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan,
	Gal Pressman, Vincent Cheng

On Sun, Jan 22, 2023 at 06:48:49PM -0800, Rahul Rameshbabu wrote:

> Another question then is can adjtime/ADJ_SETOFFSET make use of this
> servo implementation on the device or is there a strict expectation that
> adjtime/ADJ_SETOFFSET is purely a function that adds the offset to the
> current current time?

ADJ_SETOFFSET == clock_settime.  Drivers should set the time
immediately.

> If adjphase is implemented by a driver, should
> ADJ_SETOFFSET try to make use of it in the ptp stack if the offset
> provided is supported by the adjphase implementation?

No.


BTW, as mentioned in the thread, the KernelDoc is really, really bad:

 * @adjphase:  Adjusts the phase offset of the hardware clock.
 *             parameter delta: Desired change in nanoseconds.

The change log is much better:

    ptp: Add adjphase function to support phase offset control.

    Adds adjust phase function to take advantage of a PHC
    clock's hardware filtering capability that uses phase offset
    control word instead of frequency offset control word.

So the Kernel Doc should say something like:

 * @adjphase:  Feeds the given phase offset into the hardware clock's servo.
 *             parameter delta: Measured phase offset in nanoseconds.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-23  2:15                   ` Richard Cochran
@ 2023-01-23 18:14                     ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2023-01-23 18:14 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Rahul Rameshbabu, Jakub Kicinski, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Vincent Cheng



On 1/22/2023 6:15 PM, Richard Cochran wrote:
> Wow, lots of confusion in this thread!
> 
> On Fri, Jan 20, 2023 at 03:58:25PM -0800, Jacob Keller wrote:
> 
>> Sure. I guess what I don't understand is why or when one would want to
>> use adjphase instead of just performing frequency adjustment via the
>> .adjfine operation...
> 
> The difference is in where the frequency adjust is calculated.  There
> are two possibilities:
> 
> 1. It may be done in user space.
>    This is NTP timex's ADJ_FREQUENCY.
>    PHC implements .adjfine
> 
> 2. It may be left to kernel space.
>    This is NTP timex's ADJ_OFFSET.
>    PHC implements .adjphase
> 
> In case of #2, the hardware implements some kind of clock servo.  The
> inputs to the servo are the reported phase offsets.  The output of the
> servo is a frequency adjustment.
> 
>> Especially since "gradual adjustment over time" means it will still be
>> slow to converge (just like adjusting frequency is today).
> 
> It depends on the servo parameters.
>  
>> We should definitely improve the doc to explain the diff between them
>> and make sure that its more clear to driver implementations.
> 
> Sure.
> 
>> It also makes it harder to justify mapping small .adjtime to .adjphase,
> 
> adjtime is totally different.  Don't mix those two up:
> 
> - adjtime  sets the time (phase), plain and simple.
> 
> - adjphase feeds the phase offsets into a servo algorithm in hardware.
> 
> FWIW, the PHC subsystem handles the semantics of timex ADJ_FREQUENCY
> and ADJ_OFFSET in exactly the same way as the NTP subsystem.
> 
> 
> Thanks,
> Richard

Thanks for a succinct summary!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-23  2:58                           ` Richard Cochran
@ 2023-01-23 18:44                             ` Rahul Rameshbabu
  2023-01-23 19:13                               ` Jacob Keller
  2023-01-23 22:36                               ` Richard Cochran
  0 siblings, 2 replies; 59+ messages in thread
From: Rahul Rameshbabu @ 2023-01-23 18:44 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Jakub Kicinski, Jacob Keller, Saeed Mahameed, David S. Miller,
	Paolo Abeni, Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan,
	Gal Pressman, Vincent Cheng

Thanks for the followup. Have a couple more questions.

On Sun, 22 Jan, 2023 18:58:48 -0800 Richard Cochran <richardcochran@gmail.com> wrote:
> On Sun, Jan 22, 2023 at 06:48:49PM -0800, Rahul Rameshbabu wrote:
>
>> Another question then is can adjtime/ADJ_SETOFFSET make use of this
>> servo implementation on the device or is there a strict expectation that
>> adjtime/ADJ_SETOFFSET is purely a function that adds the offset to the
>> current current time?
>
> ADJ_SETOFFSET == clock_settime.  Drivers should set the time
> immediately.
>
>> If adjphase is implemented by a driver, should
>> ADJ_SETOFFSET try to make use of it in the ptp stack if the offset
>> provided is supported by the adjphase implementation?
>
> No.
>
>
> BTW, as mentioned in the thread, the KernelDoc is really, really bad:
>
>  * @adjphase:  Adjusts the phase offset of the hardware clock.
>  *             parameter delta: Desired change in nanoseconds.
>
> The change log is much better:
>
>     ptp: Add adjphase function to support phase offset control.
>
>     Adds adjust phase function to take advantage of a PHC
>     clock's hardware filtering capability that uses phase offset
>     control word instead of frequency offset control word.
>
> So the Kernel Doc should say something like:
>
>  * @adjphase:  Feeds the given phase offset into the hardware clock's servo.
>  *             parameter delta: Measured phase offset in nanoseconds.

Questions based on the proposed doc change.

1. Can the PHC servo change the frequency and not be expected to reset
   it back to the frequency before the phase control word is issued? If
   it's an expectation for the phase control word to reset the frequency
   back, I think that needs to be updated in the comments as a
   requirement.
2. Is it expected that a PHC servo implementation has a fixed
   configuration? In userspace servo implementations, configuration
   parameters are supported. Would we need devlink parameters to support
   configuring a PHC servo?

>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-23 18:44                             ` Rahul Rameshbabu
@ 2023-01-23 19:13                               ` Jacob Keller
  2023-01-23 22:39                                 ` Richard Cochran
  2023-01-23 22:36                               ` Richard Cochran
  1 sibling, 1 reply; 59+ messages in thread
From: Jacob Keller @ 2023-01-23 19:13 UTC (permalink / raw)
  To: Rahul Rameshbabu, Richard Cochran
  Cc: Jakub Kicinski, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Vincent Cheng



On 1/23/2023 10:44 AM, Rahul Rameshbabu wrote:
> Thanks for the followup. Have a couple more questions.
> 
> On Sun, 22 Jan, 2023 18:58:48 -0800 Richard Cochran <richardcochran@gmail.com> wrote:
>> On Sun, Jan 22, 2023 at 06:48:49PM -0800, Rahul Rameshbabu wrote:
>>
>>> Another question then is can adjtime/ADJ_SETOFFSET make use of this
>>> servo implementation on the device or is there a strict expectation that
>>> adjtime/ADJ_SETOFFSET is purely a function that adds the offset to the
>>> current current time?
>>
>> ADJ_SETOFFSET == clock_settime.  Drivers should set the time
>> immediately.
>>
>>> If adjphase is implemented by a driver, should
>>> ADJ_SETOFFSET try to make use of it in the ptp stack if the offset
>>> provided is supported by the adjphase implementation?
>>
>> No.
>>
>>
>> BTW, as mentioned in the thread, the KernelDoc is really, really bad:
>>
>>  * @adjphase:  Adjusts the phase offset of the hardware clock.
>>  *             parameter delta: Desired change in nanoseconds.
>>
>> The change log is much better:
>>
>>     ptp: Add adjphase function to support phase offset control.
>>
>>     Adds adjust phase function to take advantage of a PHC
>>     clock's hardware filtering capability that uses phase offset
>>     control word instead of frequency offset control word.
>>
>> So the Kernel Doc should say something like:
>>
>>  * @adjphase:  Feeds the given phase offset into the hardware clock's servo.
>>  *             parameter delta: Measured phase offset in nanoseconds.
> 
> Questions based on the proposed doc change.
> 
> 1. Can the PHC servo change the frequency and not be expected to reset
>    it back to the frequency before the phase control word is issued? If
>    it's an expectation for the phase control word to reset the frequency
>    back, I think that needs to be updated in the comments as a
>    requirement.


My understanding from what Richard said is that .adjphase basically
offloads the entire servo and corrections to the hardware, and thus
would become responsible for maintaining the phase correction long term,
and callers would not use both .adjphase at the same time as .adjtime or
.adjfine

> 2. Is it expected that a PHC servo implementation has a fixed
>    configuration? In userspace servo implementations, configuration
>    parameters are supported. Would we need devlink parameters to support
>    configuring a PHC servo?
> 

I would assume some sort of parameter configuration, either via devlink
or something in the ptp_clock ecosystem if we get a device that has such
configuration?

>>
>> Thanks,
>> Richard

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-23 18:44                             ` Rahul Rameshbabu
  2023-01-23 19:13                               ` Jacob Keller
@ 2023-01-23 22:36                               ` Richard Cochran
  2023-01-24 10:33                                 ` Bar Shapira
  1 sibling, 1 reply; 59+ messages in thread
From: Richard Cochran @ 2023-01-23 22:36 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: Jakub Kicinski, Jacob Keller, Saeed Mahameed, David S. Miller,
	Paolo Abeni, Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan,
	Gal Pressman, Vincent Cheng

On Mon, Jan 23, 2023 at 10:44:35AM -0800, Rahul Rameshbabu wrote:

> 1. Can the PHC servo change the frequency and not be expected to reset
>    it back to the frequency before the phase control word is issued? If
>    it's an expectation for the phase control word to reset the frequency
>    back, I think that needs to be updated in the comments as a
>    requirement.

I would expect the PHC to restore the previously dialed frequency once
the ADJ_OFFSET (adjphase) calls stop.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-23 19:13                               ` Jacob Keller
@ 2023-01-23 22:39                                 ` Richard Cochran
  0 siblings, 0 replies; 59+ messages in thread
From: Richard Cochran @ 2023-01-23 22:39 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Rahul Rameshbabu, Jakub Kicinski, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Vincent Cheng

On Mon, Jan 23, 2023 at 11:13:35AM -0800, Jacob Keller wrote:

> > 1. Can the PHC servo change the frequency and not be expected to reset
> >    it back to the frequency before the phase control word is issued? If
> >    it's an expectation for the phase control word to reset the frequency
> >    back, I think that needs to be updated in the comments as a
> >    requirement.
> 
> 
> My understanding from what Richard said is that .adjphase basically
> offloads the entire servo and corrections to the hardware, and thus
> would become responsible for maintaining the phase correction long term,
> and callers would not use both .adjphase at the same time as .adjtime or
> .adjfine

Right.

> > 2. Is it expected that a PHC servo implementation has a fixed
> >    configuration? In userspace servo implementations, configuration
> >    parameters are supported. Would we need devlink parameters to support
> >    configuring a PHC servo?
> > 
> 
> I would assume some sort of parameter configuration, either via devlink
> or something in the ptp_clock ecosystem if we get a device that has such
> configuration?

Yeah, but so far no one has asked for this.

Could also be debugfs if no commonality between hardware exists.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-23 22:36                               ` Richard Cochran
@ 2023-01-24 10:33                                 ` Bar Shapira
  2023-01-24 19:15                                   ` Richard Cochran
  0 siblings, 1 reply; 59+ messages in thread
From: Bar Shapira @ 2023-01-24 10:33 UTC (permalink / raw)
  To: Richard Cochran, Rahul Rameshbabu
  Cc: Jakub Kicinski, Jacob Keller, Saeed Mahameed, David S. Miller,
	Paolo Abeni, Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan,
	Gal Pressman, Vincent Cheng



On 24/01/2023 0:36, Richard Cochran wrote:
> On Mon, Jan 23, 2023 at 10:44:35AM -0800, Rahul Rameshbabu wrote:
> 
>> 1. Can the PHC servo change the frequency and not be expected to reset
>>     it back to the frequency before the phase control word is issued? If
>>     it's an expectation for the phase control word to reset the frequency
>>     back, I think that needs to be updated in the comments as a
>>     requirement.
> 
> I would expect the PHC to restore the previously dialed frequency once
> the ADJ_OFFSET (adjphase) calls stop.
> 

Hi Richard,
I guess this expectation should be part of the documentation too, right? 
Are there more expectations when calling adjphase?
In previous responses on the mailing list it said that adjphase should 
not cause the time to 'jump' - is it also correct?

It seems that "Feeds the given phase offset into the hardware clock's 
servo" is still missing some information.
Can you help clarify the expected result after calling adjphase from SW?

Thanks,
Bar

> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-24 10:33                                 ` Bar Shapira
@ 2023-01-24 19:15                                   ` Richard Cochran
  2023-01-25  0:48                                     ` Jakub Kicinski
  0 siblings, 1 reply; 59+ messages in thread
From: Richard Cochran @ 2023-01-24 19:15 UTC (permalink / raw)
  To: Bar Shapira
  Cc: Rahul Rameshbabu, Jakub Kicinski, Jacob Keller, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Vincent Cheng

On Tue, Jan 24, 2023 at 12:33:05PM +0200, Bar Shapira wrote:
> I guess this expectation should be part of the documentation too, right? Are
> there more expectations when calling adjphase?

I'll gladly ack improvements to the documentation. I myself won't
spend time on that, because it will only get ignored, even when it is
super clear.  Like ptp_clock_register(), for example.

> In previous responses on the mailing list it said that adjphase should not
> cause the time to 'jump' - is it also correct?

correct.
 
> It seems that "Feeds the given phase offset into the hardware clock's servo"
> is still missing some information.
> Can you help clarify the expected result after calling adjphase from SW?

If you don't have a servo implemented in hardware, then don't
implement .adjphase in your device driver.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-24 19:15                                   ` Richard Cochran
@ 2023-01-25  0:48                                     ` Jakub Kicinski
  2023-01-25  2:26                                       ` Rahul Rameshbabu
  2023-01-25  8:28                                       ` Gal Pressman
  0 siblings, 2 replies; 59+ messages in thread
From: Jakub Kicinski @ 2023-01-25  0:48 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Bar Shapira, Rahul Rameshbabu, Jacob Keller, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Vincent Cheng

On Tue, 24 Jan 2023 11:15:31 -0800 Richard Cochran wrote:
> On Tue, Jan 24, 2023 at 12:33:05PM +0200, Bar Shapira wrote:
> > I guess this expectation should be part of the documentation too, right? Are
> > there more expectations when calling adjphase?  
> 
> I'll gladly ack improvements to the documentation. I myself won't
> spend time on that, because it will only get ignored, even when it is
> super clear.  Like ptp_clock_register(), for example.

IMHO it is a responsibility of maintainers to try to teach, even if not
everyone turns out to be a diligent listener/reader. I've looked for
information about this callback at least twice in the last 6 months.

nVidia folks, could you please improve the doc, in that case? 
I think you also owe me the docs for your "debug" configuration 
of hairpin queues in debugfs.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-25  0:48                                     ` Jakub Kicinski
@ 2023-01-25  2:26                                       ` Rahul Rameshbabu
  2023-01-25  8:28                                       ` Gal Pressman
  1 sibling, 0 replies; 59+ messages in thread
From: Rahul Rameshbabu @ 2023-01-25  2:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Richard Cochran, Bar Shapira, Jacob Keller, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Vincent Cheng

On Tue, 24 Jan, 2023 16:48:32 -0800 Jakub Kicinski <kuba@kernel.org> wrote:
> On Tue, 24 Jan 2023 11:15:31 -0800 Richard Cochran wrote:
>> On Tue, Jan 24, 2023 at 12:33:05PM +0200, Bar Shapira wrote:
>> > I guess this expectation should be part of the documentation too, right? Are
>> > there more expectations when calling adjphase?  
>> 
>> I'll gladly ack improvements to the documentation. I myself won't
>> spend time on that, because it will only get ignored, even when it is
>> super clear.  Like ptp_clock_register(), for example.
>
> IMHO it is a responsibility of maintainers to try to teach, even if not
> everyone turns out to be a diligent listener/reader. I've looked for
> information about this callback at least twice in the last 6 months.
>
> nVidia folks, could you please improve the doc, in that case? 

ACK. We have a patch series underway for this. Other aspects of the
patch series include adding adjphase support to the testptp kselftest
(currently missing) and advertising the max phase offset supported by a
driver.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control
  2023-01-25  0:48                                     ` Jakub Kicinski
  2023-01-25  2:26                                       ` Rahul Rameshbabu
@ 2023-01-25  8:28                                       ` Gal Pressman
  1 sibling, 0 replies; 59+ messages in thread
From: Gal Pressman @ 2023-01-25  8:28 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Bar Shapira, Rahul Rameshbabu, Jacob Keller, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Vincent Cheng, Richard Cochran

On 25/01/2023 2:48, Jakub Kicinski wrote:
> On Tue, 24 Jan 2023 11:15:31 -0800 Richard Cochran wrote:
>> On Tue, Jan 24, 2023 at 12:33:05PM +0200, Bar Shapira wrote:
>>> I guess this expectation should be part of the documentation too, right? Are
>>> there more expectations when calling adjphase?  
>>
>> I'll gladly ack improvements to the documentation. I myself won't
>> spend time on that, because it will only get ignored, even when it is
>> super clear.  Like ptp_clock_register(), for example.
> 
> IMHO it is a responsibility of maintainers to try to teach, even if not
> everyone turns out to be a diligent listener/reader. I've looked for
> information about this callback at least twice in the last 6 months.
> 
> nVidia folks, could you please improve the doc, in that case? 
> I think you also owe me the docs for your "debug" configuration 
> of hairpin queues in debugfs.

I have patches converting the hairpin debugfs to devlink params, the
write debugfs files will be reverted.

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2023-01-25  8:29 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-18 18:35 [pull request][net-next 00/15] mlx5 updates 2023-01-18 Saeed Mahameed
2023-01-18 18:35 ` [net-next 01/15] net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB Saeed Mahameed
2023-01-18 21:31   ` Jacob Keller
2023-01-20  4:00   ` patchwork-bot+netdevbpf
2023-01-18 18:35 ` [net-next 02/15] net/mlx5: Suppress error logging on UCTX creation Saeed Mahameed
2023-01-18 21:32   ` Jacob Keller
2023-01-18 18:35 ` [net-next 03/15] net/mlx5: Add adjphase function to support hardware-only offset control Saeed Mahameed
2023-01-18 21:33   ` Jacob Keller
2023-01-20  3:46     ` Jakub Kicinski
2023-01-20  3:56       ` Rahul Rameshbabu
2023-01-20  4:03         ` Jakub Kicinski
2023-01-20  4:26           ` Rahul Rameshbabu
2023-01-20  5:08             ` Jakub Kicinski
2023-01-20  5:22               ` Rahul Rameshbabu
2023-01-20  5:45                 ` Jakub Kicinski
2023-01-20  6:02                   ` Rahul Rameshbabu
2023-01-20 17:21             ` Jacob Keller
2023-01-20 18:00               ` Rahul Rameshbabu
2023-01-20 23:58                 ` Jacob Keller
2023-01-21  0:06                   ` Jakub Kicinski
2023-01-22 21:11                     ` Rahul Rameshbabu
2023-01-23  2:22                       ` Richard Cochran
2023-01-23  2:48                         ` Rahul Rameshbabu
2023-01-23  2:58                           ` Richard Cochran
2023-01-23 18:44                             ` Rahul Rameshbabu
2023-01-23 19:13                               ` Jacob Keller
2023-01-23 22:39                                 ` Richard Cochran
2023-01-23 22:36                               ` Richard Cochran
2023-01-24 10:33                                 ` Bar Shapira
2023-01-24 19:15                                   ` Richard Cochran
2023-01-25  0:48                                     ` Jakub Kicinski
2023-01-25  2:26                                       ` Rahul Rameshbabu
2023-01-25  8:28                                       ` Gal Pressman
2023-01-23  2:15                   ` Richard Cochran
2023-01-23 18:14                     ` Jacob Keller
2023-01-18 18:35 ` [net-next 04/15] net/mlx5: Add hardware extended range support for PTP adjtime and adjphase Saeed Mahameed
2023-01-18 21:35   ` Jacob Keller
2023-01-18 18:35 ` [net-next 05/15] net/mlx5: E-switch, Remove redundant comment about meta rules Saeed Mahameed
2023-01-18 21:40   ` Jacob Keller
2023-01-18 18:35 ` [net-next 06/15] net/mlx5e: Fail with messages when params are not valid for XSK Saeed Mahameed
2023-01-18 21:41   ` Jacob Keller
2023-01-18 18:35 ` [net-next 07/15] net/mlx5e: Add warning when log WQE size is smaller than log stride size Saeed Mahameed
2023-01-18 21:42   ` Jacob Keller
2023-01-18 18:35 ` [net-next 08/15] net/mlx5e: TC, Pass flow attr to attach/detach mod hdr functions Saeed Mahameed
2023-01-18 21:42   ` Jacob Keller
2023-01-18 18:35 ` [net-next 09/15] net/mlx5e: TC, Add tc prefix to attach/detach " Saeed Mahameed
2023-01-18 21:44   ` Jacob Keller
2023-01-18 18:35 ` [net-next 10/15] net/mlx5e: TC, Use common function allocating flow mod hdr or encap mod hdr Saeed Mahameed
2023-01-18 21:45   ` Jacob Keller
2023-01-18 18:35 ` [net-next 11/15] net/mlx5e: Warn when destroying mod hdr hash table that is not empty Saeed Mahameed
2023-01-18 21:45   ` Jacob Keller
2023-01-18 18:35 ` [net-next 12/15] net/mlx5: E-Switch, Fix typo for egress Saeed Mahameed
2023-01-18 21:46   ` Jacob Keller
2023-01-18 18:36 ` [net-next 13/15] net/mlx5e: Support Geneve and GRE with VF tunnel offload Saeed Mahameed
2023-01-18 21:48   ` Jacob Keller
2023-01-18 18:36 ` [net-next 14/15] net/mlx5e: Remove redundant allocation of spec in create indirect fwd group Saeed Mahameed
2023-01-18 21:50   ` Jacob Keller
2023-01-18 18:36 ` [net-next 15/15] net/mlx5e: Use read lock for eswitch get callbacks Saeed Mahameed
2023-01-18 21:51   ` Jacob Keller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.