* [pull request][net-next 00/15] mlx5 updates 2021-01-07
@ 2021-01-08  5:30 Saeed Mahameed
  2021-01-08  5:30 ` [net-next 01/15] net/mlx5: Add HW definition of reg_c_preserve Saeed Mahameed
                   ` (14 more replies)
  0 siblings, 15 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski; +Cc: netdev, Saeed Mahameed

From: Saeed Mahameed <saeedm@nvidia.com>

Hi Dave, Jakub

This series provides miscellaneous updates to mlx5, plus support for
+trk+new TC connection tracking rules for UDP.

For more information please see tag log below.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit 58334e7537278793c86baa88b70c48b0d50b00ae:

  Merge branch 'generic-zcopy_-functions' (2021-01-07 16:08:38 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2021-01-07

for you to fetch changes up to 29c9b5c4ed42ca9d7c8f8f6c34a7ef7ab787ad6e:

  net/mlx5e: IPsec, Remove unnecessary config flag usage (2021-01-07 21:27:08 -0800)

----------------------------------------------------------------
mlx5-updates-2021-01-07

Misc updates series for mlx5 driver:

1) From Eli and Jianbo, E-Switch cleanups and usage of a new
   FW capability for MPLS over UDP

2) From Paul Blakey, support for mirroring with connection tracking,
implemented by splitting rules into pre- and post-connection-tracking
parts that perform the mirroring.

3) From Roi Dayan, support for +trk+new connection tracking rules
 3.1) Cleanups to connection tracking
 3.2) To support the +trk+new feature of connection tracking, Roi adds another
     flow table that catches all missed packets from the original CT table
     (the table that handles stateful established flows)
 3.3) Add support to offload +trk+new rules for terminating flows for UDP
    protocols that use source port entropy.
    Only the default registered VXLAN, RoCE and Geneve ports are supported.
    Using the registered ports assumes the traffic is that of the
    registered protocol.

4) From Tariq, cleanups and improvements to IPsec

----------------------------------------------------------------
Eli Cohen (2):
      net/mlx5e: Simplify condition on esw_vport_enable_qos()
      net/mlx5: E-Switch, use new cap as condition for mpls over udp

Jianbo Liu (1):
      net/mlx5e: E-Switch, Offload all chain 0 priorities when modify header and forward action is not supported

Paul Blakey (2):
      net/mlx5: Add HW definition of reg_c_preserve
      net/mlx5e: CT: Add support for mirroring

Roi Dayan (6):
      net/mlx5e: CT: Pass null instead of zero spec
      net/mlx5e: Remove redundant initialization to null
      net/mlx5e: CT: Remove redundant usage of zone mask
      net/mlx5e: CT: Preparation for offloading +trk+new ct rules
      net/mlx5e: CT: Support offload of +trk+new ct rules
      net/mlx5e: CT, Avoid false lock dependency warning

Tariq Toukan (4):
      net/mlx5e: IPsec, Enclose csum logic under ipsec config
      net/mlx5e: IPsec, Avoid unreachable return
      net/mlx5e: IPsec, Inline feature_check fast-path function
      net/mlx5e: IPsec, Remove unnecessary config flag usage

 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c | 353 +++++++++++++++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h |   6 +
 .../mellanox/mlx5/core/en/tc_tun_mplsoudp.c        |   4 +-
 .../mellanox/mlx5/core/en_accel/en_accel.h         |   4 +-
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c       |  14 -
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.h       |  29 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   4 -
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   2 -
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |  47 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   3 +-
 .../ethernet/mellanox/mlx5/core/lib/fs_chains.c    |   7 +-
 include/linux/mlx5/mlx5_ifc.h                      |   4 +-
 13 files changed, 399 insertions(+), 81 deletions(-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [net-next 01/15] net/mlx5: Add HW definition of reg_c_preserve
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 02/15] net/mlx5e: Simplify condition on esw_vport_enable_qos() Saeed Mahameed
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Paul Blakey, Maor Dickman, Roi Dayan, Saeed Mahameed

From: Paul Blakey <paulb@mellanox.com>

Add capability bit to test whether reg_c value is preserved on
recirculation.
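
The hunk below splits reserved_at_a4[0x7] into 0x5 + 0x1 + 0x1 so that
the new bit lands at offset 0xa9 without shifting any later field. As a
quick consistency check (plain C, not the kernel's mlx5_ifc machinery;
the field table is transcribed from the hunk):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct field { const char *name; unsigned width; };

/* Field widths after the patch, starting at bit offset 0xa0. */
static const struct field cap_fields[] = {
	{ "reserved_at_a0", 0x3 },
	{ "ece_support",    0x1 },
	{ "reserved_at_a4", 0x5 },
	{ "reg_c_preserve", 0x1 },
	{ "reserved_at_aa", 0x1 },
	{ "log_max_srq",    0x5 },
};

/* Starting bit offset of the named field, or -1 if absent. */
static int field_offset(const char *name)
{
	unsigned off = 0xa0;
	size_t i;

	for (i = 0; i < sizeof(cap_fields) / sizeof(cap_fields[0]); i++) {
		if (strcmp(cap_fields[i].name, name) == 0)
			return (int)off;
		off += cap_fields[i].width;
	}
	return -1;
}

/* Bit offset just past the last field; must still be 0xb0. */
static unsigned fields_end(void)
{
	unsigned off = 0xa0;
	size_t i;

	for (i = 0; i < sizeof(cap_fields) / sizeof(cap_fields[0]); i++)
		off += cap_fields[i].width;
	return off;
}
```

Summing the widths shows reg_c_preserve at 0xa9, log_max_srq unchanged
at 0xab, and the block still ending at 0xb0 where reserved_at_b0 begins.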

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 8fbddec26eb8..ec0d01e26b3e 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1278,7 +1278,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 
 	u8         reserved_at_a0[0x3];
 	u8	   ece_support[0x1];
-	u8	   reserved_at_a4[0x7];
+	u8	   reserved_at_a4[0x5];
+	u8         reg_c_preserve[0x1];
+	u8         reserved_at_aa[0x1];
 	u8         log_max_srq[0x5];
 	u8         reserved_at_b0[0x2];
 	u8         ts_cqe_to_dest_cqn[0x1];
-- 
2.26.2



* [net-next 02/15] net/mlx5e: Simplify condition on esw_vport_enable_qos()
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
  2021-01-08  5:30 ` [net-next 01/15] net/mlx5: Add HW definition of reg_c_preserve Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 03/15] net/mlx5: E-Switch, use new cap as condition for mpls over udp Saeed Mahameed
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Eli Cohen, Roi Dayan, Saeed Mahameed

From: Eli Cohen <elic@nvidia.com>

esw->qos.enabled will only be true if both MLX5_CAP_GEN(dev, qos) and
MLX5_CAP_QOS(dev, esw_scheduling) are true. Therefore, remove them from
the condition and rely only on esw->qos.enabled.
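
The simplification is safe because of that invariant. A minimal sketch
(illustrative pure functions, not driver code) comparing the old and new
conditions over every state where "qos.enabled implies both caps" holds:

```c
#include <assert.h>
#include <stdbool.h>

/* Old early-return condition from esw_vport_enable_qos(). */
static bool old_skip(bool qos_enabled, bool cap_qos, bool cap_esw_sched)
{
	return !qos_enabled || !cap_qos || !cap_esw_sched;
}

/* New condition: only qos_enabled matters. */
static bool new_skip(bool qos_enabled)
{
	return !qos_enabled;
}

/* True iff both conditions agree on every state respecting the
 * invariant that qos_enabled implies both capability bits. */
static bool conditions_equivalent(void)
{
	int q, c1, c2;

	for (q = 0; q < 2; q++)
		for (c1 = 0; c1 < 2; c1++)
			for (c2 = 0; c2 < 2; c2++) {
				if (q && !(c1 && c2))
					continue; /* violates the invariant */
				if (old_skip(q, c1, c2) != new_skip(q))
					return false;
			}
	return true;
}
```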

Fixes: 1bd27b11c1df ("net/mlx5: Introduce E-switch QoS management")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index da901e364656..876e6449edb3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1042,8 +1042,7 @@ static int esw_vport_enable_qos(struct mlx5_eswitch *esw,
 	void *vport_elem;
 	int err = 0;
 
-	if (!esw->qos.enabled || !MLX5_CAP_GEN(dev, qos) ||
-	    !MLX5_CAP_QOS(dev, esw_scheduling))
+	if (!esw->qos.enabled)
 		return 0;
 
 	if (vport->qos.enabled)
-- 
2.26.2



* [net-next 03/15] net/mlx5: E-Switch, use new cap as condition for mpls over udp
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
  2021-01-08  5:30 ` [net-next 01/15] net/mlx5: Add HW definition of reg_c_preserve Saeed Mahameed
  2021-01-08  5:30 ` [net-next 02/15] net/mlx5e: Simplify condition on esw_vport_enable_qos() Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 04/15] net/mlx5e: E-Switch, Offload all chain 0 priorities when modify header and forward action is not supported Saeed Mahameed
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Eli Cohen, Roi Dayan, Mark Bloch, Saeed Mahameed

From: Eli Cohen <elic@nvidia.com>

Use tunnel_stateless_mpls_over_udp instead of
MLX5_FLEX_PROTO_CW_MPLS_UDP, since new devices have native support for
MPLS over UDP and do not rely on the flex parser.
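
The diff below keeps the flex-parser bit as a fallback for older
devices. The shape of the new check, as a standalone sketch (the bit
value here is illustrative, not the kernel's actual enum value):

```c
#include <assert.h>
#include <stdbool.h>

#define FLEX_PROTO_CW_MPLS_UDP (1u << 1) /* illustrative bit position */

/* Mirror of the new condition in parse_tunnel(): prefer the native
 * tunnel_stateless_mpls_over_udp capability, fall back to the
 * flex-parser protocol bit. */
static bool mpls_over_udp_supported(bool native_cap, unsigned flex_protocols)
{
	return native_cap ||
	       (flex_protocols & FLEX_PROTO_CW_MPLS_UDP) != 0;
}
```

Either capability alone is enough; only devices with neither return
-EOPNOTSUPP.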

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_mplsoudp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_mplsoudp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_mplsoudp.c
index 1f9526244222..3479672e84cf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_mplsoudp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_mplsoudp.c
@@ -81,8 +81,8 @@ static int parse_tunnel(struct mlx5e_priv *priv,
 	if (!enc_keyid.mask->keyid)
 		return 0;
 
-	if (!(MLX5_CAP_GEN(priv->mdev, flex_parser_protocols) &
-	      MLX5_FLEX_PROTO_CW_MPLS_UDP))
+	if (!MLX5_CAP_ETH(priv->mdev, tunnel_stateless_mpls_over_udp) &&
+	    !(MLX5_CAP_GEN(priv->mdev, flex_parser_protocols) & MLX5_FLEX_PROTO_CW_MPLS_UDP))
 		return -EOPNOTSUPP;
 
 	flow_rule_match_mpls(rule, &match);
-- 
2.26.2



* [net-next 04/15] net/mlx5e: E-Switch, Offload all chain 0 priorities when modify header and forward action is not supported
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (2 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 03/15] net/mlx5: E-Switch, use new cap as condition for mpls over udp Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 05/15] net/mlx5e: CT: Pass null instead of zero spec Saeed Mahameed
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Jianbo Liu, Oz Shlomo, Roi Dayan, Saeed Mahameed

From: Jianbo Liu <jianbol@mellanox.com>

Miss path handling of tc multi chain filters (i.e. filters that are
defined on chain > 0) requires the hardware to communicate to the
driver the last chain that was processed. This is possible only when
the hardware is capable of performing the combination of modify header
and forward to table actions. Currently, if the hardware is missing
this capability then the driver only offloads rules that are defined
on tc chain 0 prio 1. However, this restriction can be relaxed because
packets that miss from chain 0 are processed through all the
priorities by tc software.

Allow the offload of all the supported priorities for chain 0 even
when the hardware is not capable of performing modify header and goto
table actions.
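
With the clamp-to-1 branch removed (see the fs_chains.c hunk below),
the prio range no longer collapses when prios are unsupported. A rough
model of the resulting behavior (FDB_TC_MAX_PRIO is a stand-in
constant, not necessarily the driver's value):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define FDB_TC_MAX_PRIO 16 /* assumed placeholder for the real limit */

/* Sketch of mlx5_chains_get_prio_range() after the patch: the old
 * "return 1 when prios are unsupported" early exit is gone, so
 * chain 0 rules with prio > 1 are no longer rejected. */
static unsigned int prio_range(bool ignore_flow_level_supported)
{
	if (ignore_flow_level_supported)
		return UINT_MAX;
	return FDB_TC_MAX_PRIO;
}
```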

Fixes: 0b3a8b6b5340 ("net/mlx5: E-Switch: Fix using fwd and modify when firmware doesn't support it")
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c         | 6 ------
 drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c | 3 ---
 2 files changed, 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4cdf834fa74a..56aa39ac1a1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1317,12 +1317,6 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
 	int err = 0;
 	int out_index;
 
-	if (!mlx5_chains_prios_supported(esw_chains(esw)) && attr->prio != 1) {
-		NL_SET_ERR_MSG_MOD(extack,
-				   "E-switch priorities unsupported, upgrade FW");
-		return -EOPNOTSUPP;
-	}
-
 	/* We check chain range only for tc flows.
 	 * For ft flows, we checked attr->chain was originally 0 and set it to
 	 * FDB_FT_CHAIN which is outside tc range.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c
index 947f346bdc2d..fa61d305990c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c
@@ -141,9 +141,6 @@ u32 mlx5_chains_get_nf_ft_chain(struct mlx5_fs_chains *chains)
 
 u32 mlx5_chains_get_prio_range(struct mlx5_fs_chains *chains)
 {
-	if (!mlx5_chains_prios_supported(chains))
-		return 1;
-
 	if (mlx5_chains_ignore_flow_level_supported(chains))
 		return UINT_MAX;
 
-- 
2.26.2



* [net-next 05/15] net/mlx5e: CT: Pass null instead of zero spec
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (3 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 04/15] net/mlx5e: E-Switch, Offload all chain 0 priorities when modify header and forward action is not supported Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 06/15] net/mlx5e: Remove redundant initialization to null Saeed Mahameed
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Roi Dayan, Oz Shlomo, Paul Blakey, Saeed Mahameed

From: Roi Dayan <roid@nvidia.com>

No need to pass a zeroed spec to mlx5_add_flow_rules(), as the
function can handle a NULL spec.
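
The callee-side convention this relies on is that NULL behaves like a
zeroed (match-all) spec. A toy illustration of the pattern (types and
names are made up, not the mlx5 API):

```c
#include <assert.h>
#include <stddef.h>

struct flow_spec { unsigned match_criteria_enable; };

/* Accept NULL as shorthand for an all-zero, match-everything spec,
 * so callers adding a plain miss/forward rule need not memset and
 * pass a dummy spec. Returns the effective criteria bits. */
static unsigned add_rule(const struct flow_spec *spec)
{
	static const struct flow_spec empty; /* zero-initialized */

	if (!spec)
		spec = &empty; /* NULL == zeroed spec */
	return spec->match_criteria_enable; /* 0 means match-all */
}
```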

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index e521254d886e..a0b193181ba5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -1220,9 +1220,8 @@ static int tc_ct_pre_ct_add_rules(struct mlx5_ct_ft *ct_ft,
 	pre_ct->flow_rule = rule;
 
 	/* add miss rule */
-	memset(spec, 0, sizeof(*spec));
 	dest.ft = nat ? ct_priv->ct_nat : ct_priv->ct;
-	rule = mlx5_add_flow_rules(ft, spec, &flow_act, &dest, 1);
+	rule = mlx5_add_flow_rules(ft, NULL, &flow_act, &dest, 1);
 	if (IS_ERR(rule)) {
 		err = PTR_ERR(rule);
 		ct_dbg("Failed to add pre ct miss rule zone %d", zone);
-- 
2.26.2



* [net-next 06/15] net/mlx5e: Remove redundant initialization to null
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (4 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 05/15] net/mlx5e: CT: Pass null instead of zero spec Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 07/15] net/mlx5e: CT: Remove redundant usage of zone mask Saeed Mahameed
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Roi Dayan, Oz Shlomo, Paul Blakey, Saeed Mahameed

From: Roi Dayan <roid@nvidia.com>

The miss_rule and prio_s variables are not referenced before being
assigned, so there is no need to initialize them to NULL.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c
index fa61d305990c..381325b4a863 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c
@@ -538,13 +538,13 @@ mlx5_chains_create_prio(struct mlx5_fs_chains *chains,
 			u32 chain, u32 prio, u32 level)
 {
 	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
-	struct mlx5_flow_handle *miss_rule = NULL;
+	struct mlx5_flow_handle *miss_rule;
 	struct mlx5_flow_group *miss_group;
 	struct mlx5_flow_table *next_ft;
 	struct mlx5_flow_table *ft;
-	struct prio *prio_s = NULL;
 	struct fs_chain *chain_s;
 	struct list_head *pos;
+	struct prio *prio_s;
 	u32 *flow_group_in;
 	int err;
 
-- 
2.26.2



* [net-next 07/15] net/mlx5e: CT: Remove redundant usage of zone mask
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (5 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 06/15] net/mlx5e: Remove redundant initialization to null Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules Saeed Mahameed
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Roi Dayan, Paul Blakey, Saeed Mahameed

From: Roi Dayan <roid@nvidia.com>

The zone member is of type u16, so there is no reason to apply
the zone mask to it. This also matches the calls that set this
match elsewhere, which do not need and do not apply the mask.
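
Since the zone register is 16 bits wide, masking a u16 value with a
16-bit GENMASK is a no-op. A one-liner check (the mask literal mirrors
GENMASK(15, 0), i.e. MLX5_CT_ZONE_MASK for a 2-byte mapping):

```c
#include <assert.h>
#include <stdint.h>

#define CT_ZONE_MASK ((1u << 16) - 1) /* GENMASK(15, 0) */

/* Any u16 zone already fits in the mask, so masking changes
 * nothing; the mask argument to the match call is still needed,
 * only the masking of the value was redundant. */
static int zone_mask_is_noop(uint16_t zone)
{
	return (zone & CT_ZONE_MASK) == zone;
}
```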

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index a0b193181ba5..d7ecd5e5f7c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -703,9 +703,7 @@ mlx5_tc_ct_entry_add_rule(struct mlx5_tc_ct_priv *ct_priv,
 	attr->flags |= MLX5_ESW_ATTR_FLAG_NO_IN_PORT;
 
 	mlx5_tc_ct_set_tuple_match(netdev_priv(ct_priv->netdev), spec, flow_rule);
-	mlx5e_tc_match_to_reg_match(spec, ZONE_TO_REG,
-				    entry->tuple.zone & MLX5_CT_ZONE_MASK,
-				    MLX5_CT_ZONE_MASK);
+	mlx5e_tc_match_to_reg_match(spec, ZONE_TO_REG, entry->tuple.zone, MLX5_CT_ZONE_MASK);
 
 	zone_rule->rule = mlx5_tc_rule_insert(priv, spec, attr);
 	if (IS_ERR(zone_rule->rule)) {
-- 
2.26.2



* [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (6 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 07/15] net/mlx5e: CT: Remove redundant usage of zone mask Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08 21:48   ` Marcelo Ricardo Leitner
  2021-01-08  5:30 ` [net-next 09/15] net/mlx5e: CT: Support offload of " Saeed Mahameed
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Roi Dayan, Paul Blakey, Saeed Mahameed

From: Roi Dayan <roid@nvidia.com>

Connection tracking associates a connection state with each packet. The
first packet of a connection is assigned the +trk+new state. The
connection enters the established state once a packet is seen in the
other direction.

Currently we offload only the established flows. However, UDP traffic
using source port entropy (e.g. vxlan, RoCE) will never enter the
established state. Such protocols do not require stateful processing,
and therefore could be offloaded.

The change in the model is that a miss on the CT table will be forwarded
to a new +trk+new ct table and a miss there will be forwarded to the slow
path table.
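
The new pipeline can be summarized as a two-stage miss chain. A toy
model of the lookup order (illustrative only; the real tables match on
tuples and registers, not booleans):

```c
#include <assert.h>
#include <stdbool.h>

enum verdict { HIT_CT, HIT_TRK_NEW, SLOW_PATH };

/* Model of the miss path after this patch: established tuples hit
 * the CT table; a miss falls through to the trk_new table, which
 * matches registered UDP dst ports; a second miss goes to the slow
 * path (software conntrack). */
static enum verdict classify(bool tuple_established, bool dport_registered)
{
	if (tuple_established)
		return HIT_CT;      /* full CT actions, then post_ct */
	if (dport_registered)
		return HIT_TRK_NEW; /* +trk+new set, then post_ct */
	return SLOW_PATH;           /* handled by SW conntrack */
}
```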

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/tc_ct.c    | 104 ++++++++++++++++--
 1 file changed, 96 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index d7ecd5e5f7c4..6dac2fabb7f5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -21,6 +21,7 @@
 #include "en.h"
 #include "en_tc.h"
 #include "en_rep.h"
+#include "fs_core.h"
 
 #define MLX5_CT_ZONE_BITS (mlx5e_tc_attr_to_reg_mappings[ZONE_TO_REG].mlen * 8)
 #define MLX5_CT_ZONE_MASK GENMASK(MLX5_CT_ZONE_BITS - 1, 0)
@@ -50,6 +51,9 @@ struct mlx5_tc_ct_priv {
 	struct mlx5_flow_table *ct;
 	struct mlx5_flow_table *ct_nat;
 	struct mlx5_flow_table *post_ct;
+	struct mlx5_flow_table *trk_new_ct;
+	struct mlx5_flow_group *miss_grp;
+	struct mlx5_flow_handle *miss_rule;
 	struct mutex control_lock; /* guards parallel adds/dels */
 	struct mutex shared_counter_lock;
 	struct mapping_ctx *zone_mapping;
@@ -1490,14 +1494,14 @@ mlx5_tc_ct_del_ft_cb(struct mlx5_tc_ct_priv *ct_priv, struct mlx5_ct_ft *ft)
  *      | set zone
  *      v
  * +--------------------+
- * + CT (nat or no nat) +
- * + tuple + zone match +
- * +--------------------+
- *      | set mark
- *      | set labels_id
- *      | set established
- *	| set zone_restore
- *      | do nat (if needed)
+ * + CT (nat or no nat) +    miss          +---------------------+  miss
+ * + tuple + zone match +----------------->+ trk_new_ct          +-------> SW
+ * +--------------------+                  + vxlan||roce match   +
+ *      | set mark                         +---------------------+
+ *      | set labels_id                             | set ct_state +trk+new
+ *      | set established                           | set zone_restore
+ *	| set zone_restore                          v
+ *      | do nat (if needed)                      post_ct
  *      v
  * +--------------+
  * + post_ct      + original filter actions
@@ -1893,6 +1897,72 @@ mlx5_tc_ct_init_check_support(struct mlx5e_priv *priv,
 		return mlx5_tc_ct_init_check_nic_support(priv, err_msg);
 }
 
+static struct mlx5_flow_handle *
+tc_ct_add_miss_rule(struct mlx5_flow_table *ft,
+		    struct mlx5_flow_table *next_ft)
+{
+	struct mlx5_flow_destination dest = {};
+	struct mlx5_flow_act act = {};
+
+	act.flags  = FLOW_ACT_IGNORE_FLOW_LEVEL | FLOW_ACT_NO_APPEND;
+	act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
+	dest.type  = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
+	dest.ft = next_ft;
+
+	return mlx5_add_flow_rules(ft, NULL, &act, &dest, 1);
+}
+
+static int
+tc_ct_add_ct_table_miss_rule(struct mlx5_tc_ct_priv *ct_priv)
+{
+	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+	struct mlx5_flow_handle *miss_rule;
+	struct mlx5_flow_group *miss_group;
+	int max_fte = ct_priv->ct->max_fte;
+	u32 *flow_group_in;
+	int err = 0;
+
+	flow_group_in = kvzalloc(inlen, GFP_KERNEL);
+	if (!flow_group_in)
+		return -ENOMEM;
+
+	/* create miss group */
+	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index,
+		 max_fte - 2);
+	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index,
+		 max_fte - 1);
+	miss_group = mlx5_create_flow_group(ct_priv->ct, flow_group_in);
+	if (IS_ERR(miss_group)) {
+		err = PTR_ERR(miss_group);
+		goto err_miss_grp;
+	}
+
+	/* add miss rule to next fdb */
+	miss_rule = tc_ct_add_miss_rule(ct_priv->ct, ct_priv->trk_new_ct);
+	if (IS_ERR(miss_rule)) {
+		err = PTR_ERR(miss_rule);
+		goto err_miss_rule;
+	}
+
+	ct_priv->miss_grp = miss_group;
+	ct_priv->miss_rule = miss_rule;
+	kvfree(flow_group_in);
+	return 0;
+
+err_miss_rule:
+	mlx5_destroy_flow_group(miss_group);
+err_miss_grp:
+	kvfree(flow_group_in);
+	return err;
+}
+
+static void
+tc_ct_del_ct_table_miss_rule(struct mlx5_tc_ct_priv *ct_priv)
+{
+	mlx5_del_flow_rules(ct_priv->miss_rule);
+	mlx5_destroy_flow_group(ct_priv->miss_grp);
+}
+
 #define INIT_ERR_PREFIX "tc ct offload init failed"
 
 struct mlx5_tc_ct_priv *
@@ -1962,6 +2032,18 @@ mlx5_tc_ct_init(struct mlx5e_priv *priv, struct mlx5_fs_chains *chains,
 		goto err_post_ct_tbl;
 	}
 
+	ct_priv->trk_new_ct = mlx5_chains_create_global_table(chains);
+	if (IS_ERR(ct_priv->trk_new_ct)) {
+		err = PTR_ERR(ct_priv->trk_new_ct);
+		mlx5_core_warn(dev, "%s, failed to create trk new ct table err: %d",
+			       INIT_ERR_PREFIX, err);
+		goto err_trk_new_ct_tbl;
+	}
+
+	err = tc_ct_add_ct_table_miss_rule(ct_priv);
+	if (err)
+		goto err_init_ct_tbl;
+
 	idr_init(&ct_priv->fte_ids);
 	mutex_init(&ct_priv->control_lock);
 	mutex_init(&ct_priv->shared_counter_lock);
@@ -1971,6 +2053,10 @@ mlx5_tc_ct_init(struct mlx5e_priv *priv, struct mlx5_fs_chains *chains,
 
 	return ct_priv;
 
+err_init_ct_tbl:
+	mlx5_chains_destroy_global_table(chains, ct_priv->trk_new_ct);
+err_trk_new_ct_tbl:
+	mlx5_chains_destroy_global_table(chains, ct_priv->post_ct);
 err_post_ct_tbl:
 	mlx5_chains_destroy_global_table(chains, ct_priv->ct_nat);
 err_ct_nat_tbl:
@@ -1997,6 +2083,8 @@ mlx5_tc_ct_clean(struct mlx5_tc_ct_priv *ct_priv)
 
 	chains = ct_priv->chains;
 
+	tc_ct_del_ct_table_miss_rule(ct_priv);
+	mlx5_chains_destroy_global_table(chains, ct_priv->trk_new_ct);
 	mlx5_chains_destroy_global_table(chains, ct_priv->post_ct);
 	mlx5_chains_destroy_global_table(chains, ct_priv->ct_nat);
 	mlx5_chains_destroy_global_table(chains, ct_priv->ct);
-- 
2.26.2



* [net-next 09/15] net/mlx5e: CT: Support offload of +trk+new ct rules
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (7 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08 21:59   ` Marcelo Ricardo Leitner
  2021-01-08  5:30 ` [net-next 10/15] net/mlx5e: CT: Add support for mirroring Saeed Mahameed
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Roi Dayan, Paul Blakey, Saeed Mahameed

From: Roi Dayan <roid@nvidia.com>

Add support to offload +trk+new rules for terminating flows for UDP
protocols that use source port entropy.
This kind of traffic will never be considered connected in conntrack,
and thus never marked established, so there is no need to keep
track of it in SW conntrack; offload this traffic based on dst
port.
This commit supports only the default registered VXLAN, RoCE and
Geneve ports. Using the registered ports assumes the traffic
is that of the registered protocol.
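
The eligibility test reduces to a destination-port check against the
three defaults added in the patch (the port numbers below are taken
from the diff's default_udp_ports array; the helper itself is just a
sketch):

```c
#include <assert.h>
#include <stdbool.h>

#define IANA_VXLAN_UDP_PORT 4789
#define ROCE_V2_UDP_DPORT   4791
#define GENEVE_UDP_PORT     6081

/* A UDP flow qualifies for +trk+new offload when its destination
 * port is one of the registered defaults; source-port entropy makes
 * the source port useless as a match key. */
static bool trk_new_offloadable(unsigned dst_port)
{
	return dst_port == IANA_VXLAN_UDP_PORT ||
	       dst_port == ROCE_V2_UDP_DPORT ||
	       dst_port == GENEVE_UDP_PORT;
}
```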

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/tc_ct.c    | 228 +++++++++++++++++-
 .../ethernet/mellanox/mlx5/core/en/tc_ct.h    |   6 +
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |  16 +-
 3 files changed, 236 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index 6dac2fabb7f5..b0c357f755d4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -25,9 +25,6 @@
 
 #define MLX5_CT_ZONE_BITS (mlx5e_tc_attr_to_reg_mappings[ZONE_TO_REG].mlen * 8)
 #define MLX5_CT_ZONE_MASK GENMASK(MLX5_CT_ZONE_BITS - 1, 0)
-#define MLX5_CT_STATE_ESTABLISHED_BIT BIT(1)
-#define MLX5_CT_STATE_TRK_BIT BIT(2)
-#define MLX5_CT_STATE_NAT_BIT BIT(3)
 
 #define MLX5_FTE_ID_BITS (mlx5e_tc_attr_to_reg_mappings[FTEID_TO_REG].mlen * 8)
 #define MLX5_FTE_ID_MAX GENMASK(MLX5_FTE_ID_BITS - 1, 0)
@@ -39,6 +36,17 @@
 #define ct_dbg(fmt, args...)\
 	netdev_dbg(ct_priv->netdev, "ct_debug: " fmt "\n", ##args)
 
+#define IANA_VXLAN_UDP_PORT    4789
+#define ROCE_V2_UDP_DPORT      4791
+#define GENEVE_UDP_PORT        6081
+#define DEFAULT_UDP_PORTS 3
+
+static int default_udp_ports[] = {
+	IANA_VXLAN_UDP_PORT,
+	ROCE_V2_UDP_DPORT,
+	GENEVE_UDP_PORT,
+};
+
 struct mlx5_tc_ct_priv {
 	struct mlx5_core_dev *dev;
 	const struct net_device *netdev;
@@ -88,6 +96,16 @@ struct mlx5_tc_ct_pre {
 	struct mlx5_modify_hdr *modify_hdr;
 };
 
+struct mlx5_tc_ct_trk_new_rule {
+	struct mlx5_flow_handle *flow_rule;
+	struct list_head list;
+};
+
+struct mlx5_tc_ct_trk_new_rules {
+	struct list_head rules;
+	struct mlx5_modify_hdr *modify_hdr;
+};
+
 struct mlx5_ct_ft {
 	struct rhash_head node;
 	u16 zone;
@@ -98,6 +116,8 @@ struct mlx5_ct_ft {
 	struct rhashtable ct_entries_ht;
 	struct mlx5_tc_ct_pre pre_ct;
 	struct mlx5_tc_ct_pre pre_ct_nat;
+	struct mlx5_tc_ct_trk_new_rules trk_new_rules;
+	struct nf_conn *tmpl;
 };
 
 struct mlx5_ct_tuple {
@@ -1064,7 +1084,7 @@ mlx5_tc_ct_match_add(struct mlx5_tc_ct_priv *priv,
 {
 	struct flow_rule *rule = flow_cls_offload_flow_rule(f);
 	struct flow_dissector_key_ct *mask, *key;
-	bool trk, est, untrk, unest, new;
+	bool trk, est, untrk, unest, new, unnew;
 	u32 ctstate = 0, ctstate_mask = 0;
 	u16 ct_state_on, ct_state_off;
 	u16 ct_state, ct_state_mask;
@@ -1102,19 +1122,16 @@ mlx5_tc_ct_match_add(struct mlx5_tc_ct_priv *priv,
 	new = ct_state_on & TCA_FLOWER_KEY_CT_FLAGS_NEW;
 	est = ct_state_on & TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED;
 	untrk = ct_state_off & TCA_FLOWER_KEY_CT_FLAGS_TRACKED;
+	unnew = ct_state_off & TCA_FLOWER_KEY_CT_FLAGS_NEW;
 	unest = ct_state_off & TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED;
 
 	ctstate |= trk ? MLX5_CT_STATE_TRK_BIT : 0;
+	ctstate |= new ? MLX5_CT_STATE_NEW_BIT : 0;
 	ctstate |= est ? MLX5_CT_STATE_ESTABLISHED_BIT : 0;
 	ctstate_mask |= (untrk || trk) ? MLX5_CT_STATE_TRK_BIT : 0;
+	ctstate_mask |= (unnew || new) ? MLX5_CT_STATE_NEW_BIT : 0;
 	ctstate_mask |= (unest || est) ? MLX5_CT_STATE_ESTABLISHED_BIT : 0;
 
-	if (new) {
-		NL_SET_ERR_MSG_MOD(extack,
-				   "matching on ct_state +new isn't supported");
-		return -EOPNOTSUPP;
-	}
-
 	if (mask->ct_zone)
 		mlx5e_tc_match_to_reg_match(spec, ZONE_TO_REG,
 					    key->ct_zone, MLX5_CT_ZONE_MASK);
@@ -1136,6 +1153,8 @@ mlx5_tc_ct_match_add(struct mlx5_tc_ct_priv *priv,
 					    MLX5_CT_LABELS_MASK);
 	}
 
+	ct_attr->ct_state = ctstate;
+
 	return 0;
 }
 
@@ -1390,10 +1409,157 @@ mlx5_tc_ct_free_pre_ct_tables(struct mlx5_ct_ft *ft)
 	mlx5_tc_ct_free_pre_ct(ft, &ft->pre_ct);
 }
 
+static void mlx5_tc_ct_set_match_dst_udp_port(struct mlx5_flow_spec *spec, u16 dst_port)
+{
+	void *headers_c = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
+				       outer_headers);
+	void *headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value,
+				       outer_headers);
+
+	MLX5_SET_TO_ONES(fte_match_set_lyr_2_4, headers_c, udp_dport);
+	MLX5_SET(fte_match_set_lyr_2_4, headers_v, udp_dport, dst_port);
+
+	spec->match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
+}
+
+static struct mlx5_tc_ct_trk_new_rule *
+tc_ct_add_trk_new_rule(struct mlx5_ct_ft *ft, int port)
+{
+	struct mlx5_tc_ct_priv *ct_priv = ft->ct_priv;
+	struct mlx5_tc_ct_trk_new_rule *trk_new_rule;
+	struct mlx5_flow_destination dest = {};
+	struct mlx5_flow_act flow_act = {};
+	struct mlx5_flow_handle *rule;
+	struct mlx5_flow_spec *spec;
+	int err;
+
+	trk_new_rule = kzalloc(sizeof(*trk_new_rule), GFP_KERNEL);
+	if (!trk_new_rule)
+		return ERR_PTR(-ENOMEM);
+
+	spec = kzalloc(sizeof(*spec), GFP_KERNEL);
+	if (!spec) {
+		kfree(trk_new_rule);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST |
+			  MLX5_FLOW_CONTEXT_ACTION_MOD_HDR;
+	flow_act.flags |= FLOW_ACT_IGNORE_FLOW_LEVEL;
+	flow_act.modify_hdr = ft->trk_new_rules.modify_hdr;
+	dest.type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
+	dest.ft = ct_priv->post_ct;
+
+	mlx5e_tc_match_to_reg_match(spec, ZONE_TO_REG, ft->zone, MLX5_CT_ZONE_MASK);
+	mlx5_tc_ct_set_match_dst_udp_port(spec, port);
+
+	rule = mlx5_add_flow_rules(ct_priv->trk_new_ct, spec, &flow_act, &dest, 1);
+	if (IS_ERR(rule)) {
+		err = PTR_ERR(rule);
+		ct_dbg("Failed to add trk_new rule for udp port %d, err %d", port, err);
+		goto err_insert;
+	}
+
+	kfree(spec);
+	trk_new_rule->flow_rule = rule;
+	list_add_tail(&trk_new_rule->list, &ft->trk_new_rules.rules);
+	return trk_new_rule;
+
+err_insert:
+	kfree(spec);
+	kfree(trk_new_rule);
+	return ERR_PTR(err);
+}
+
+static void
+tc_ct_del_trk_new_rule(struct mlx5_tc_ct_trk_new_rule *rule)
+{
+	list_del(&rule->list);
+	mlx5_del_flow_rules(rule->flow_rule);
+	kfree(rule);
+}
+
+static int
+tc_ct_init_trk_new_rules(struct mlx5_ct_ft *ft)
+{
+	struct mlx5_tc_ct_priv *ct_priv = ft->ct_priv;
+	struct mlx5_tc_ct_trk_new_rule *rule, *tmp;
+	struct mlx5e_tc_mod_hdr_acts mod_acts = {};
+	struct mlx5_modify_hdr *mod_hdr;
+	struct mlx5e_priv *priv;
+	u32 ct_state;
+	int i, err;
+
+	priv = netdev_priv(ct_priv->netdev);
+
+	ct_state = MLX5_CT_STATE_TRK_BIT | MLX5_CT_STATE_NEW_BIT;
+	err = mlx5e_tc_match_to_reg_set(priv->mdev, &mod_acts, ct_priv->ns_type,
+					CTSTATE_TO_REG, ct_state);
+	if (err) {
+		ct_dbg("Failed to set register for ct trk_new");
+		goto err_set_registers;
+	}
+
+	err = mlx5e_tc_match_to_reg_set(priv->mdev, &mod_acts, ct_priv->ns_type,
+					ZONE_RESTORE_TO_REG, ft->zone_restore_id);
+	if (err) {
+		ct_dbg("Failed to set register for ct trk_new zone restore");
+		goto err_set_registers;
+	}
+
+	mod_hdr = mlx5_modify_header_alloc(priv->mdev,
+					   ct_priv->ns_type,
+					   mod_acts.num_actions,
+					   mod_acts.actions);
+	if (IS_ERR(mod_hdr)) {
+		err = PTR_ERR(mod_hdr);
+		ct_dbg("Failed to create ct trk_new mod hdr");
+		goto err_set_registers;
+	}
+
+	ft->trk_new_rules.modify_hdr = mod_hdr;
+	dealloc_mod_hdr_actions(&mod_acts);
+
+	for (i = 0; i < DEFAULT_UDP_PORTS; i++) {
+		int port = default_udp_ports[i];
+
+		rule = tc_ct_add_trk_new_rule(ft, port);
+		if (IS_ERR(rule)) {
+			err = PTR_ERR(rule);
+			goto err_insert;
+		}
+	}
+
+	return 0;
+
+err_insert:
+	list_for_each_entry_safe(rule, tmp, &ft->trk_new_rules.rules, list)
+		tc_ct_del_trk_new_rule(rule);
+	mlx5_modify_header_dealloc(priv->mdev, mod_hdr);
+err_set_registers:
+	dealloc_mod_hdr_actions(&mod_acts);
+	netdev_warn(priv->netdev,
+		    "Failed to offload ct trk_new flow, err %d\n", err);
+	return err;
+}
+
+static void
+tc_ct_cleanup_trk_new_rules(struct mlx5_ct_ft *ft)
+{
+	struct mlx5_tc_ct_priv *ct_priv = ft->ct_priv;
+	struct mlx5_tc_ct_trk_new_rule *rule, *tmp;
+	struct mlx5e_priv *priv;
+
+	list_for_each_entry_safe(rule, tmp, &ft->trk_new_rules.rules, list)
+		tc_ct_del_trk_new_rule(rule);
+
+	priv = netdev_priv(ct_priv->netdev);
+	mlx5_modify_header_dealloc(priv->mdev, ft->trk_new_rules.modify_hdr);
+}
+
 static struct mlx5_ct_ft *
 mlx5_tc_ct_add_ft_cb(struct mlx5_tc_ct_priv *ct_priv, u16 zone,
 		     struct nf_flowtable *nf_ft)
 {
+	struct nf_conntrack_zone ctzone;
 	struct mlx5_ct_ft *ft;
 	int err;
 
@@ -1415,11 +1581,16 @@ mlx5_tc_ct_add_ft_cb(struct mlx5_tc_ct_priv *ct_priv, u16 zone,
 	ft->nf_ft = nf_ft;
 	ft->ct_priv = ct_priv;
 	refcount_set(&ft->refcount, 1);
+	INIT_LIST_HEAD(&ft->trk_new_rules.rules);
 
 	err = mlx5_tc_ct_alloc_pre_ct_tables(ft);
 	if (err)
 		goto err_alloc_pre_ct;
 
+	err = tc_ct_init_trk_new_rules(ft);
+	if (err)
+		goto err_add_trk_new_rules;
+
 	err = rhashtable_init(&ft->ct_entries_ht, &cts_ht_params);
 	if (err)
 		goto err_init;
@@ -1429,6 +1600,14 @@ mlx5_tc_ct_add_ft_cb(struct mlx5_tc_ct_priv *ct_priv, u16 zone,
 	if (err)
 		goto err_insert;
 
+	nf_ct_zone_init(&ctzone, zone, NF_CT_DEFAULT_ZONE_DIR, 0);
+	ft->tmpl = nf_ct_tmpl_alloc(&init_net, &ctzone, GFP_KERNEL);
+	if (!ft->tmpl) {
+		err = -ENOMEM;
+		goto err_tmpl;
+	}
+
+	__set_bit(IPS_CONFIRMED_BIT, &ft->tmpl->status);
+	nf_conntrack_get(&ft->tmpl->ct_general);
+
 	err = nf_flow_table_offload_add_cb(ft->nf_ft,
 					   mlx5_tc_ct_block_flow_offload, ft);
 	if (err)
@@ -1437,10 +1616,14 @@ mlx5_tc_ct_add_ft_cb(struct mlx5_tc_ct_priv *ct_priv, u16 zone,
 	return ft;
 
 err_add_cb:
+	nf_conntrack_put(&ft->tmpl->ct_general);
+err_tmpl:
 	rhashtable_remove_fast(&ct_priv->zone_ht, &ft->node, zone_params);
 err_insert:
 	rhashtable_destroy(&ft->ct_entries_ht);
 err_init:
+	tc_ct_cleanup_trk_new_rules(ft);
+err_add_trk_new_rules:
 	mlx5_tc_ct_free_pre_ct_tables(ft);
 err_alloc_pre_ct:
 	mapping_remove(ct_priv->zone_mapping, ft->zone_restore_id);
@@ -1471,6 +1654,8 @@ mlx5_tc_ct_del_ft_cb(struct mlx5_tc_ct_priv *ct_priv, struct mlx5_ct_ft *ft)
 	rhashtable_free_and_destroy(&ft->ct_entries_ht,
 				    mlx5_tc_ct_flush_ft_entry,
 				    ct_priv);
+	nf_conntrack_put(&ft->tmpl->ct_general);
+	tc_ct_cleanup_trk_new_rules(ft);
 	mlx5_tc_ct_free_pre_ct_tables(ft);
 	mapping_remove(ct_priv->zone_mapping, ft->zone_restore_id);
 	kfree(ft);
@@ -2100,6 +2285,27 @@ mlx5_tc_ct_clean(struct mlx5_tc_ct_priv *ct_priv)
 	kfree(ct_priv);
 }
 
+static bool
+mlx5e_tc_ct_restore_trk_new(struct mlx5_tc_ct_priv *ct_priv,
+			    struct sk_buff *skb,
+			    struct mlx5_ct_tuple *tuple,
+			    u16 zone)
+{
+	struct mlx5_ct_ft *ft;
+
+	if ((ntohs(tuple->port.dst) != IANA_VXLAN_UDP_PORT) &&
+	    (ntohs(tuple->port.dst) != ROCE_V2_UDP_DPORT))
+		return false;
+
+	ft = rhashtable_lookup_fast(&ct_priv->zone_ht, &zone, zone_params);
+	if (!ft)
+		return false;
+
+	nf_conntrack_get(&ft->tmpl->ct_general);
+	nf_ct_set(skb, ft->tmpl, IP_CT_NEW);
+	return true;
+}
+
 bool
 mlx5e_tc_ct_restore_flow(struct mlx5_tc_ct_priv *ct_priv,
 			 struct sk_buff *skb, u8 zone_restore_id)
@@ -2123,7 +2329,7 @@ mlx5e_tc_ct_restore_flow(struct mlx5_tc_ct_priv *ct_priv,
 		entry = rhashtable_lookup_fast(&ct_priv->ct_tuples_nat_ht,
 					       &tuple, tuples_nat_ht_params);
 	if (!entry)
-		return false;
+		return mlx5e_tc_ct_restore_trk_new(ct_priv, skb, &tuple, zone);
 
 	tcf_ct_flow_table_restore_skb(skb, entry->restore_cookie);
 	return true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h
index 6503b614337c..f730dbfbb02c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h
@@ -10,6 +10,11 @@
 
 #include "en.h"
 
+#define MLX5_CT_STATE_ESTABLISHED_BIT BIT(1)
+#define MLX5_CT_STATE_TRK_BIT BIT(2)
+#define MLX5_CT_STATE_NAT_BIT BIT(3)
+#define MLX5_CT_STATE_NEW_BIT BIT(4)
+
 struct mlx5_flow_attr;
 struct mlx5e_tc_mod_hdr_acts;
 struct mlx5_rep_uplink_priv;
@@ -28,6 +33,7 @@ struct mlx5_ct_attr {
 	struct mlx5_ct_flow *ct_flow;
 	struct nf_flowtable *nf_ft;
 	u32 ct_labels_id;
+	u32 ct_state;
 };
 
 #define zone_to_reg_ct {\
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 56aa39ac1a1c..5cf7c221404b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -3255,11 +3255,11 @@ static bool actions_match_supported(struct mlx5e_priv *priv,
 				    struct mlx5e_tc_flow *flow,
 				    struct netlink_ext_ack *extack)
 {
-	bool ct_flow = false, ct_clear = false;
+	bool ct_flow = false, ct_clear = false, ct_new = false;
 	u32 actions;
 
-	ct_clear = flow->attr->ct_attr.ct_action &
-		TCA_CT_ACT_CLEAR;
+	ct_clear = flow->attr->ct_attr.ct_action & TCA_CT_ACT_CLEAR;
+	ct_new = flow->attr->ct_attr.ct_state & MLX5_CT_STATE_NEW_BIT;
 	ct_flow = flow_flag_test(flow, CT) && !ct_clear;
 	actions = flow->attr->action;
 
@@ -3274,6 +3274,16 @@ static bool actions_match_supported(struct mlx5e_priv *priv,
 		}
 	}
 
+	if (ct_new && ct_flow) {
+		NL_SET_ERR_MSG_MOD(extack, "Can't offload ct_state new with action ct");
+		return false;
+	}
+
+	if (ct_new && flow->attr->dest_chain) {
+		NL_SET_ERR_MSG_MOD(extack, "Can't offload ct_state new with action goto");
+		return false;
+	}
+
 	if (actions & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
 		return modify_header_match_supported(priv, &parse_attr->spec,
 						     flow_action, actions,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [net-next 10/15] net/mlx5e: CT: Add support for mirroring
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (8 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 09/15] net/mlx5e: CT: Support offload of " Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 11/15] net/mlx5e: CT, Avoid false lock dependency warning Saeed Mahameed
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Paul Blakey, Maor Dickman, Roi Dayan, Saeed Mahameed

From: Paul Blakey <paulb@mellanox.com>

Add support for mirroring before the CT action by splitting the pre-ct
rule. Mirror outputs are done first, on the tc chain/prio table rule
(the fwd rule), which then forwards to a per-port fwd table. On this
fwd table, we insert the original pre-ct rule, which forwards to the
ct/ct-nat table.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/tc_ct.c    |  4 +++
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 25 ++++++++++---------
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index b0c357f755d4..9a189c06ab56 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -1825,6 +1825,10 @@ __mlx5_tc_ct_flow_offload(struct mlx5_tc_ct_priv *ct_priv,
 	ct_flow->post_ct_attr->prio = 0;
 	ct_flow->post_ct_attr->ft = ct_priv->post_ct;
 
+	/* Splits were handled before CT */
+	if (ct_priv->ns_type == MLX5_FLOW_NAMESPACE_FDB)
+		ct_flow->post_ct_attr->esw_attr->split_count = 0;
+
 	ct_flow->post_ct_attr->inner_match_level = MLX5_MATCH_NONE;
 	ct_flow->post_ct_attr->outer_match_level = MLX5_MATCH_NONE;
 	ct_flow->post_ct_attr->action &= ~(MLX5_FLOW_CONTEXT_ACTION_DECAP);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 5cf7c221404b..89bb464850a1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1165,19 +1165,23 @@ mlx5e_tc_offload_fdb_rules(struct mlx5_eswitch *esw,
 	if (flow_flag_test(flow, CT)) {
 		mod_hdr_acts = &attr->parse_attr->mod_hdr_acts;
 
-		return mlx5_tc_ct_flow_offload(get_ct_priv(flow->priv),
+		rule = mlx5_tc_ct_flow_offload(get_ct_priv(flow->priv),
 					       flow, spec, attr,
 					       mod_hdr_acts);
+	} else {
+		rule = mlx5_eswitch_add_offloaded_rule(esw, spec, attr);
 	}
 
-	rule = mlx5_eswitch_add_offloaded_rule(esw, spec, attr);
 	if (IS_ERR(rule))
 		return rule;
 
 	if (attr->esw_attr->split_count) {
 		flow->rule[1] = mlx5_eswitch_add_fwd_rule(esw, spec, attr);
 		if (IS_ERR(flow->rule[1])) {
-			mlx5_eswitch_del_offloaded_rule(esw, rule, attr);
+			if (flow_flag_test(flow, CT))
+				mlx5_tc_ct_delete_flow(get_ct_priv(flow->priv), flow, attr);
+			else
+				mlx5_eswitch_del_offloaded_rule(esw, rule, attr);
 			return flow->rule[1];
 		}
 	}
@@ -1192,14 +1196,14 @@ mlx5e_tc_unoffload_fdb_rules(struct mlx5_eswitch *esw,
 {
 	flow_flag_clear(flow, OFFLOADED);
 
+	if (attr->esw_attr->split_count)
+		mlx5_eswitch_del_fwd_rule(esw, flow->rule[1], attr);
+
 	if (flow_flag_test(flow, CT)) {
 		mlx5_tc_ct_delete_flow(get_ct_priv(flow->priv), flow, attr);
 		return;
 	}
 
-	if (attr->esw_attr->split_count)
-		mlx5_eswitch_del_fwd_rule(esw, flow->rule[1], attr);
-
 	mlx5_eswitch_del_offloaded_rule(esw, flow->rule[0], attr);
 }
 
@@ -3264,7 +3268,8 @@ static bool actions_match_supported(struct mlx5e_priv *priv,
 	actions = flow->attr->action;
 
 	if (mlx5e_is_eswitch_flow(flow)) {
-		if (flow->attr->esw_attr->split_count && ct_flow) {
+		if (flow->attr->esw_attr->split_count && ct_flow &&
+		    !MLX5_CAP_GEN(flow->attr->esw_attr->in_mdev, reg_c_preserve)) {
 			/* All registers used by ct are cleared when using
 			 * split rules.
 			 */
@@ -4373,6 +4378,7 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv,
 				return err;
 
 			flow_flag_set(flow, CT);
+			esw_attr->split_count = esw_attr->out_count;
 			break;
 		default:
 			NL_SET_ERR_MSG_MOD(extack, "The offload action is not supported");
@@ -4432,11 +4438,6 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv,
 			return -EOPNOTSUPP;
 		}
 
-		if (attr->action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) {
-			NL_SET_ERR_MSG_MOD(extack,
-					   "Mirroring goto chain rules isn't supported");
-			return -EOPNOTSUPP;
-		}
 		attr->action |= MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
 	}
 
-- 
2.26.2



* [net-next 11/15] net/mlx5e: CT, Avoid false lock dependency warning
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (9 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 10/15] net/mlx5e: CT: Add support for mirroring Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 12/15] net/mlx5e: IPsec, Enclose csum logic under ipsec config Saeed Mahameed
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Roi Dayan, Paul Blakey, Saeed Mahameed

From: Roi Dayan <roid@nvidia.com>

To avoid a false lock dependency warning, give the ct_entries_ht lock
a class of its own. Deleting the last flow from a group and then the
group itself reaches del_sw_flow_group(), which calls
rhashtable_destroy() on fg->ftes_hash and takes that table's
ht->mutex, a different mutex than the one taken here.

======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc2+ #8 Tainted: G           O
------------------------------------------------------
revalidator23/24009 is trying to acquire lock:
ffff888128d83828 (&node->lock){++++}-{3:3}, at: mlx5_del_flow_rules+0x83/0x7a0 [mlx5_core]

but task is already holding lock:
ffff8881081ef518 (&ht->mutex){+.+.}-{3:3}, at: rhashtable_free_and_destroy+0x37/0x720

which lock already depends on the new lock.

Fixes: 9808dd0a2aee ("net/mlx5e: CT: Use rhashtable's ct entries instead of a separate list")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index 9a189c06ab56..510eab3204d4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -1555,6 +1555,14 @@ tc_ct_cleanup_trk_new_rules(struct mlx5_ct_ft *ft)
 	mlx5_modify_header_dealloc(priv->mdev, ft->trk_new_rules.modify_hdr);
 }
 
+/* To avoid a false lock dependency warning, give the ct_entries_ht
+ * lock a class of its own. Deleting the last flow from a group and
+ * then the group itself reaches del_sw_flow_group(), which calls
+ * rhashtable_destroy() on fg->ftes_hash and takes that table's
+ * ht->mutex, a different mutex than the one taken here.
+ */
+static struct lock_class_key ct_entries_ht_lock_key;
+
 static struct mlx5_ct_ft *
 mlx5_tc_ct_add_ft_cb(struct mlx5_tc_ct_priv *ct_priv, u16 zone,
 		     struct nf_flowtable *nf_ft)
@@ -1595,6 +1603,8 @@ mlx5_tc_ct_add_ft_cb(struct mlx5_tc_ct_priv *ct_priv, u16 zone,
 	if (err)
 		goto err_init;
 
+	lockdep_set_class(&ft->ct_entries_ht.mutex, &ct_entries_ht_lock_key);
+
 	err = rhashtable_insert_fast(&ct_priv->zone_ht, &ft->node,
 				     zone_params);
 	if (err)
-- 
2.26.2



* [net-next 12/15] net/mlx5e: IPsec, Enclose csum logic under ipsec config
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (10 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 11/15] net/mlx5e: CT, Avoid false lock dependency warning Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 13/15] net/mlx5e: IPsec, Avoid unreachable return Saeed Mahameed
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Tariq Toukan, Raed Salem, Huy Nguyen, Saeed Mahameed

From: Tariq Toukan <tariqt@nvidia.com>

All IPsec logic should be wrapped under its compile flag, including
the checksum logic. Introduce an inline function in the IPsec datapath
header, with a corresponding stub for builds without IPsec support.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h  | 10 ++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c        |  3 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h
index 9df9b9a8e09b..fabf4b6b2b84 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h
@@ -87,6 +87,11 @@ static inline bool mlx5e_ipsec_is_tx_flow(struct mlx5e_accel_tx_ipsec_state *ips
 	return ipsec_st->x;
 }
 
+static inline bool mlx5e_ipsec_eseg_meta(struct mlx5_wqe_eth_seg *eseg)
+{
+	return eseg->flow_table_metadata & cpu_to_be32(MLX5_ETH_WQE_FT_META_IPSEC);
+}
+
 void mlx5e_ipsec_tx_build_eseg(struct mlx5e_priv *priv, struct sk_buff *skb,
 			       struct mlx5_wqe_eth_seg *eseg);
 #else
@@ -96,6 +101,11 @@ void mlx5e_ipsec_offload_handle_rx_skb(struct net_device *netdev,
 				       struct mlx5_cqe64 *cqe)
 {}
 
+static inline bool mlx5e_ipsec_eseg_meta(struct mlx5_wqe_eth_seg *eseg)
+{
+	return false;
+}
+
 static inline bool mlx5_ipsec_is_rx_flow(struct mlx5_cqe64 *cqe) { return false; }
 #endif /* CONFIG_MLX5_EN_IPSEC */
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index e47e2a0059d0..22be21d5cdad 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -241,9 +241,8 @@ mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
 		sq->stats->csum_partial++;
 #endif
-	} else if (unlikely(eseg->flow_table_metadata & cpu_to_be32(MLX5_ETH_WQE_FT_META_IPSEC))) {
+	} else if (unlikely(mlx5e_ipsec_eseg_meta(eseg))) {
 		ipsec_txwqe_build_eseg_csum(sq, skb, eseg);
-
 	} else
 		sq->stats->csum_none++;
 }
-- 
2.26.2



* [net-next 13/15] net/mlx5e: IPsec, Avoid unreachable return
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (11 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 12/15] net/mlx5e: IPsec, Enclose csum logic under ipsec config Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 14/15] net/mlx5e: IPsec, Inline feature_check fast-path function Saeed Mahameed
  2021-01-08  5:30 ` [net-next 15/15] net/mlx5e: IPsec, Remove unnecessary config flag usage Saeed Mahameed
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Tariq Toukan, Huy Nguyen, Saeed Mahameed

From: Tariq Toukan <tariqt@nvidia.com>

Simple code improvement: move the default return statement under the
#else block.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
index 899b98aca0d3..fb89b24deb2b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -142,9 +142,9 @@ static inline bool mlx5e_accel_tx_is_ipsec_flow(struct mlx5e_accel_tx_state *sta
 {
 #ifdef CONFIG_MLX5_EN_IPSEC
 	return mlx5e_ipsec_is_tx_flow(&state->ipsec);
-#endif
-
+#else
 	return false;
+#endif
 }
 
 static inline unsigned int mlx5e_accel_tx_ids_len(struct mlx5e_txqsq *sq,
-- 
2.26.2



* [net-next 14/15] net/mlx5e: IPsec, Inline feature_check fast-path function
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (12 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 13/15] net/mlx5e: IPsec, Avoid unreachable return Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  2021-01-08  5:30 ` [net-next 15/15] net/mlx5e: IPsec, Remove unnecessary config flag usage Saeed Mahameed
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Tariq Toukan, Raed Salem, Huy Nguyen, Saeed Mahameed

From: Tariq Toukan <tariqt@nvidia.com>

Feature check functions are on the TX fast-path of all SKBs, not only
IPsec traffic.
Move the IPsec feature check function into a header and make it inline.
Add a stub and drop the config flag condition in the Ethernet main
driver file.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c  | 14 --------------
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.h  | 19 +++++++++++++++++--
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 --
 3 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
index a9b45606dbdb..a97e8d205094 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
@@ -497,20 +497,6 @@ void mlx5e_ipsec_offload_handle_rx_skb(struct net_device *netdev,
 	}
 }
 
-bool mlx5e_ipsec_feature_check(struct sk_buff *skb, struct net_device *netdev,
-			       netdev_features_t features)
-{
-	struct sec_path *sp = skb_sec_path(skb);
-	struct xfrm_state *x;
-
-	if (sp && sp->len) {
-		x = sp->xvec[0];
-		if (x && x->xso.offload_handle)
-			return true;
-	}
-	return false;
-}
-
 void mlx5e_ipsec_build_inverse_table(void)
 {
 	u16 mss_inv;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h
index fabf4b6b2b84..3e80742a3caf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h
@@ -57,8 +57,6 @@ struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev,
 					  struct sk_buff *skb, u32 *cqe_bcnt);
 
 void mlx5e_ipsec_inverse_table_init(void);
-bool mlx5e_ipsec_feature_check(struct sk_buff *skb, struct net_device *netdev,
-			       netdev_features_t features);
 void mlx5e_ipsec_set_iv_esn(struct sk_buff *skb, struct xfrm_state *x,
 			    struct xfrm_offload *xo);
 void mlx5e_ipsec_set_iv(struct sk_buff *skb, struct xfrm_state *x,
@@ -94,6 +92,21 @@ static inline bool mlx5e_ipsec_eseg_meta(struct mlx5_wqe_eth_seg *eseg)
 
 void mlx5e_ipsec_tx_build_eseg(struct mlx5e_priv *priv, struct sk_buff *skb,
 			       struct mlx5_wqe_eth_seg *eseg);
+
+static inline bool mlx5e_ipsec_feature_check(struct sk_buff *skb, struct net_device *netdev,
+					     netdev_features_t features)
+{
+	struct sec_path *sp = skb_sec_path(skb);
+
+	if (sp && sp->len) {
+		struct xfrm_state *x = sp->xvec[0];
+
+		if (x && x->xso.offload_handle)
+			return true;
+	}
+	return false;
+}
+
 #else
 static inline
 void mlx5e_ipsec_offload_handle_rx_skb(struct net_device *netdev,
@@ -107,6 +120,8 @@ static inline bool mlx5e_ipsec_eseg_meta(struct mlx5_wqe_eth_seg *eseg)
 }
 
 static inline bool mlx5_ipsec_is_rx_flow(struct mlx5_cqe64 *cqe) { return false; }
+static inline bool mlx5e_ipsec_feature_check(struct sk_buff *skb, struct net_device *netdev,
+					     netdev_features_t features) { return false; }
 #endif /* CONFIG_MLX5_EN_IPSEC */
 
 #endif /* __MLX5E_IPSEC_RXTX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f27f509ab028..c00eef14ee6c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4375,10 +4375,8 @@ netdev_features_t mlx5e_features_check(struct sk_buff *skb,
 	features = vlan_features_check(skb, features);
 	features = vxlan_features_check(skb, features);
 
-#ifdef CONFIG_MLX5_EN_IPSEC
 	if (mlx5e_ipsec_feature_check(skb, netdev, features))
 		return features;
-#endif
 
 	/* Validate if the tunneled packet is being offloaded by HW */
 	if (skb->encapsulation &&
-- 
2.26.2



* [net-next 15/15] net/mlx5e: IPsec, Remove unnecessary config flag usage
  2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
                   ` (13 preceding siblings ...)
  2021-01-08  5:30 ` [net-next 14/15] net/mlx5e: IPsec, Inline feature_check fast-path function Saeed Mahameed
@ 2021-01-08  5:30 ` Saeed Mahameed
  14 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2021-01-08  5:30 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Tariq Toukan, Raed Salem, Huy Nguyen, Saeed Mahameed

From: Tariq Toukan <tariqt@nvidia.com>

MLX5_IPSEC_DEV() is always defined, so there is no need to protect its
call sites under the CONFIG_MLX5_EN_IPSEC config flag, especially on
the slow path.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 --
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 2 --
 2 files changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c00eef14ee6c..5309bc9f3197 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2068,10 +2068,8 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 	u32 buf_size = 0;
 	int i;
 
-#ifdef CONFIG_MLX5_EN_IPSEC
 	if (MLX5_IPSEC_DEV(mdev))
 		byte_count += MLX5E_METADATA_ETHER_LEN;
-#endif
 
 	if (mlx5e_rx_is_linear_skb(params, xsk)) {
 		int frag_stride;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 7f5851c61218..cb8e3d2b4750 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1786,12 +1786,10 @@ int mlx5e_rq_set_handlers(struct mlx5e_rq *rq, struct mlx5e_params *params, bool
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
 		rq->handle_rx_cqe = priv->profile->rx_handlers->handle_rx_cqe_mpwqe;
-#ifdef CONFIG_MLX5_EN_IPSEC
 		if (MLX5_IPSEC_DEV(mdev)) {
 			netdev_err(netdev, "MPWQE RQ with IPSec offload not supported\n");
 			return -EINVAL;
 		}
-#endif
 		if (!rq->handle_rx_cqe) {
 			netdev_err(netdev, "RX handler of MPWQE RQ is not set\n");
 			return -EINVAL;
-- 
2.26.2



* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-08  5:30 ` [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules Saeed Mahameed
@ 2021-01-08 21:48   ` Marcelo Ricardo Leitner
  2021-01-10  7:45     ` Roi Dayan
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Ricardo Leitner @ 2021-01-08 21:48 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, netdev, Roi Dayan, Paul Blakey,
	Saeed Mahameed

Hi,

On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
> From: Roi Dayan <roid@nvidia.com>
> 
> Connection tracking associates the connection state per packet. The
> first packet of a connection is assigned with the +trk+new state. The
> connection enters the established state once a packet is seen on the
> other direction.
> 
> Currently we offload only the established flows. However, UDP traffic
> using source port entropy (e.g. vxlan, RoCE) will never enter the
> established state. Such protocols do not require stateful processing,
> and therefore could be offloaded.

If it doesn't require stateful processing, please enlighten me on why
conntrack is being used in the first place. What's the use case here?

> 
> The change in the model is that a miss on the CT table will be forwarded
> to a new +trk+new ct table and a miss there will be forwarded to the slow
> path table.

AFAICU this new +trk+new ct table is a wildcard match on sport with
specific dports. Also AFAICU, such entries will not be visible to the
userspace then. Is this right?

  Marcelo


* Re: [net-next 09/15] net/mlx5e: CT: Support offload of +trk+new ct rules
  2021-01-08  5:30 ` [net-next 09/15] net/mlx5e: CT: Support offload of " Saeed Mahameed
@ 2021-01-08 21:59   ` Marcelo Ricardo Leitner
  2021-01-10  7:55     ` Roi Dayan
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Ricardo Leitner @ 2021-01-08 21:59 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, netdev, Roi Dayan, Paul Blakey,
	Saeed Mahameed

Hi,

On Thu, Jan 07, 2021 at 09:30:48PM -0800, Saeed Mahameed wrote:
> @@ -1429,6 +1600,14 @@ mlx5_tc_ct_add_ft_cb(struct mlx5_tc_ct_priv *ct_priv, u16 zone,
>  	if (err)
>  		goto err_insert;
>  
> +	nf_ct_zone_init(&ctzone, zone, NF_CT_DEFAULT_ZONE_DIR, 0);
> +	ft->tmpl = nf_ct_tmpl_alloc(&init_net, &ctzone, GFP_KERNEL);

I didn't test, but I think this will add a hard dependency on
nf_conntrack_core and will cause conntrack to always be loaded by
mlx5_core, which is not good for some use cases.
nf_ct_tmpl_alloc() is defined in nf_conntrack_core.c.

762f926d6f19 ("net/sched: act_ct: Make tcf_ct_flow_table_restore_skb
inline") was done similarly to avoid this.

> +	if (!ft->tmpl)
> +		goto err_tmpl;
> +
> +	__set_bit(IPS_CONFIRMED_BIT, &ft->tmpl->status);
> +	nf_conntrack_get(&ft->tmpl->ct_general);
> +
>  	err = nf_flow_table_offload_add_cb(ft->nf_ft,
>  					   mlx5_tc_ct_block_flow_offload, ft);
>  	if (err)


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-08 21:48   ` Marcelo Ricardo Leitner
@ 2021-01-10  7:45     ` Roi Dayan
  2021-01-10  7:52       ` Roi Dayan
  0 siblings, 1 reply; 30+ messages in thread
From: Roi Dayan @ 2021-01-10  7:45 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, netdev, Paul Blakey,
	Saeed Mahameed, Oz Shlomo



On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
> Hi,
> 
> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>> From: Roi Dayan <roid@nvidia.com>
>>
>> Connection tracking associates the connection state per packet. The
>> first packet of a connection is assigned with the +trk+new state. The
>> connection enters the established state once a packet is seen on the
>> other direction.
>>
>> Currently we offload only the established flows. However, UDP traffic
>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>> established state. Such protocols do not require stateful processing,
>> and therefore could be offloaded.
> 
> If it doesn't require stateful processing, please enlight me on why
> conntrack is being used in the first place. What's the use case here?
> 

The use case, for example, is when we have vxlan traffic but we do
conntrack on the inner packet (rules on the physical port), so we
never reach the established state; on a miss we can still offload the
traffic as normal vxlan.
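
A minimal sketch of such a setup with tc flower (illustrative only;
the device names, chain numbers and the VXLAN port here are
assumptions, not taken from this series):

```shell
# Run untracked ingress packets through conntrack on the outer header.
tc filter add dev enp8s0f0 ingress prio 1 chain 0 proto ip \
    flower ct_state -trk \
    action ct pipe action goto chain 1

# Established flows can be offloaded today.
tc filter add dev enp8s0f0 ingress prio 1 chain 1 proto ip \
    flower ct_state +trk+est \
    action mirred egress redirect dev enp8s0f0_0

# VXLAN uses source-port entropy, so its flows never leave +trk+new;
# this is the kind of rule the series makes offloadable in hardware.
tc filter add dev enp8s0f0 ingress prio 2 chain 1 proto ip \
    flower ct_state +trk+new ip_proto udp dst_port 4789 \
    action mirred egress redirect dev enp8s0f0_0
```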

>>
>> The change in the model is that a miss on the CT table will be forwarded
>> to a new +trk+new ct table and a miss there will be forwarded to the slow
>> path table.
> 
> AFAICU this new +trk+new ct table is a wildcard match on sport with
> specific dports. Also AFAICU, such entries will not be visible to the
> userspace then. Is this right?
> 
>    Marcelo
> 

right.


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-10  7:45     ` Roi Dayan
@ 2021-01-10  7:52       ` Roi Dayan
  2021-01-11 23:51         ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 30+ messages in thread
From: Roi Dayan @ 2021-01-10  7:52 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, netdev, Paul Blakey,
	Saeed Mahameed, Oz Shlomo



On 2021-01-10 9:45 AM, Roi Dayan wrote:
> 
> 
> On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
>> Hi,
>>
>> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>>> From: Roi Dayan <roid@nvidia.com>
>>>
>>> Connection tracking associates the connection state per packet. The
>>> first packet of a connection is assigned with the +trk+new state. The
>>> connection enters the established state once a packet is seen on the
>>> other direction.
>>>
>>> Currently we offload only the established flows. However, UDP traffic
>>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>>> established state. Such protocols do not require stateful processing,
>>> and therefore could be offloaded.
>>
>> If it doesn't require stateful processing, please enlight me on why
>> conntrack is being used in the first place. What's the use case here?
>>
> 
> The use case for example is when we have vxlan traffic but we do
> conntrack on the inner packet (rules on the physical port) so
> we never get established but on miss we can still offload as normal
> vxlan traffic.
> 

My mistake about "inner packet": we do CT on the underlay network,
i.e. on the outer header.

>>>
>>> The change in the model is that a miss on the CT table will be forwarded
>>> to a new +trk+new ct table and a miss there will be forwarded to the 
>>> slow
>>> path table.
>>
>> AFAICU this new +trk+new ct table is a wildcard match on sport with
>> specific dports. Also AFAICU, such entries will not be visible to the
>> userspace then. Is this right?
>>
>>    Marcelo
>>
> 
> right.


* Re: [net-next 09/15] net/mlx5e: CT: Support offload of +trk+new ct rules
  2021-01-08 21:59   ` Marcelo Ricardo Leitner
@ 2021-01-10  7:55     ` Roi Dayan
  0 siblings, 0 replies; 30+ messages in thread
From: Roi Dayan @ 2021-01-10  7:55 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, netdev, Paul Blakey,
	Saeed Mahameed, Oz Shlomo



On 2021-01-08 11:59 PM, Marcelo Ricardo Leitner wrote:
> Hi,
> 
> On Thu, Jan 07, 2021 at 09:30:48PM -0800, Saeed Mahameed wrote:
>> @@ -1429,6 +1600,14 @@ mlx5_tc_ct_add_ft_cb(struct mlx5_tc_ct_priv *ct_priv, u16 zone,
>>   	if (err)
>>   		goto err_insert;
>>   
>> +	nf_ct_zone_init(&ctzone, zone, NF_CT_DEFAULT_ZONE_DIR, 0);
>> +	ft->tmpl = nf_ct_tmpl_alloc(&init_net, &ctzone, GFP_KERNEL);
> 
> I didn't test but I think this will add a hard dependency to
> nf_conntrack_core and will cause conntrack to always be loaded by
> mlx5_core, which is not good for some use cases.
> nf_ct_tmpl_alloc() is defined in nf_conntrack_core.c.
> 
> 762f926d6f19 ("net/sched: act_ct: Make tcf_ct_flow_table_restore_skb
> inline") was done similarly to avoid this.
> 

Right. We will take a look at what we can do about this.
Thanks

>> +	if (!ft->tmpl)
>> +		goto err_tmpl;
>> +
>> +	__set_bit(IPS_CONFIRMED_BIT, &ft->tmpl->status);
>> +	nf_conntrack_get(&ft->tmpl->ct_general);
>> +
>>   	err = nf_flow_table_offload_add_cb(ft->nf_ft,
>>   					   mlx5_tc_ct_block_flow_offload, ft);
>>   	if (err)


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-10  7:52       ` Roi Dayan
@ 2021-01-11 23:51         ` Marcelo Ricardo Leitner
  2021-01-12  9:27           ` Oz Shlomo
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Ricardo Leitner @ 2021-01-11 23:51 UTC (permalink / raw)
  To: Roi Dayan
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, netdev,
	Paul Blakey, Saeed Mahameed, Oz Shlomo

On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
> 
> 
> On 2021-01-10 9:45 AM, Roi Dayan wrote:
> > 
> > 
> > On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
> > > Hi,
> > > 
> > > On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
> > > > From: Roi Dayan <roid@nvidia.com>
> > > > 
> > > > Connection tracking associates the connection state per packet. The
> > > > first packet of a connection is assigned with the +trk+new state. The
> > > > connection enters the established state once a packet is seen on the
> > > > other direction.
> > > > 
> > > > Currently we offload only the established flows. However, UDP traffic
> > > > using source port entropy (e.g. vxlan, RoCE) will never enter the
> > > > established state. Such protocols do not require stateful processing,
> > > > and therefore could be offloaded.
> > > 
> > > If it doesn't require stateful processing, please enlight me on why
> > > conntrack is being used in the first place. What's the use case here?
> > > 
> > 
> > The use case for example is when we have vxlan traffic but we do
> > conntrack on the inner packet (rules on the physical port) so
> > we never get established but on miss we can still offload as normal
> > vxlan traffic.
> > 
> 
> my mistake about "inner packet". we do CT on the underlay network, i.e.
> the outer header.

Then I don't see why the CT match is being used there. Isn't it a
config issue/waste of resources? What is CT adding to the
matches/actions being done on these flows?

> 
> > > > 
> > > > The change in the model is that a miss on the CT table will be forwarded
> > > > to a new +trk+new ct table and a miss there will be forwarded to
> > > > the slow
> > > > path table.
> > > 
> > > AFAICU this new +trk+new ct table is a wildcard match on sport with
> > > specific dports. Also AFAICU, such entries will not be visible to the
> > > userspace then. Is this right?
> > > 
> > >    Marcelo
> > > 
> > 
> > right.

Thanks,
Marcelo


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-11 23:51         ` Marcelo Ricardo Leitner
@ 2021-01-12  9:27           ` Oz Shlomo
  2021-01-14 13:02             ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 30+ messages in thread
From: Oz Shlomo @ 2021-01-12  9:27 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Roi Dayan
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, netdev,
	Paul Blakey, Saeed Mahameed



On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
> On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
>>
>>
>> On 2021-01-10 9:45 AM, Roi Dayan wrote:
>>>
>>>
>>> On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
>>>> Hi,
>>>>
>>>> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>>>>> From: Roi Dayan <roid@nvidia.com>
>>>>>
>>>>> Connection tracking associates the connection state per packet. The
>>>>> first packet of a connection is assigned with the +trk+new state. The
>>>>> connection enters the established state once a packet is seen on the
>>>>> other direction.
>>>>>
>>>>> Currently we offload only the established flows. However, UDP traffic
>>>>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>>>>> established state. Such protocols do not require stateful processing,
>>>>> and therefore could be offloaded.
>>>>
>>>> If it doesn't require stateful processing, please enlight me on why
>>>> conntrack is being used in the first place. What's the use case here?
>>>>
>>>
>>> The use case for example is when we have vxlan traffic but we do
>>> conntrack on the inner packet (rules on the physical port) so
>>> we never get established but on miss we can still offload as normal
>>> vxlan traffic.
>>>
>>
>> my mistake about "inner packet". we do CT on the underlay network, i.e.
>> the outer header.
> 
> I miss why the CT match is being used there then. Isn't it a config
> issue/waste of resources? What is CT adding to the matches/actions
> being done on these flows?
> 

Consider a use case where the network port receives both east-west encapsulated traffic and 
north-south non-encapsulated traffic that requires NAT.

One possible configuration is to first apply the CT-NAT action.
Established north-south connections will successfully execute the nat action and will set the +est 
ct state.
However, the +new state may apply either to valid east-west traffic (e.g. vxlan), due to source port 
entropy, or to insecure north-south traffic that the firewall should block. The user may distinguish 
between the two cases, for example, by matching on the destination udp port.
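
The scenario described above can be sketched with tc flower roughly as
follows (illustrative only; the device names, zone, chain numbers and
the VXLAN port are assumptions on my side):

```shell
# Apply CT with NAT on untracked ingress traffic first.
tc filter add dev enp8s0f0 ingress prio 1 chain 0 proto ip \
    flower ct_state -trk \
    action ct zone 1 nat pipe action goto chain 1

# North-south flows that see a reply become +est: allow them.
tc filter add dev enp8s0f0 ingress prio 1 chain 1 proto ip \
    flower ct_state +trk+est \
    action mirred egress redirect dev enp8s0f0_0

# East-west VXLAN never leaves +new (source-port entropy):
# recognize it by destination port and allow it.
tc filter add dev enp8s0f0 ingress prio 2 chain 1 proto ip \
    flower ct_state +trk+new ip_proto udp dst_port 4789 \
    action mirred egress redirect dev enp8s0f0_0

# Anything else still +new is insecure north-south traffic: drop it.
tc filter add dev enp8s0f0 ingress prio 3 chain 1 proto ip \
    flower ct_state +trk+new \
    action drop
```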


>>
>>>>>
>>>>> The change in the model is that a miss on the CT table will be forwarded
>>>>> to a new +trk+new ct table and a miss there will be forwarded to
>>>>> the slow
>>>>> path table.
>>>>
>>>> AFAICU this new +trk+new ct table is a wildcard match on sport with
>>>> specific dports. Also AFAICU, such entries will not be visible to the
>>>> userspace then. Is this right?
>>>>
>>>>     Marcelo
>>>>
>>>
>>> right.
> 
> Thanks,
> Marcelo
> 


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-12  9:27           ` Oz Shlomo
@ 2021-01-14 13:02             ` Marcelo Ricardo Leitner
  2021-01-14 14:03               ` Oz Shlomo
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Ricardo Leitner @ 2021-01-14 13:02 UTC (permalink / raw)
  To: Oz Shlomo
  Cc: Roi Dayan, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	netdev, Paul Blakey, Saeed Mahameed

On Tue, Jan 12, 2021 at 11:27:04AM +0200, Oz Shlomo wrote:
> 
> 
> On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
> > On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
> > > 
> > > 
> > > On 2021-01-10 9:45 AM, Roi Dayan wrote:
> > > > 
> > > > 
> > > > On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
> > > > > Hi,
> > > > > 
> > > > > On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
> > > > > > From: Roi Dayan <roid@nvidia.com>
> > > > > > 
> > > > > > Connection tracking associates the connection state per packet. The
> > > > > > first packet of a connection is assigned with the +trk+new state. The
> > > > > > connection enters the established state once a packet is seen on the
> > > > > > other direction.
> > > > > > 
> > > > > > Currently we offload only the established flows. However, UDP traffic
> > > > > > using source port entropy (e.g. vxlan, RoCE) will never enter the
> > > > > > established state. Such protocols do not require stateful processing,
> > > > > > and therefore could be offloaded.
> > > > > 
> > > > > If it doesn't require stateful processing, please enlight me on why
> > > > > conntrack is being used in the first place. What's the use case here?
> > > > > 
> > > > 
> > > > The use case for example is when we have vxlan traffic but we do
> > > > conntrack on the inner packet (rules on the physical port) so
> > > > we never get established but on miss we can still offload as normal
> > > > vxlan traffic.
> > > > 
> > > 
> > > my mistake about "inner packet". we do CT on the underlay network, i.e.
> > > the outer header.
> > 
> > I miss why the CT match is being used there then. Isn't it a config
> > issue/waste of resources? What is CT adding to the matches/actions
> > being done on these flows?
> > 
> 
> Consider a use case where the network port receives both east-west
> encapsulated traffic and north-south non-encapsulated traffic that requires
> NAT.
> 
> One possible configuration is to first apply the CT-NAT action.
> Established north-south connections will successfully execute the nat action
> and will set the +est ct state.
> However, the +new state may apply either for valid east-west traffic (e.g.
> vxlan) due to source port entropy, or to insecure north-south traffic that
> the fw should block. The user may distinguish between the two cases, for
> example, by matching on the dest udp port.

Sorry but I still don't see the big picture. :-]

What do you consider as east-west and north-south traffic? My initial
understanding of east-west is traffic between VFs and north-south
would be in and out to the wire. You mentioned that north-south is
insecure, it would match, but then, non-encapsulated?

So it seems you referred to the datacenter. East-west is traffic
between hosts on the same datacenter, and north-south is traffic that
goes out of it. This seems to match.

Assuming it's the latter, then it seems that the idea is to work
around a config simplification that was done by the user.  As
mentioned on the changelog, such protocols do not require stateful
processing, and AFAICU this patch twists conntrack so that the user
can have simplified rules. Why can't the user have specific rules for
the tunnels, and other for dealing with north-south traffic? The fw
would still be able to block unwanted traffic.

My main problems with this are that it makes conntrack do stuff that
the user may not be expecting it to do, and that packets may get
matched (maybe even unintentionally) without the system having
visibility on them. Maybe I'm just missing something?

> 
> 
> > > 
> > > > > > 
> > > > > > The change in the model is that a miss on the CT table will be forwarded
> > > > > > to a new +trk+new ct table and a miss there will be forwarded to
> > > > > > the slow
> > > > > > path table.
> > > > > 
> > > > > AFAICU this new +trk+new ct table is a wildcard match on sport with
> > > > > specific dports. Also AFAICU, such entries will not be visible to the
> > > > > userspace then. Is this right?
> > > > > 
> > > > >     Marcelo
> > > > > 
> > > > 
> > > > right.
> > 
> > Thanks,
> > Marcelo
> > 
> 


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-14 13:02             ` Marcelo Ricardo Leitner
@ 2021-01-14 14:03               ` Oz Shlomo
  2021-01-14 21:50                 ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 30+ messages in thread
From: Oz Shlomo @ 2021-01-14 14:03 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Roi Dayan, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	netdev, Paul Blakey, Saeed Mahameed



On 1/14/2021 3:02 PM, Marcelo Ricardo Leitner wrote:
> On Tue, Jan 12, 2021 at 11:27:04AM +0200, Oz Shlomo wrote:
>>
>>
>> On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
>>> On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
>>>>
>>>>
>>>> On 2021-01-10 9:45 AM, Roi Dayan wrote:
>>>>>
>>>>>
>>>>> On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>>>>>>> From: Roi Dayan <roid@nvidia.com>
>>>>>>>
>>>>>>> Connection tracking associates the connection state per packet. The
>>>>>>> first packet of a connection is assigned with the +trk+new state. The
>>>>>>> connection enters the established state once a packet is seen on the
>>>>>>> other direction.
>>>>>>>
>>>>>>> Currently we offload only the established flows. However, UDP traffic
>>>>>>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>>>>>>> established state. Such protocols do not require stateful processing,
>>>>>>> and therefore could be offloaded.
>>>>>>
>>>>>> If it doesn't require stateful processing, please enlight me on why
>>>>>> conntrack is being used in the first place. What's the use case here?
>>>>>>
>>>>>
>>>>> The use case for example is when we have vxlan traffic but we do
>>>>> conntrack on the inner packet (rules on the physical port) so
>>>>> we never get established but on miss we can still offload as normal
>>>>> vxlan traffic.
>>>>>
>>>>
>>>> my mistake about "inner packet". we do CT on the underlay network, i.e.
>>>> the outer header.
>>>
>>> I miss why the CT match is being used there then. Isn't it a config
>>> issue/waste of resources? What is CT adding to the matches/actions
>>> being done on these flows?
>>>
>>
>> Consider a use case where the network port receives both east-west
>> encapsulated traffic and north-south non-encapsulated traffic that requires
>> NAT.
>>
>> One possible configuration is to first apply the CT-NAT action.
>> Established north-south connections will successfully execute the nat action
>> and will set the +est ct state.
>> However, the +new state may apply either for valid east-west traffic (e.g.
>> vxlan) due to source port entropy, or to insecure north-south traffic that
>> the fw should block. The user may distinguish between the two cases, for
>> example, by matching on the dest udp port.
> 
> Sorry but I still don't see the big picture. :-]
> 
> What do you consider as east-west and north-south traffic? My initial
> understanding of east-west is traffic between VFs and north-south
> would be in and out to the wire. You mentioned that north-south is
> insecure, it would match, but then, non-encapsulated?
> 
> So it seems you referred to the datacenter. East-west is traffic
> between hosts on the same datacenter, and north-south is traffic that
> goes out of it. This seems to match.

Right.

> 
> Assuming it's the latter, then it seems that the idea is to work
> around a config simplification that was done by the user.  As
> mentioned on the changelog, such protocols do not require stateful
> processing, and AFAICU this patch twists conntrack so that the user
> can have simplified rules. Why can't the user have specific rules for
> the tunnels, and other for dealing with north-south traffic? The fw
> would still be able to block unwanted traffic.

We cannot control what the user is doing.
This is a valid tc configuration and would work using the tc software datapath.
However, in such configurations vxlan packets would not be processed in hardware because they are 
marked as new connections.

> 
> My main problems with this is this, that it is making conntrack do
> stuff that the user may not be expecting it to do, and that packets
> may get matched (maybe even unintentionally) and the system won't have
> visibility on them. Maybe I'm just missing something?
> 

This is why we restricted this feature to udp protocols that will never enter the established state 
due to source port entropy.
Do you see a problematic use case that could arise?

>>
>>
>>>>
>>>>>>>
>>>>>>> The change in the model is that a miss on the CT table will be forwarded
>>>>>>> to a new +trk+new ct table and a miss there will be forwarded to
>>>>>>> the slow
>>>>>>> path table.
>>>>>>
>>>>>> AFAICU this new +trk+new ct table is a wildcard match on sport with
>>>>>> specific dports. Also AFAICU, such entries will not be visible to the
>>>>>> userspace then. Is this right?
>>>>>>
>>>>>>      Marcelo
>>>>>>
>>>>>
>>>>> right.
>>>
>>> Thanks,
>>> Marcelo
>>>
>>


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-14 14:03               ` Oz Shlomo
@ 2021-01-14 21:50                 ` Marcelo Ricardo Leitner
  2021-01-20 16:09                   ` Oz Shlomo
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Ricardo Leitner @ 2021-01-14 21:50 UTC (permalink / raw)
  To: Oz Shlomo
  Cc: Roi Dayan, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	netdev, Paul Blakey, Saeed Mahameed

On Thu, Jan 14, 2021 at 04:03:43PM +0200, Oz Shlomo wrote:
> 
> 
> On 1/14/2021 3:02 PM, Marcelo Ricardo Leitner wrote:
> > On Tue, Jan 12, 2021 at 11:27:04AM +0200, Oz Shlomo wrote:
> > > 
> > > 
> > > On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
> > > > On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
> > > > > 
> > > > > 
> > > > > On 2021-01-10 9:45 AM, Roi Dayan wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
> > > > > > > > From: Roi Dayan <roid@nvidia.com>
> > > > > > > > 
> > > > > > > > Connection tracking associates the connection state per packet. The
> > > > > > > > first packet of a connection is assigned with the +trk+new state. The
> > > > > > > > connection enters the established state once a packet is seen on the
> > > > > > > > other direction.
> > > > > > > > 
> > > > > > > > Currently we offload only the established flows. However, UDP traffic
> > > > > > > > using source port entropy (e.g. vxlan, RoCE) will never enter the
> > > > > > > > established state. Such protocols do not require stateful processing,
> > > > > > > > and therefore could be offloaded.
> > > > > > > 
> > > > > > > If it doesn't require stateful processing, please enlight me on why
> > > > > > > conntrack is being used in the first place. What's the use case here?
> > > > > > > 
> > > > > > 
> > > > > > The use case for example is when we have vxlan traffic but we do
> > > > > > conntrack on the inner packet (rules on the physical port) so
> > > > > > we never get established but on miss we can still offload as normal
> > > > > > vxlan traffic.
> > > > > > 
> > > > > 
> > > > > my mistake about "inner packet". we do CT on the underlay network, i.e.
> > > > > the outer header.
> > > > 
> > > > I miss why the CT match is being used there then. Isn't it a config
> > > > issue/waste of resources? What is CT adding to the matches/actions
> > > > being done on these flows?
> > > > 
> > > 
> > > Consider a use case where the network port receives both east-west
> > > encapsulated traffic and north-south non-encapsulated traffic that requires
> > > NAT.
> > > 
> > > One possible configuration is to first apply the CT-NAT action.
> > > Established north-south connections will successfully execute the nat action
> > > and will set the +est ct state.
> > > However, the +new state may apply either for valid east-west traffic (e.g.
> > > vxlan) due to source port entropy, or to insecure north-south traffic that
> > > the fw should block. The user may distinguish between the two cases, for
> > > example, by matching on the dest udp port.
> > 
> > Sorry but I still don't see the big picture. :-]
> > 
> > What do you consider as east-west and north-south traffic? My initial
> > understanding of east-west is traffic between VFs and north-south
> > would be in and out to the wire. You mentioned that north-south is
> > insecure, it would match, but then, non-encapsulated?
> > 
> > So it seems you referred to the datacenter. East-west is traffic
> > between hosts on the same datacenter, and north-south is traffic that
> > goes out of it. This seems to match.
> 
> Right.
> 
> > 
> > Assuming it's the latter, then it seems that the idea is to work
> > around a config simplification that was done by the user.  As
> > mentioned on the changelog, such protocols do not require stateful
> > processing, and AFAICU this patch twists conntrack so that the user
> > can have simplified rules. Why can't the user have specific rules for
> > the tunnels, and other for dealing with north-south traffic? The fw
> > would still be able to block unwanted traffic.
> 
> We cannot control what the user is doing.

Right, but we can educate and point them towards better configs. With
non-optimal configs it's fair to expect non-optimal effects.

> This is a valid tc configuration and would work using tc software datapath.
> However, in such configurations vxlan packets would not be processed in
> hardware because they are marked as new connections.

Makes sense.

> 
> > 
> > My main problems with this is this, that it is making conntrack do
> > stuff that the user may not be expecting it to do, and that packets
> > may get matched (maybe even unintentionally) and the system won't have
> > visibility on them. Maybe I'm just missing something?
> > 
> 
> This is why we restricted this feature to udp protocols that will never
> enter established state due to source port entropy.
> Do you see a problematic use case that can arise?

As for use cases, the only one I see is if someone wants to use this
feature for another application/dstport. It's hardcoded to the tunnel
ones.

It feels like the problem is not being solved in the right place. It
will work well for hardware processing, while in software it will work
but with a ton of conntrack entries. Such different behaviors can lead
to people wasting time, like trying to debug why the srcport is not
getting randomized when offloaded, while in fact it is, it's just
masked.

As this is a fallback (iow, search is done in 2 levels at least), I
wonder what other approaches were considered. I'm thinking two for
now. One is to add a flag to conntrack entries that allows them to be
this generic. Finding the right conntrack entry probably gets harder,
but when the user dumps /proc/net/nf_conntrack, it says something. On
how/when to add this flag, maybe act_ct can do it if the dstport
matches something, and/or via a sysctl specifying a port list.

The other one may sound an overkill, but is to work with conntrack
expectations somehow.

The first one is closer to the current proposal. It basically makes
the port list configurable and move the "do it" decision to outside
the driver, where the admin can have more control. If conntrack itself
can also leverage it and avoid having tons of entries, even better, as
then we have both behaviors in sync.

Thoughts?

> 
> > > 
> > > 
> > > > > 
> > > > > > > > 
> > > > > > > > The change in the model is that a miss on the CT table will be forwarded
> > > > > > > > to a new +trk+new ct table and a miss there will be forwarded to
> > > > > > > > the slow
> > > > > > > > path table.
> > > > > > > 
> > > > > > > AFAICU this new +trk+new ct table is a wildcard match on sport with
> > > > > > > specific dports. Also AFAICU, such entries will not be visible to the
> > > > > > > userspace then. Is this right?
> > > > > > > 
> > > > > > >      Marcelo
> > > > > > > 
> > > > > > 
> > > > > > right.
> > > > 
> > > > Thanks,
> > > > Marcelo
> > > > 
> > > 
> 


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-14 21:50                 ` Marcelo Ricardo Leitner
@ 2021-01-20 16:09                   ` Oz Shlomo
  2021-01-22  1:18                     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 30+ messages in thread
From: Oz Shlomo @ 2021-01-20 16:09 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Pablo Neira Ayuso
  Cc: Roi Dayan, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	netdev, Paul Blakey, Saeed Mahameed



On 1/14/2021 11:50 PM, Marcelo Ricardo Leitner wrote:
> On Thu, Jan 14, 2021 at 04:03:43PM +0200, Oz Shlomo wrote:
>>
>>
>> On 1/14/2021 3:02 PM, Marcelo Ricardo Leitner wrote:
>>> On Tue, Jan 12, 2021 at 11:27:04AM +0200, Oz Shlomo wrote:
>>>>
>>>>
>>>> On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
>>>>> On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
>>>>>>
>>>>>>
>>>>>> On 2021-01-10 9:45 AM, Roi Dayan wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>>>>>>>>> From: Roi Dayan <roid@nvidia.com>
>>>>>>>>>
>>>>>>>>> Connection tracking associates the connection state per packet. The
>>>>>>>>> first packet of a connection is assigned with the +trk+new state. The
>>>>>>>>> connection enters the established state once a packet is seen on the
>>>>>>>>> other direction.
>>>>>>>>>
>>>>>>>>> Currently we offload only the established flows. However, UDP traffic
>>>>>>>>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>>>>>>>>> established state. Such protocols do not require stateful processing,
>>>>>>>>> and therefore could be offloaded.
>>>>>>>>
>>>>>>>> If it doesn't require stateful processing, please enlight me on why
>>>>>>>> conntrack is being used in the first place. What's the use case here?
>>>>>>>>
>>>>>>>
>>>>>>> The use case for example is when we have vxlan traffic but we do
>>>>>>> conntrack on the inner packet (rules on the physical port) so
>>>>>>> we never get established but on miss we can still offload as normal
>>>>>>> vxlan traffic.
>>>>>>>
>>>>>>
>>>>>> my mistake about "inner packet". we do CT on the underlay network, i.e.
>>>>>> the outer header.
>>>>>
>>>>> I miss why the CT match is being used there then. Isn't it a config
>>>>> issue/waste of resources? What is CT adding to the matches/actions
>>>>> being done on these flows?
>>>>>
>>>>
>>>> Consider a use case where the network port receives both east-west
>>>> encapsulated traffic and north-south non-encapsulated traffic that requires
>>>> NAT.
>>>>
>>>> One possible configuration is to first apply the CT-NAT action.
>>>> Established north-south connections will successfully execute the nat action
>>>> and will set the +est ct state.
>>>> However, the +new state may apply either for valid east-west traffic (e.g.
>>>> vxlan) due to source port entropy, or to insecure north-south traffic that
>>>> the fw should block. The user may distinguish between the two cases, for
>>>> example, by matching on the dest udp port.
>>>
>>> Sorry but I still don't see the big picture. :-]
>>>
>>> What do you consider as east-west and north-south traffic? My initial
>>> understanding of east-west is traffic between VFs and north-south
>>> would be in and out to the wire. You mentioned that north-south is
>>> insecure, it would match, but then, non-encapsulated?
>>>
>>> So it seems you referred to the datacenter. East-west is traffic
>>> between hosts on the same datacenter, and north-south is traffic that
>>> goes out of it. This seems to match.
>>
>> Right.
>>
>>>
>>> Assuming it's the latter, then it seems that the idea is to work
>>> around a config simplification that was done by the user. As
>>> mentioned in the changelog, such protocols do not require stateful
>>> processing, and AFAICU this patch twists conntrack so that the user
>>> can have simplified rules. Why can't the user have specific rules for
>>> the tunnels, and others for dealing with north-south traffic? The fw
>>> would still be able to block unwanted traffic.
>>
>> We cannot control what the user is doing.
> 
> Right, but we can educate and point them towards better configs. With
> non-optimal configs it's fair to expect non-optimal effects.
> 
>> This is a valid tc configuration and would work with the tc software datapath.
>> However, in such configurations vxlan packets would not be processed in
>> hardware because they are marked as new connections.
> 
> Makes sense.
> 
>>
>>>
>>> My main problem with this is that it makes conntrack do things
>>> the user may not be expecting it to do, and that packets may get
>>> matched (maybe even unintentionally) without the system having
>>> visibility on them. Maybe I'm just missing something?
>>>
>>
>> This is why we restricted this feature to udp protocols that will never
>> enter the established state, due to source port entropy.
>> Do you see a problematic use case that could arise?
> 
> As for use cases, the only one I see is someone wanting to use this
> feature for another application/dstport; it's hardcoded to the tunnel
> ones.

It's a hardware offload optimization feature.
This is why we chose to support specific protocols that explicitly define source port entropy.
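To make the "source port entropy" point concrete, here is an illustrative sketch (assumption: simplified, not the kernel's actual encapsulation code) of why such UDP tunnel traffic never reaches the established CT state: the encapsulating host derives the outer UDP source port from a hash of the inner flow, so returning tunnel packets generally carry a different outer source port and never match the reply tuple of the original conntrack entry.

```python
# Toy model of outer-source-port entropy for UDP tunnels (vxlan etc.).
# PORT_MIN/PORT_MAX and the hash choice are hypothetical.
import zlib

PORT_MIN, PORT_MAX = 32768, 61000   # hypothetical ephemeral port range
VXLAN_DPORT = 4789                  # IANA-registered vxlan dst port

def entropy_src_port(inner_flow: tuple) -> int:
    """Derive the outer UDP source port from a hash of the inner flow."""
    h = zlib.crc32(repr(inner_flow).encode())
    return PORT_MIN + h % (PORT_MAX - PORT_MIN)

fwd = entropy_src_port(("10.0.0.1", "10.0.0.2", 1234, 80))
rev = entropy_src_port(("10.0.0.2", "10.0.0.1", 80, 1234))

# Conntrack tracks the outer 5-tuple (fwd -> VXLAN_DPORT); a reply would
# have to arrive as (VXLAN_DPORT -> fwd), but the returning tunnel
# packets hash the reversed inner flow and typically pick a different
# source port, so the entry stays UNREPLIED (+new) forever.
print(fwd, rev)
```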

> 
> It feels that the problem is not being solved in the right place. It
> will work well for hardware processing, while for software it will
> work but accumulate a ton of conntrack entries. Different behaviors
> can lead to people wasting time, e.g. trying to debug why the srcport
> is not getting randomized when offloaded, while in fact it is, it's
> just masked.

The SW and HW offloads are functionally identical.
You are correct that with this patch the UNREPLIED CT entries will not be visible to the user
through /proc/net/nf_conntrack.

> 
> As this is a fallback (iow, search is done in 2 levels at least), I
> wonder what other approaches were considered. I'm thinking two for
> now. One is to add a flag to conntrack entries that allow them to be
> this generic. Finding the right conntrack entry probably gets harder,
> but when the user dumps /proc/net/nf_conntrack, it says something. On
> how/when to add this flag, maybe act_ct can do it if dstport matches
> something and/or a sysctl specifying a port list.
> 
> The other one may sound an overkill, but is to work with conntrack
> expectations somehow.
> 
> The first one is closer to the current proposal. It basically makes
> the port list configurable and move the "do it" decision to outside
> the driver, where the admin can have more control. If conntrack itself
> can also leverage it and avoid having tons of entries, even better, as
> then we have both behaviors in sync.

IIUC you propose a mechanism for avoiding CT processing of packets matching a certain mask (e.g.
based on the dst udp port), configured by the admin and enforced by act_ct or even conntrack itself.

If so, this seems like a fundamental change to nf conntrack, requiring it to add packet
classification engines.

> 
> Thoughts?
> 

I wonder if we should develop a generic mechanism to optimize CT software for a use case that is
faulty by design.
This has limited value for software, as it would only reduce the conntrack table size (packet
classification is still required).
However, this feature may have a big impact on hardware offload.
Normally hardware offload relies on software to handle new connections, causing all new connections
to be processed by software.
With this patch the hardware may autonomously set the +new connection state for the relevant
connections.
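The table chain discussed earlier in the thread (CT table miss -> +trk+new table -> slow path) can be sketched as a toy classifier (assumption: hypothetical names and dict-based packets, nothing like the real mlx5 steering tables):

```python
# Toy model of the miss chain described in the series: a miss on the CT
# (established) table falls through to a +trk+new table that matches
# only the default registered tunnel dst ports; a miss there goes to
# the slow path (software).
TUNNEL_DPORTS = {4789, 4791, 6081}  # default vxlan, RoCE v2, geneve ports

def classify(pkt: dict) -> str:
    if pkt.get("ct_established"):        # CT table: stateful +est flows
        return "ct-table"
    if pkt.get("ip_proto") == "udp" and pkt.get("dport") in TUNNEL_DPORTS:
        return "trk-new-table"           # offloaded +trk+new tunnel rule
    return "slow-path"                   # everything else hits software

print(classify({"ip_proto": "udp", "dport": 4789}))
```

The point of contention is the middle table: hardware sets +new for anything hitting those dst ports, without software ever seeing the packet.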



>>
>>>>
>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>> The change in the model is that a miss on the CT table will be forwarded
>>>>>>>>> to a new +trk+new ct table and a miss there will be forwarded to
>>>>>>>>> the slow
>>>>>>>>> path table.
>>>>>>>>
>>>>>>>> AFAICU this new +trk+new ct table is a wildcard match on sport with
>>>>>>>> specific dports. Also AFAICU, such entries will not be visible to the
>>>>>>>> userspace then. Is this right?
>>>>>>>>
>>>>>>>>       Marcelo
>>>>>>>>
>>>>>>>
>>>>>>> right.
>>>>>
>>>>> Thanks,
>>>>> Marcelo
>>>>>
>>>>
>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-20 16:09                   ` Oz Shlomo
@ 2021-01-22  1:18                     ` Pablo Neira Ayuso
  2021-01-22  2:16                       ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 30+ messages in thread
From: Pablo Neira Ayuso @ 2021-01-22  1:18 UTC (permalink / raw)
  To: Oz Shlomo
  Cc: Marcelo Ricardo Leitner, Roi Dayan, Saeed Mahameed,
	David S. Miller, Jakub Kicinski, netdev, Paul Blakey,
	Saeed Mahameed

Hi Oz,

On Wed, Jan 20, 2021 at 06:09:48PM +0200, Oz Shlomo wrote:
> On 1/14/2021 11:50 PM, Marcelo Ricardo Leitner wrote:
> > 
> > Thoughts?
> > 
> 
> I wonder if we should develop a generic mechanism to optimize CT software
> for a use case that is faulty by design.
> This has limited value for software as it would only reduce the conntrack
> table size (packet classification is still required).
> However, this feature may have a big impact on hardware offload.
> Normally hardware offload relies on software to handle new connections.
> Causing all new connections to be processed by software.
> With this patch the hardware may autonomously set the +new connection state
> for the relevant connections.

Could you fix this issue with unidirectional flows by checking for the
IPS_CONFIRMED status bit? The idea is to hardware offload the entry
after the first packet goes through software successfully. Then, there
is no need to wait for the established state, which requires seeing
traffic in both directions.


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-22  1:18                     ` Pablo Neira Ayuso
@ 2021-01-22  2:16                       ` Marcelo Ricardo Leitner
  2021-01-25  9:15                         ` Oz Shlomo
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Ricardo Leitner @ 2021-01-22  2:16 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Oz Shlomo, Roi Dayan, Saeed Mahameed, David S. Miller,
	Jakub Kicinski, netdev, Paul Blakey, Saeed Mahameed

On Fri, Jan 22, 2021 at 02:18:34AM +0100, Pablo Neira Ayuso wrote:
> Hi Oz,
> 
> On Wed, Jan 20, 2021 at 06:09:48PM +0200, Oz Shlomo wrote:
> > On 1/14/2021 11:50 PM, Marcelo Ricardo Leitner wrote:
> > > 
> > > Thoughts?
> > > 
> > 
> > I wonder if we should develop a generic mechanism to optimize CT software
> > for a use case that is faulty by design.
> > This has limited value for software as it would only reduce the conntrack
> > table size (packet classification is still required).
> > However, this feature may have a big impact on hardware offload.
> > Normally hardware offload relies on software to handle new connections.
> > Causing all new connections to be processed by software.
> > With this patch the hardware may autonomously set the +new connection state
> > for the relevant connections.
> 
> Could you fix this issue with unidirectional flows by checking for
> IPS_CONFIRMED status bit? The idea is to hardware offload the entry
> after the first packet goes through software successfully. Then, there
> is no need to wait for the established state that requires to see
> traffic in both directions.

That's an interesting idea. This way, basically all that needs to be
changed is tcf_ct_flow_table_process_conn(), to handle this new
condition for UDP packets, and tcf_ct_act().
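The changed offload criterion can be modeled in a few lines (assumption: a simplified sketch of the decision only, not the actual act_ct code):

```python
# Sketch of the offload criterion discussed here: for UDP, offload once
# the conntrack entry is confirmed (i.e. the first packet went through
# software), instead of waiting for the established state, which needs
# traffic in both directions.
def should_offload(proto: str, confirmed: bool, established: bool) -> bool:
    if proto == "udp":
        return confirmed       # unidirectional UDP flows qualify
    return established         # other protocols still wait for +est

print(should_offload("udp", confirmed=True, established=False))
```

This keeps software and hardware behavior in sync: every flow is still created and confirmed by conntrack, and only then handed to the hardware.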

It has a small performance penalty compared to the original
solution, as now the first packet(s) go to sw, but it looks like a good
compromise between supporting a (from what I could understand)
somewhat lazy flow design (as I still think these didn't need to go
through conntrack), a uniform system behavior (with and without
offload, with mlx5 or another driver) and a more generic approach.
Other situations that rely on unidirectional UDP flows will benefit
from it as well.

This way I even think it doesn't need to be configurable right now.
If anything, it will be easy to add a knob to switch back to the old
behavior later on, if needed.

  Marcelo


* Re: [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules
  2021-01-22  2:16                       ` Marcelo Ricardo Leitner
@ 2021-01-25  9:15                         ` Oz Shlomo
  0 siblings, 0 replies; 30+ messages in thread
From: Oz Shlomo @ 2021-01-25  9:15 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Pablo Neira Ayuso
  Cc: Roi Dayan, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	netdev, Paul Blakey, Saeed Mahameed



On 1/22/2021 4:16 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Jan 22, 2021 at 02:18:34AM +0100, Pablo Neira Ayuso wrote:
>> Hi Oz,
>>
>> On Wed, Jan 20, 2021 at 06:09:48PM +0200, Oz Shlomo wrote:
>>> On 1/14/2021 11:50 PM, Marcelo Ricardo Leitner wrote:
>>>>
>>>> Thoughts?
>>>>
>>>
>>> I wonder if we should develop a generic mechanism to optimize CT software
>>> for a use case that is faulty by design.
>>> This has limited value for software as it would only reduce the conntrack
>>> table size (packet classification is still required).
>>> However, this feature may have a big impact on hardware offload.
>>> Normally hardware offload relies on software to handle new connections.
>>> Causing all new connections to be processed by software.
>>> With this patch the hardware may autonomously set the +new connection state
>>> for the relevant connections.
>>
>> Could you fix this issue with unidirectional flows by checking for
>> IPS_CONFIRMED status bit? The idea is to hardware offload the entry
>> after the first packet goes through software successfully. Then, there
>> is no need to wait for the established state that requires to see
>> traffic in both directions.
> 
> That's an interesting idea. This way, basically all that needs to be
> changed is tcf_ct_flow_table_process_conn() to handle this new
> condition for UDP packets and on tcf_ct_act().

Will act_ct need to maintain a port list and classify the packet to determine whether the udp packet
is part of a unidirectional or bidirectional udp connection?


> 
> It has a small performance penaulty if compared to the original
> solution, as now the first packet(s) goes to sw, but looks like a good
> compromise between supporting a (from what I could understand)
> somewhat lazy flow design (as I still think these didn't need to go
> through conntrack), an uniform system behavior (with and without
> offload, with mlx5 or another driver) and a more generic approach.
> Other situations that rely on unidirectional UDP flows will benefit
> from it as well.

The hardware offload perspective is a bit different.
With this approach the system will offload a rule per connection instead of offloading one mega-flow
rule on the dst udp port.
This will increase the hardware scale requirements in terms of the number of offloaded rules.
In addition, a counter will need to be instantiated per rule, and the software will need to manage
the aging of these connections.

We hoped that the hardware could fully offload this scenario, avoiding the need for sw processing at
all.
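The scale difference is easy to quantify with back-of-envelope numbers (assumption: the connection count is purely illustrative):

```python
# Mega-flow approach: one rule per registered tunnel dst port.
# Per-connection approach: one rule, one counter, and one aging entry
# per offloaded UDP connection.
tunnel_dports = 3               # default vxlan, RoCE, geneve ports
connections = 100_000           # hypothetical concurrent tunnel flows

megaflow_rules = tunnel_dports
per_conn_rules = connections
per_conn_counters = connections  # one HW counter instantiated per rule

print(megaflow_rules, per_conn_rules, per_conn_counters)
```

So the IPS_CONFIRMED approach trades a handful of static rules for hardware rule, counter and aging state that grows linearly with the number of connections.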


> 
> This way I even think it doesn't need to be configurable right now.
> It will be easier to add a knob to switch back to the old behavior if
> needed later on, if anything.
> 
>    Marcelo
> 


end of thread, other threads:[~2021-01-26  4:48 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-08  5:30 [pull request][net-next 00/15] mlx5 updates 2021-01-07 Saeed Mahameed
2021-01-08  5:30 ` [net-next 01/15] net/mlx5: Add HW definition of reg_c_preserve Saeed Mahameed
2021-01-08  5:30 ` [net-next 02/15] net/mlx5e: Simplify condition on esw_vport_enable_qos() Saeed Mahameed
2021-01-08  5:30 ` [net-next 03/15] net/mlx5: E-Switch, use new cap as condition for mpls over udp Saeed Mahameed
2021-01-08  5:30 ` [net-next 04/15] net/mlx5e: E-Switch, Offload all chain 0 priorities when modify header and forward action is not supported Saeed Mahameed
2021-01-08  5:30 ` [net-next 05/15] net/mlx5e: CT: Pass null instead of zero spec Saeed Mahameed
2021-01-08  5:30 ` [net-next 06/15] net/mlx5e: Remove redundant initialization to null Saeed Mahameed
2021-01-08  5:30 ` [net-next 07/15] net/mlx5e: CT: Remove redundant usage of zone mask Saeed Mahameed
2021-01-08  5:30 ` [net-next 08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules Saeed Mahameed
2021-01-08 21:48   ` Marcelo Ricardo Leitner
2021-01-10  7:45     ` Roi Dayan
2021-01-10  7:52       ` Roi Dayan
2021-01-11 23:51         ` Marcelo Ricardo Leitner
2021-01-12  9:27           ` Oz Shlomo
2021-01-14 13:02             ` Marcelo Ricardo Leitner
2021-01-14 14:03               ` Oz Shlomo
2021-01-14 21:50                 ` Marcelo Ricardo Leitner
2021-01-20 16:09                   ` Oz Shlomo
2021-01-22  1:18                     ` Pablo Neira Ayuso
2021-01-22  2:16                       ` Marcelo Ricardo Leitner
2021-01-25  9:15                         ` Oz Shlomo
2021-01-08  5:30 ` [net-next 09/15] net/mlx5e: CT: Support offload of " Saeed Mahameed
2021-01-08 21:59   ` Marcelo Ricardo Leitner
2021-01-10  7:55     ` Roi Dayan
2021-01-08  5:30 ` [net-next 10/15] net/mlx5e: CT: Add support for mirroring Saeed Mahameed
2021-01-08  5:30 ` [net-next 11/15] net/mlx5e: CT, Avoid false lock depenency warning Saeed Mahameed
2021-01-08  5:30 ` [net-next 12/15] net/mlx5e: IPsec, Enclose csum logic under ipsec config Saeed Mahameed
2021-01-08  5:30 ` [net-next 13/15] net/mlx5e: IPsec, Avoid unreachable return Saeed Mahameed
2021-01-08  5:30 ` [net-next 14/15] net/mlx5e: IPsec, Inline feature_check fast-path function Saeed Mahameed
2021-01-08  5:30 ` [net-next 15/15] net/mlx5e: IPsec, Remove unnecessary config flag usage Saeed Mahameed
