* [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
@ 2017-08-30 23:04 Saeed Mahameed
  2017-08-30 23:04 ` [net-next 1/3] net/mlx5e: Use IP version matching to classify IP traffic Saeed Mahameed
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Saeed Mahameed @ 2017-08-30 23:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Saeed Mahameed

Hi Dave,

The following changes provide GRE tunnel offloads for the mlx5 ethernet netdevice driver.

For more details please see tag log message below.
Please pull and let me know if there's any problem.

Note: this series doesn't conflict with the ongoing net mlx5 submission.

Thanks,
Saeed.

---

The following changes since commit 90774a93ef075b39e55d31fe56fc286d71a046ac:

  bpf: test_maps: fix typos, "conenct" and "listeen" (2017-08-30 15:32:16 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-GRE-Offload

for you to fetch changes up to 7b3722fa9ef647eb1ae6a60a5d46f7c67ab09a33:

  net/mlx5e: Support RSS for GRE tunneled packets (2017-08-31 01:54:15 +0300)

----------------------------------------------------------------
mlx5-updates-2017-08-31 (GRE Offloads support)

This series provides support for MPLS RSS, and for GRE TX offloads and
GRE RSS.

The first patch, from Gal and Ariel, adds mlx5 driver support for the
ConnectX capability to perform IP version identification and matching in
order to distinguish between IPv4 and IPv6 without the need to specify the
encapsulation type, thus performing RSS on MPLS traffic automatically
without specifying the MPLS ethertype. This patch also serves inner GRE
IPv4/6 classification for inner GRE RSS.

The second patch, from Gal, adds TX offload support for GRE tunneled
packets by reporting the needed netdev features.

The third patch, from Gal, adds GRE inner RSS support by creating the needed
device resources (steering tables/rules and traffic classifiers) to match GRE
traffic and perform RSS hashing on the inner headers.

Improvement:
Testing 8 TCP streams bandwidth over GRE:
    System: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    Before: 21.3 Gbps (Single RQ)
    Now   : 90.5 Gbps (RSS spread on 8 RQs)

Thanks,
Saeed.

----------------------------------------------------------------
Gal Pressman (3):
      net/mlx5e: Use IP version matching to classify IP traffic
      net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels
      net/mlx5e: Support RSS for GRE tunneled packets

 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  18 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  11 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    | 281 ++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 108 ++++++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |   4 +-
 include/linux/mlx5/mlx5_ifc.h                      |   2 +-
 6 files changed, 384 insertions(+), 40 deletions(-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [net-next 1/3] net/mlx5e: Use IP version matching to classify IP traffic
  2017-08-30 23:04 [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads Saeed Mahameed
@ 2017-08-30 23:04 ` Saeed Mahameed
  2017-08-30 23:04 ` [net-next 2/3] net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels Saeed Mahameed
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2017-08-30 23:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Gal Pressman, Ariel Levkovich, Saeed Mahameed

From: Gal Pressman <galp@mellanox.com>

This change adds the ability for flow steering to classify IPv4/6
packets with MPLS tag (Ethertype 0x8847 and 0x8848) as standard IP
packets and hit IPv4/6 classification steering rules.

Since IP packets with MPLS tag header have MPLS ethertype, they
missed the IPv4/6 ethertype rule and ended up hitting the default
filter forwarding all the packets to the same single RQ (No RSS).

Since our device is able to look past the MPLS tag and identify the
next protocol, we introduce this solution, which replaces ethertype
matching with the device's capability to perform IP version
identification and matching in order to distinguish between IPv4 and
IPv6.
Therefore, when the driver performs flow steering configuration on the
device, it uses IP version matching in IP classified rules instead of
ethertype matching, which causes relevant MPLS tagged packets to hit
these rules as well.

If the device doesn't support IP version matching, the driver falls back
to legacy ethertype matching in the steering as before.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c | 33 ++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index eecbc6d4f51f..85e6226dacfb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -660,6 +660,17 @@ static struct {
 	},
 };
 
+static u8 mlx5e_etype_to_ipv(u16 ethertype)
+{
+	if (ethertype == ETH_P_IP)
+		return 4;
+
+	if (ethertype == ETH_P_IPV6)
+		return 6;
+
+	return 0;
+}
+
 static struct mlx5_flow_handle *
 mlx5e_generate_ttc_rule(struct mlx5e_priv *priv,
 			struct mlx5_flow_table *ft,
@@ -667,10 +678,12 @@ mlx5e_generate_ttc_rule(struct mlx5e_priv *priv,
 			u16 etype,
 			u8 proto)
 {
+	int match_ipv_outer = MLX5_CAP_FLOWTABLE_NIC_RX(priv->mdev, ft_field_support.outer_ip_version);
 	MLX5_DECLARE_FLOW_ACT(flow_act);
 	struct mlx5_flow_handle *rule;
 	struct mlx5_flow_spec *spec;
 	int err = 0;
+	u8 ipv;
 
 	spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
 	if (!spec)
@@ -681,7 +694,13 @@ mlx5e_generate_ttc_rule(struct mlx5e_priv *priv,
 		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.ip_protocol);
 		MLX5_SET(fte_match_param, spec->match_value, outer_headers.ip_protocol, proto);
 	}
-	if (etype) {
+
+	ipv = mlx5e_etype_to_ipv(etype);
+	if (match_ipv_outer && ipv) {
+		spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
+		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.ip_version);
+		MLX5_SET(fte_match_param, spec->match_value, outer_headers.ip_version, ipv);
+	} else if (etype) {
 		spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
 		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.ethertype);
 		MLX5_SET(fte_match_param, spec->match_value, outer_headers.ethertype, etype);
@@ -739,7 +758,9 @@ static int mlx5e_generate_ttc_table_rules(struct mlx5e_priv *priv)
 #define MLX5E_TTC_TABLE_SIZE	(MLX5E_TTC_GROUP1_SIZE +\
 				 MLX5E_TTC_GROUP2_SIZE +\
 				 MLX5E_TTC_GROUP3_SIZE)
-static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc)
+
+static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc,
+					 bool use_ipv)
 {
 	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
 	struct mlx5e_flow_table *ft = &ttc->ft;
@@ -761,7 +782,10 @@ static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc)
 	/* L4 Group */
 	mc = MLX5_ADDR_OF(create_flow_group_in, in, match_criteria);
 	MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ip_protocol);
-	MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ethertype);
+	if (use_ipv)
+		MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ip_version);
+	else
+		MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ethertype);
 	MLX5_SET_CFG(in, match_criteria_enable, MLX5_MATCH_OUTER_HEADERS);
 	MLX5_SET_CFG(in, start_flow_index, ix);
 	ix += MLX5E_TTC_GROUP1_SIZE;
@@ -812,6 +836,7 @@ void mlx5e_destroy_ttc_table(struct mlx5e_priv *priv)
 
 int mlx5e_create_ttc_table(struct mlx5e_priv *priv)
 {
+	bool match_ipv_outer = MLX5_CAP_FLOWTABLE_NIC_RX(priv->mdev, ft_field_support.outer_ip_version);
 	struct mlx5e_ttc_table *ttc = &priv->fs.ttc;
 	struct mlx5_flow_table_attr ft_attr = {};
 	struct mlx5e_flow_table *ft = &ttc->ft;
@@ -828,7 +853,7 @@ int mlx5e_create_ttc_table(struct mlx5e_priv *priv)
 		return err;
 	}
 
-	err = mlx5e_create_ttc_table_groups(ttc);
+	err = mlx5e_create_ttc_table_groups(ttc, match_ipv_outer);
 	if (err)
 		goto err;
 
-- 
2.13.0


* [net-next 2/3] net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels
  2017-08-30 23:04 [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads Saeed Mahameed
  2017-08-30 23:04 ` [net-next 1/3] net/mlx5e: Use IP version matching to classify IP traffic Saeed Mahameed
@ 2017-08-30 23:04 ` Saeed Mahameed
  2017-08-30 23:04 ` [net-next 3/3] net/mlx5e: Support RSS for GRE tunneled packets Saeed Mahameed
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2017-08-30 23:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Gal Pressman, Saeed Mahameed

From: Gal Pressman <galp@mellanox.com>

Add TX offloads support for GRE tunneled packets by reporting the needed
netdev features.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 51 +++++++++++++++--------
 include/linux/mlx5/mlx5_ifc.h                     |  2 +-
 2 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index fdc2b92f020b..9475fb89a744 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3499,13 +3499,13 @@ static void mlx5e_del_vxlan_port(struct net_device *netdev,
 	mlx5e_vxlan_queue_work(priv, ti->sa_family, be16_to_cpu(ti->port), 0);
 }
 
-static netdev_features_t mlx5e_vxlan_features_check(struct mlx5e_priv *priv,
-						    struct sk_buff *skb,
-						    netdev_features_t features)
+static netdev_features_t mlx5e_tunnel_features_check(struct mlx5e_priv *priv,
+						     struct sk_buff *skb,
+						     netdev_features_t features)
 {
 	struct udphdr *udph;
-	u16 proto;
-	u16 port = 0;
+	u8 proto;
+	u16 port;
 
 	switch (vlan_get_protocol(skb)) {
 	case htons(ETH_P_IP):
@@ -3518,14 +3518,17 @@ static netdev_features_t mlx5e_vxlan_features_check(struct mlx5e_priv *priv,
 		goto out;
 	}
 
-	if (proto == IPPROTO_UDP) {
+	switch (proto) {
+	case IPPROTO_GRE:
+		return features;
+	case IPPROTO_UDP:
 		udph = udp_hdr(skb);
 		port = be16_to_cpu(udph->dest);
-	}
 
-	/* Verify if UDP port is being offloaded by HW */
-	if (port && mlx5e_vxlan_lookup_port(priv, port))
-		return features;
+		/* Verify if UDP port is being offloaded by HW */
+		if (mlx5e_vxlan_lookup_port(priv, port))
+			return features;
+	}
 
 out:
 	/* Disable CSUM and GSO if the udp dport is not offloaded by HW */
@@ -3549,7 +3552,7 @@ static netdev_features_t mlx5e_features_check(struct sk_buff *skb,
 	/* Validate if the tunneled packet is being offloaded by HW */
 	if (skb->encapsulation &&
 	    (features & NETIF_F_CSUM_MASK || features & NETIF_F_GSO_MASK))
-		return mlx5e_vxlan_features_check(priv, skb, features);
+		return mlx5e_tunnel_features_check(priv, skb, features);
 
 	return features;
 }
@@ -4014,20 +4017,32 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	netdev->hw_features      |= NETIF_F_HW_VLAN_CTAG_RX;
 	netdev->hw_features      |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
-	if (mlx5e_vxlan_allowed(mdev)) {
-		netdev->hw_features     |= NETIF_F_GSO_UDP_TUNNEL |
-					   NETIF_F_GSO_UDP_TUNNEL_CSUM |
-					   NETIF_F_GSO_PARTIAL;
+	if (mlx5e_vxlan_allowed(mdev) || MLX5_CAP_ETH(mdev, tunnel_stateless_gre)) {
+		netdev->hw_features     |= NETIF_F_GSO_PARTIAL;
 		netdev->hw_enc_features |= NETIF_F_IP_CSUM;
 		netdev->hw_enc_features |= NETIF_F_IPV6_CSUM;
 		netdev->hw_enc_features |= NETIF_F_TSO;
 		netdev->hw_enc_features |= NETIF_F_TSO6;
-		netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL;
-		netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM |
-					   NETIF_F_GSO_PARTIAL;
+		netdev->hw_enc_features |= NETIF_F_GSO_PARTIAL;
+	}
+
+	if (mlx5e_vxlan_allowed(mdev)) {
+		netdev->hw_features     |= NETIF_F_GSO_UDP_TUNNEL |
+					   NETIF_F_GSO_UDP_TUNNEL_CSUM;
+		netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL |
+					   NETIF_F_GSO_UDP_TUNNEL_CSUM;
 		netdev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
 	}
 
+	if (MLX5_CAP_ETH(mdev, tunnel_stateless_gre)) {
+		netdev->hw_features     |= NETIF_F_GSO_GRE |
+					   NETIF_F_GSO_GRE_CSUM;
+		netdev->hw_enc_features |= NETIF_F_GSO_GRE |
+					   NETIF_F_GSO_GRE_CSUM;
+		netdev->gso_partial_features |= NETIF_F_GSO_GRE |
+						NETIF_F_GSO_GRE_CSUM;
+	}
+
 	mlx5_query_port_fcs(mdev, &fcs_supported, &fcs_enabled);
 
 	if (fcs_supported)
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index ae7d09b9c52f..3d5d32e5446c 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -602,7 +602,7 @@ struct mlx5_ifc_per_protocol_networking_offload_caps_bits {
 	u8         reserved_at_1a[0x1];
 	u8         tunnel_lso_const_out_ip_id[0x1];
 	u8         reserved_at_1c[0x2];
-	u8         tunnel_statless_gre[0x1];
+	u8         tunnel_stateless_gre[0x1];
 	u8         tunnel_stateless_vxlan[0x1];
 
 	u8         swp[0x1];
-- 
2.13.0


* [net-next 3/3] net/mlx5e: Support RSS for GRE tunneled packets
  2017-08-30 23:04 [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads Saeed Mahameed
  2017-08-30 23:04 ` [net-next 1/3] net/mlx5e: Use IP version matching to classify IP traffic Saeed Mahameed
  2017-08-30 23:04 ` [net-next 2/3] net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels Saeed Mahameed
@ 2017-08-30 23:04 ` Saeed Mahameed
  2017-08-31  5:15 ` [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads David Miller
  2017-08-31 13:51 ` Hannes Frederic Sowa
  4 siblings, 0 replies; 30+ messages in thread
From: Saeed Mahameed @ 2017-08-30 23:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Gal Pressman, Saeed Mahameed

From: Gal Pressman <galp@mellanox.com>

Introduce a new flow table and indirect TIRs which are used to hash the
inner packet headers of GRE tunneled packets.

When a GRE tunneled packet is received, the TTC flow table will match
the new IPv4/6->GRE rules which will forward it to the inner TTC table.
The inner TTC is similar to its counterpart outer TTC table, but
matching the inner packet headers instead of the outer ones (and does
not include the new IPv4/6->GRE rules).
The new rules will not add steering hops since they are added to an
already existing flow group which will be matched regardless of this
patch. Non-GRE traffic will not be affected.

The inner flow table will forward the packet to inner indirect TIRs
which hash the inner packet and thus result in RSS for the tunneled
packets.

Testing 8 TCP streams bandwidth over GRE:
System: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Before: 21.3 Gbps (Single RQ)
Now   : 90.5 Gbps (RSS spread on 8 RQs)

Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  18 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  11 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    | 248 ++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  57 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |   4 +-
 5 files changed, 321 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 0039b4725405..a31912415264 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -620,6 +620,12 @@ enum mlx5e_traffic_types {
 	MLX5E_NUM_INDIR_TIRS = MLX5E_TT_ANY,
 };
 
+enum mlx5e_tunnel_types {
+	MLX5E_TT_IPV4_GRE,
+	MLX5E_TT_IPV6_GRE,
+	MLX5E_NUM_TUNNEL_TT,
+};
+
 enum {
 	MLX5E_STATE_ASYNC_EVENTS_ENABLED,
 	MLX5E_STATE_OPENED,
@@ -679,6 +685,7 @@ struct mlx5e_l2_table {
 struct mlx5e_ttc_table {
 	struct mlx5e_flow_table  ft;
 	struct mlx5_flow_handle	 *rules[MLX5E_NUM_TT];
+	struct mlx5_flow_handle  *tunnel_rules[MLX5E_NUM_TUNNEL_TT];
 };
 
 #define ARFS_HASH_SHIFT BITS_PER_BYTE
@@ -711,6 +718,7 @@ enum {
 	MLX5E_VLAN_FT_LEVEL = 0,
 	MLX5E_L2_FT_LEVEL,
 	MLX5E_TTC_FT_LEVEL,
+	MLX5E_INNER_TTC_FT_LEVEL,
 	MLX5E_ARFS_FT_LEVEL
 };
 
@@ -736,6 +744,7 @@ struct mlx5e_flow_steering {
 	struct mlx5e_vlan_table         vlan;
 	struct mlx5e_l2_table           l2;
 	struct mlx5e_ttc_table          ttc;
+	struct mlx5e_ttc_table          inner_ttc;
 	struct mlx5e_arfs_tables        arfs;
 };
 
@@ -769,6 +778,7 @@ struct mlx5e_priv {
 	u32                        tisn[MLX5E_MAX_NUM_TC];
 	struct mlx5e_rqt           indir_rqt;
 	struct mlx5e_tir           indir_tir[MLX5E_NUM_INDIR_TIRS];
+	struct mlx5e_tir           inner_indir_tir[MLX5E_NUM_INDIR_TIRS];
 	struct mlx5e_tir           direct_tir[MLX5E_MAX_NUM_CHANNELS];
 	u32                        tx_rates[MLX5E_MAX_NUM_SQS];
 	int                        hard_mtu;
@@ -903,7 +913,7 @@ int mlx5e_redirect_rqt(struct mlx5e_priv *priv, u32 rqtn, int sz,
 		       struct mlx5e_redirect_rqt_param rrp);
 void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_params *params,
 				    enum mlx5e_traffic_types tt,
-				    void *tirc);
+				    void *tirc, bool inner);
 
 int mlx5e_open_locked(struct net_device *netdev);
 int mlx5e_close_locked(struct net_device *netdev);
@@ -932,6 +942,12 @@ void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params,
 void mlx5e_set_rq_type_params(struct mlx5_core_dev *mdev,
 			      struct mlx5e_params *params, u8 rq_type);
 
+static inline bool mlx5e_tunnel_inner_ft_supported(struct mlx5_core_dev *mdev)
+{
+	return (MLX5_CAP_ETH(mdev, tunnel_stateless_gre) &&
+		MLX5_CAP_FLOWTABLE_NIC_RX(mdev, ft_field_support.inner_ip_version));
+}
+
 static inline
 struct mlx5e_tx_wqe *mlx5e_post_nop(struct mlx5_wq_cyc *wq, u32 sqn, u16 *pc)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 0dd7e9caf150..c6ec90e9c95b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1212,9 +1212,18 @@ static void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen)
 
 	for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
 		memset(tirc, 0, ctxlen);
-		mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc);
+		mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc, false);
 		mlx5_core_modify_tir(mdev, priv->indir_tir[tt].tirn, in, inlen);
 	}
+
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return;
+
+	for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
+		memset(tirc, 0, ctxlen);
+		mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc, true);
+		mlx5_core_modify_tir(mdev, priv->inner_indir_tir[tt].tirn, in, inlen);
+	}
 }
 
 static int mlx5e_set_rxfh(struct net_device *dev, const u32 *indir,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index 85e6226dacfb..f11fd07ac4dd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -608,12 +608,21 @@ static void mlx5e_cleanup_ttc_rules(struct mlx5e_ttc_table *ttc)
 			ttc->rules[i] = NULL;
 		}
 	}
+
+	for (i = 0; i < MLX5E_NUM_TUNNEL_TT; i++) {
+		if (!IS_ERR_OR_NULL(ttc->tunnel_rules[i])) {
+			mlx5_del_flow_rules(ttc->tunnel_rules[i]);
+			ttc->tunnel_rules[i] = NULL;
+		}
+	}
 }
 
-static struct {
+struct mlx5e_etype_proto {
 	u16 etype;
 	u8 proto;
-} ttc_rules[] = {
+};
+
+static struct mlx5e_etype_proto ttc_rules[] = {
 	[MLX5E_TT_IPV4_TCP] = {
 		.etype = ETH_P_IP,
 		.proto = IPPROTO_TCP,
@@ -660,6 +669,17 @@ static struct {
 	},
 };
 
+static struct mlx5e_etype_proto ttc_tunnel_rules[] = {
+	[MLX5E_TT_IPV4_GRE] = {
+		.etype = ETH_P_IP,
+		.proto = IPPROTO_GRE,
+	},
+	[MLX5E_TT_IPV6_GRE] = {
+		.etype = ETH_P_IPV6,
+		.proto = IPPROTO_GRE,
+	},
+};
+
 static u8 mlx5e_etype_to_ipv(u16 ethertype)
 {
 	if (ethertype == ETH_P_IP)
@@ -742,6 +762,20 @@ static int mlx5e_generate_ttc_table_rules(struct mlx5e_priv *priv)
 			goto del_rules;
 	}
 
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return 0;
+
+	rules     = ttc->tunnel_rules;
+	dest.type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
+	dest.ft   = priv->fs.inner_ttc.ft.t;
+	for (tt = 0; tt < MLX5E_NUM_TUNNEL_TT; tt++) {
+		rules[tt] = mlx5e_generate_ttc_rule(priv, ft, &dest,
+						    ttc_tunnel_rules[tt].etype,
+						    ttc_tunnel_rules[tt].proto);
+		if (IS_ERR(rules[tt]))
+			goto del_rules;
+	}
+
 	return 0;
 
 del_rules:
@@ -752,13 +786,21 @@ static int mlx5e_generate_ttc_table_rules(struct mlx5e_priv *priv)
 }
 
 #define MLX5E_TTC_NUM_GROUPS	3
-#define MLX5E_TTC_GROUP1_SIZE	BIT(3)
-#define MLX5E_TTC_GROUP2_SIZE	BIT(1)
-#define MLX5E_TTC_GROUP3_SIZE	BIT(0)
+#define MLX5E_TTC_GROUP1_SIZE	(BIT(3) + MLX5E_NUM_TUNNEL_TT)
+#define MLX5E_TTC_GROUP2_SIZE	 BIT(1)
+#define MLX5E_TTC_GROUP3_SIZE	 BIT(0)
 #define MLX5E_TTC_TABLE_SIZE	(MLX5E_TTC_GROUP1_SIZE +\
 				 MLX5E_TTC_GROUP2_SIZE +\
 				 MLX5E_TTC_GROUP3_SIZE)
 
+#define MLX5E_INNER_TTC_NUM_GROUPS	3
+#define MLX5E_INNER_TTC_GROUP1_SIZE	BIT(3)
+#define MLX5E_INNER_TTC_GROUP2_SIZE	BIT(1)
+#define MLX5E_INNER_TTC_GROUP3_SIZE	BIT(0)
+#define MLX5E_INNER_TTC_TABLE_SIZE	(MLX5E_INNER_TTC_GROUP1_SIZE +\
+					 MLX5E_INNER_TTC_GROUP2_SIZE +\
+					 MLX5E_INNER_TTC_GROUP3_SIZE)
+
 static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc,
 					 bool use_ipv)
 {
@@ -826,6 +868,190 @@ static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc,
 	return err;
 }
 
+static struct mlx5_flow_handle *
+mlx5e_generate_inner_ttc_rule(struct mlx5e_priv *priv,
+			      struct mlx5_flow_table *ft,
+			      struct mlx5_flow_destination *dest,
+			      u16 etype, u8 proto)
+{
+	MLX5_DECLARE_FLOW_ACT(flow_act);
+	struct mlx5_flow_handle *rule;
+	struct mlx5_flow_spec *spec;
+	int err = 0;
+	u8 ipv;
+
+	spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
+	if (!spec)
+		return ERR_PTR(-ENOMEM);
+
+	ipv = mlx5e_etype_to_ipv(etype);
+	if (etype && ipv) {
+		spec->match_criteria_enable = MLX5_MATCH_INNER_HEADERS;
+		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, inner_headers.ip_version);
+		MLX5_SET(fte_match_param, spec->match_value, inner_headers.ip_version, ipv);
+	}
+
+	if (proto) {
+		spec->match_criteria_enable = MLX5_MATCH_INNER_HEADERS;
+		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, inner_headers.ip_protocol);
+		MLX5_SET(fte_match_param, spec->match_value, inner_headers.ip_protocol, proto);
+	}
+
+	rule = mlx5_add_flow_rules(ft, spec, &flow_act, dest, 1);
+	if (IS_ERR(rule)) {
+		err = PTR_ERR(rule);
+		netdev_err(priv->netdev, "%s: add rule failed\n", __func__);
+	}
+
+	kvfree(spec);
+	return err ? ERR_PTR(err) : rule;
+}
+
+static int mlx5e_generate_inner_ttc_table_rules(struct mlx5e_priv *priv)
+{
+	struct mlx5_flow_destination dest;
+	struct mlx5_flow_handle **rules;
+	struct mlx5e_ttc_table *ttc;
+	struct mlx5_flow_table *ft;
+	int err;
+	int tt;
+
+	ttc =  &priv->fs.inner_ttc;
+	ft = ttc->ft.t;
+	rules = ttc->rules;
+
+	dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
+	for (tt = 0; tt < MLX5E_NUM_TT; tt++) {
+		if (tt == MLX5E_TT_ANY)
+			dest.tir_num = priv->direct_tir[0].tirn;
+		else
+			dest.tir_num = priv->inner_indir_tir[tt].tirn;
+
+		rules[tt] = mlx5e_generate_inner_ttc_rule(priv, ft, &dest,
+							  ttc_rules[tt].etype,
+							  ttc_rules[tt].proto);
+		if (IS_ERR(rules[tt]))
+			goto del_rules;
+	}
+
+	return 0;
+
+del_rules:
+	err = PTR_ERR(rules[tt]);
+	rules[tt] = NULL;
+	mlx5e_cleanup_ttc_rules(ttc);
+	return err;
+}
+
+static int mlx5e_create_inner_ttc_table_groups(struct mlx5e_ttc_table *ttc)
+{
+	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+	struct mlx5e_flow_table *ft = &ttc->ft;
+	int ix = 0;
+	u32 *in;
+	int err;
+	u8 *mc;
+
+	ft->g = kcalloc(MLX5E_INNER_TTC_NUM_GROUPS, sizeof(*ft->g), GFP_KERNEL);
+	if (!ft->g)
+		return -ENOMEM;
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in) {
+		kfree(ft->g);
+		return -ENOMEM;
+	}
+
+	/* L4 Group */
+	mc = MLX5_ADDR_OF(create_flow_group_in, in, match_criteria);
+	MLX5_SET_TO_ONES(fte_match_param, mc, inner_headers.ip_protocol);
+	MLX5_SET_TO_ONES(fte_match_param, mc, inner_headers.ip_version);
+	MLX5_SET_CFG(in, match_criteria_enable, MLX5_MATCH_INNER_HEADERS);
+	MLX5_SET_CFG(in, start_flow_index, ix);
+	ix += MLX5E_INNER_TTC_GROUP1_SIZE;
+	MLX5_SET_CFG(in, end_flow_index, ix - 1);
+	ft->g[ft->num_groups] = mlx5_create_flow_group(ft->t, in);
+	if (IS_ERR(ft->g[ft->num_groups]))
+		goto err;
+	ft->num_groups++;
+
+	/* L3 Group */
+	MLX5_SET(fte_match_param, mc, inner_headers.ip_protocol, 0);
+	MLX5_SET_CFG(in, start_flow_index, ix);
+	ix += MLX5E_INNER_TTC_GROUP2_SIZE;
+	MLX5_SET_CFG(in, end_flow_index, ix - 1);
+	ft->g[ft->num_groups] = mlx5_create_flow_group(ft->t, in);
+	if (IS_ERR(ft->g[ft->num_groups]))
+		goto err;
+	ft->num_groups++;
+
+	/* Any Group */
+	memset(in, 0, inlen);
+	MLX5_SET_CFG(in, start_flow_index, ix);
+	ix += MLX5E_INNER_TTC_GROUP3_SIZE;
+	MLX5_SET_CFG(in, end_flow_index, ix - 1);
+	ft->g[ft->num_groups] = mlx5_create_flow_group(ft->t, in);
+	if (IS_ERR(ft->g[ft->num_groups]))
+		goto err;
+	ft->num_groups++;
+
+	kvfree(in);
+	return 0;
+
+err:
+	err = PTR_ERR(ft->g[ft->num_groups]);
+	ft->g[ft->num_groups] = NULL;
+	kvfree(in);
+
+	return err;
+}
+
+static int mlx5e_create_inner_ttc_table(struct mlx5e_priv *priv)
+{
+	struct mlx5e_ttc_table *ttc = &priv->fs.inner_ttc;
+	struct mlx5_flow_table_attr ft_attr = {};
+	struct mlx5e_flow_table *ft = &ttc->ft;
+	int err;
+
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return 0;
+
+	ft_attr.max_fte = MLX5E_INNER_TTC_TABLE_SIZE;
+	ft_attr.level   = MLX5E_INNER_TTC_FT_LEVEL;
+	ft_attr.prio    = MLX5E_NIC_PRIO;
+
+	ft->t = mlx5_create_flow_table(priv->fs.ns, &ft_attr);
+	if (IS_ERR(ft->t)) {
+		err = PTR_ERR(ft->t);
+		ft->t = NULL;
+		return err;
+	}
+
+	err = mlx5e_create_inner_ttc_table_groups(ttc);
+	if (err)
+		goto err;
+
+	err = mlx5e_generate_inner_ttc_table_rules(priv);
+	if (err)
+		goto err;
+
+	return 0;
+
+err:
+	mlx5e_destroy_flow_table(ft);
+	return err;
+}
+
+static void mlx5e_destroy_inner_ttc_table(struct mlx5e_priv *priv)
+{
+	struct mlx5e_ttc_table *ttc = &priv->fs.inner_ttc;
+
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return;
+
+	mlx5e_cleanup_ttc_rules(ttc);
+	mlx5e_destroy_flow_table(&ttc->ft);
+}
+
 void mlx5e_destroy_ttc_table(struct mlx5e_priv *priv)
 {
 	struct mlx5e_ttc_table *ttc = &priv->fs.ttc;
@@ -1179,11 +1405,18 @@ int mlx5e_create_flow_steering(struct mlx5e_priv *priv)
 		priv->netdev->hw_features &= ~NETIF_F_NTUPLE;
 	}
 
+	err = mlx5e_create_inner_ttc_table(priv);
+	if (err) {
+		netdev_err(priv->netdev, "Failed to create inner ttc table, err=%d\n",
+			   err);
+		goto err_destroy_arfs_tables;
+	}
+
 	err = mlx5e_create_ttc_table(priv);
 	if (err) {
 		netdev_err(priv->netdev, "Failed to create ttc table, err=%d\n",
 			   err);
-		goto err_destroy_arfs_tables;
+		goto err_destroy_inner_ttc_table;
 	}
 
 	err = mlx5e_create_l2_table(priv);
@@ -1208,6 +1441,8 @@ int mlx5e_create_flow_steering(struct mlx5e_priv *priv)
 	mlx5e_destroy_l2_table(priv);
 err_destroy_ttc_table:
 	mlx5e_destroy_ttc_table(priv);
+err_destroy_inner_ttc_table:
+	mlx5e_destroy_inner_ttc_table(priv);
 err_destroy_arfs_tables:
 	mlx5e_arfs_destroy_tables(priv);
 
@@ -1219,6 +1454,7 @@ void mlx5e_destroy_flow_steering(struct mlx5e_priv *priv)
 	mlx5e_destroy_vlan_table(priv);
 	mlx5e_destroy_l2_table(priv);
 	mlx5e_destroy_ttc_table(priv);
+	mlx5e_destroy_inner_ttc_table(priv);
 	mlx5e_arfs_destroy_tables(priv);
 	mlx5e_ethtool_cleanup_steering(priv);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9475fb89a744..111c7523d448 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2349,9 +2349,10 @@ static void mlx5e_build_tir_ctx_lro(struct mlx5e_params *params, void *tirc)
 
 void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_params *params,
 				    enum mlx5e_traffic_types tt,
-				    void *tirc)
+				    void *tirc, bool inner)
 {
-	void *hfso = MLX5_ADDR_OF(tirc, tirc, rx_hash_field_selector_outer);
+	void *hfso = inner ? MLX5_ADDR_OF(tirc, tirc, rx_hash_field_selector_inner) :
+			     MLX5_ADDR_OF(tirc, tirc, rx_hash_field_selector_outer);
 
 #define MLX5_HASH_IP            (MLX5_HASH_FIELD_SEL_SRC_IP   |\
 				 MLX5_HASH_FIELD_SEL_DST_IP)
@@ -2500,6 +2501,21 @@ static int mlx5e_modify_tirs_lro(struct mlx5e_priv *priv)
 	return err;
 }
 
+static void mlx5e_build_inner_indir_tir_ctx(struct mlx5e_priv *priv,
+					    enum mlx5e_traffic_types tt,
+					    u32 *tirc)
+{
+	MLX5_SET(tirc, tirc, transport_domain, priv->mdev->mlx5e_res.td.tdn);
+
+	mlx5e_build_tir_ctx_lro(&priv->channels.params, tirc);
+
+	MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_INDIRECT);
+	MLX5_SET(tirc, tirc, indirect_table, priv->indir_rqt.rqtn);
+	MLX5_SET(tirc, tirc, tunneled_offload_en, 0x1);
+
+	mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc, true);
+}
+
 static int mlx5e_set_mtu(struct mlx5e_priv *priv, u16 mtu)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
@@ -2865,7 +2881,7 @@ static void mlx5e_build_indir_tir_ctx(struct mlx5e_priv *priv,
 
 	MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_INDIRECT);
 	MLX5_SET(tirc, tirc, indirect_table, priv->indir_rqt.rqtn);
-	mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc);
+	mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc, false);
 }
 
 static void mlx5e_build_direct_tir_ctx(struct mlx5e_priv *priv, u32 rqtn, u32 *tirc)
@@ -2884,6 +2900,7 @@ int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv)
 	struct mlx5e_tir *tir;
 	void *tirc;
 	int inlen;
+	int i = 0;
 	int err;
 	u32 *in;
 	int tt;
@@ -2899,16 +2916,36 @@ int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv)
 		tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
 		mlx5e_build_indir_tir_ctx(priv, tt, tirc);
 		err = mlx5e_create_tir(priv->mdev, tir, in, inlen);
-		if (err)
-			goto err_destroy_tirs;
+		if (err) {
+			mlx5_core_warn(priv->mdev, "create indirect tirs failed, %d\n", err);
+			goto err_destroy_inner_tirs;
+		}
 	}
 
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		goto out;
+
+	for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++) {
+		memset(in, 0, inlen);
+		tir = &priv->inner_indir_tir[i];
+		tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
+		mlx5e_build_inner_indir_tir_ctx(priv, i, tirc);
+		err = mlx5e_create_tir(priv->mdev, tir, in, inlen);
+		if (err) {
+			mlx5_core_warn(priv->mdev, "create inner indirect tirs failed, %d\n", err);
+			goto err_destroy_inner_tirs;
+		}
+	}
+
+out:
 	kvfree(in);
 
 	return 0;
 
-err_destroy_tirs:
-	mlx5_core_warn(priv->mdev, "create indirect tirs failed, %d\n", err);
+err_destroy_inner_tirs:
+	for (i--; i >= 0; i--)
+		mlx5e_destroy_tir(priv->mdev, &priv->inner_indir_tir[i]);
+
 	for (tt--; tt >= 0; tt--)
 		mlx5e_destroy_tir(priv->mdev, &priv->indir_tir[tt]);
 
@@ -2962,6 +2999,12 @@ void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv)
 
 	for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++)
 		mlx5e_destroy_tir(priv->mdev, &priv->indir_tir[i]);
+
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return;
+
+	for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++)
+		mlx5e_destroy_tir(priv->mdev, &priv->inner_indir_tir[i]);
 }
 
 void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index d731d57a996a..5a7bea688ec8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -83,8 +83,8 @@
 #define ETHTOOL_PRIO_NUM_LEVELS 1
 #define ETHTOOL_NUM_PRIOS 11
 #define ETHTOOL_MIN_LEVEL (KERNEL_MIN_LEVEL + ETHTOOL_NUM_PRIOS)
-/* Vlan, mac, ttc, aRFS */
-#define KERNEL_NIC_PRIO_NUM_LEVELS 4
+/* Vlan, mac, ttc, inner ttc, aRFS */
+#define KERNEL_NIC_PRIO_NUM_LEVELS 5
 #define KERNEL_NIC_NUM_PRIOS 1
 /* One more level for tc */
 #define KERNEL_MIN_LEVEL (KERNEL_NIC_PRIO_NUM_LEVELS + 1)
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-08-30 23:04 [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads Saeed Mahameed
                   ` (2 preceding siblings ...)
  2017-08-30 23:04 ` [net-next 3/3] net/mlx5e: Support RSS for GRE tunneled packets Saeed Mahameed
@ 2017-08-31  5:15 ` David Miller
  2017-08-31 13:51 ` Hannes Frederic Sowa
  4 siblings, 0 replies; 30+ messages in thread
From: David Miller @ 2017-08-31  5:15 UTC (permalink / raw)
  To: saeedm; +Cc: netdev

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Thu, 31 Aug 2017 02:04:06 +0300

> The following changes provide GRE tunnel offloads for mlx5 ethernet netdevice driver.
> 
> For more details please see tag log message below.
> Please pull and let me know if there's any problem.
> 
> Note: this series doesn't conflict with the ongoing net mlx5 submission.

Looks good, pulled.


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-08-30 23:04 [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads Saeed Mahameed
                   ` (3 preceding siblings ...)
  2017-08-31  5:15 ` [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads David Miller
@ 2017-08-31 13:51 ` Hannes Frederic Sowa
  2017-09-02 23:01   ` Saeed Mahameed
  4 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2017-08-31 13:51 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: David S. Miller, netdev

Saeed Mahameed <saeedm@mellanox.com> writes:

> The first patch from Gal and Ariel provides the mlx5 driver support for
> ConnectX capability to perform IP version identification and matching in
> order to distinguish between IPv4 and IPv6 without the need to specify the
> encapsulation type, thus perform RSS in MPLS automatically without
> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
> classification for inner GRE RSS.

I don't think this is legal at all, or did I misunderstand something?

<https://tools.ietf.org/html/rfc3032#section-2.2>

Thanks,
Hannes


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-08-31 13:51 ` Hannes Frederic Sowa
@ 2017-09-02 23:01   ` Saeed Mahameed
  2017-09-03  1:32     ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Saeed Mahameed @ 2017-09-02 23:01 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Saeed Mahameed, David S. Miller, Linux Netdev List

On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Saeed Mahameed <saeedm@mellanox.com> writes:
>
>> The first patch from Gal and Ariel provides the mlx5 driver support for
>> ConnectX capability to perform IP version identification and matching in
>> order to distinguish between IPv4 and IPv6 without the need to specify the
>> encapsulation type, thus perform RSS in MPLS automatically without
>> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>> classification for inner GRE RSS.
>
> I don't think this is legal at all or did I misunderstood something?
>
> <https://tools.ietf.org/html/rfc3032#section-2.2>

It seems you misunderstood the cover letter.  The HW will still
identify MPLS (IPv4/IPv6) packets, using a new bit we specify in the HW
steering rules rather than adding new specific rules with {MPLS
ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic: the same
functionality, but a better and more general way to approach it.
Bottom line, the hardware is capable of processing MPLS headers and
performing RSS on the inner packet (IPv4/6) without the need for the
driver to provide precise MPLS steering rules.

>
> Thanks,
> Hannes


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-02 23:01   ` Saeed Mahameed
@ 2017-09-03  1:32     ` Hannes Frederic Sowa
  2017-09-03  1:37       ` Tom Herbert
  2017-09-03  4:07       ` Saeed Mahameed
  0 siblings, 2 replies; 30+ messages in thread
From: Hannes Frederic Sowa @ 2017-09-03  1:32 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: Saeed Mahameed, David S. Miller, Linux Netdev List

Hi Saeed,

On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > Saeed Mahameed <saeedm@mellanox.com> writes:
> >
> >> The first patch from Gal and Ariel provides the mlx5 driver support for
> >> ConnectX capability to perform IP version identification and matching in
> >> order to distinguish between IPv4 and IPv6 without the need to specify the
> >> encapsulation type, thus perform RSS in MPLS automatically without
> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
> >> classification for inner GRE RSS.
> >
> > I don't think this is legal at all or did I misunderstood something?
> >
> > <https://tools.ietf.org/html/rfc3032#section-2.2>
> 
> It seems you misunderstood the cover letter.  The HW will still
> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
> steering rules rather than adding new specific rules with  {MPLS
> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
> functionality a better and general way to approach it.
> Bottom line the hardware is capable of processing MPLS headers and
> perform RSS on the inner packet (IPv4/6) without the need of the
> driver to provide precise steering MPLS rules.

Sorry, I think I am still confused.

I just want to make sure that you don't use the first nibble after the
MPLS bottom-of-stack label in any way as an indicator of whether that
is an IPv4 or IPv6 packet by default. It can be anything. The
forwarding equivalence class tells the stack which protocol you see.

If you match on the first nibble behind the MPLS bottom-of-stack
label, the '4' or '6' respectively could be part of a MAC address with
its first nibble being 4 or 6, because the particular pseudowire is
EoMPLS and uses no control word.

I wanted to mention it because, with the addition of e.g. VPLS, this
could cause problems down the road and should at least be
controllable. It is probably better to use Entropy Labels in the future.

Thanks,
Hannes


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-03  1:32     ` Hannes Frederic Sowa
@ 2017-09-03  1:37       ` Tom Herbert
  2017-09-03  4:11         ` Saeed Mahameed
  2017-09-03  4:07       ` Saeed Mahameed
  1 sibling, 1 reply; 30+ messages in thread
From: Tom Herbert @ 2017-09-03  1:37 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Saeed,
>
> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>> >
>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>> >> ConnectX capability to perform IP version identification and matching in
>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>> >> encapsulation type, thus perform RSS in MPLS automatically without
>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>> >> classification for inner GRE RSS.
>> >
>> > I don't think this is legal at all or did I misunderstood something?
>> >
>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>
>> It seems you misunderstood the cover letter.  The HW will still
>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>> steering rules rather than adding new specific rules with  {MPLS
>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>> functionality a better and general way to approach it.
>> Bottom line the hardware is capable of processing MPLS headers and
>> perform RSS on the inner packet (IPv4/6) without the need of the
>> driver to provide precise steering MPLS rules.
>
> Sorry, I think I am still confused.
>
> I just want to make sure that you don't use the first nibble after the
> mpls bottom of stack label in any way as an indicator if that is an IPv4
> or IPv6 packet by default. It can be anything. The forward equivalence
> class tells the stack which protocol you see.
>
> If you match on the first nibble behind the MPLS bottom of stack label
> the '4' or '6' respectively could be part of a MAC address with its
> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
> and uses no control world.
>
> I wanted to mention it, because with addition of e.g. VPLS this could
> cause problems down the road and should at least be controllable? It is
> probably better to use Entropy Labels in future.
>
Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you
prefer) then all this protocol specific DPI for RSS just goes away ;-)

Tom

> Thanks,
> Hannes


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-03  1:32     ` Hannes Frederic Sowa
  2017-09-03  1:37       ` Tom Herbert
@ 2017-09-03  4:07       ` Saeed Mahameed
  2017-09-04  9:37         ` Hannes Frederic Sowa
  1 sibling, 1 reply; 30+ messages in thread
From: Saeed Mahameed @ 2017-09-03  4:07 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Saeed Mahameed, David S. Miller, Linux Netdev List

On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Saeed,
>
> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>> >
>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>> >> ConnectX capability to perform IP version identification and matching in
>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>> >> encapsulation type, thus perform RSS in MPLS automatically without
>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>> >> classification for inner GRE RSS.
>> >
>> > I don't think this is legal at all or did I misunderstood something?
>> >
>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>
>> It seems you misunderstood the cover letter.  The HW will still
>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>> steering rules rather than adding new specific rules with  {MPLS
>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>> functionality a better and general way to approach it.
>> Bottom line the hardware is capable of processing MPLS headers and
>> perform RSS on the inner packet (IPv4/6) without the need of the
>> driver to provide precise steering MPLS rules.
>
> Sorry, I think I am still confused.
>
> I just want to make sure that you don't use the first nibble after the
> mpls bottom of stack label in any way as an indicator if that is an IPv4
> or IPv6 packet by default. It can be anything. The forward equivalence
> class tells the stack which protocol you see.
>
> If you match on the first nibble behind the MPLS bottom of stack label
> the '4' or '6' respectively could be part of a MAC address with its
> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
> and uses no control world.
>
> I wanted to mention it, because with addition of e.g. VPLS this could
> cause problems down the road and should at least be controllable? It is
> probably better to use Entropy Labels in future.
>

Hi Hannes,

I see your concern now, but it still has nothing to do with the
driver: the whole change only simplifies the driver code so that it
doesn't push full-blown matching steering rules into the HW, and
simply replaces them with a one-bit command.

Regarding your concern, I will need to check with the HW guys and
review the processing algorithm that identifies IP packets over MPLS,
and will get back to you.

If there really is a problem, then yes, we might need to make it
controllable.

> Thanks,
> Hannes


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-03  1:37       ` Tom Herbert
@ 2017-09-03  4:11         ` Saeed Mahameed
  2017-09-03 15:43           ` Tom Herbert
  0 siblings, 1 reply; 30+ messages in thread
From: Saeed Mahameed @ 2017-09-03  4:11 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Hannes Frederic Sowa, Saeed Mahameed, David S. Miller, Linux Netdev List

On Sat, Sep 2, 2017 at 6:37 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Hi Saeed,
>>
>> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>>> <hannes@stressinduktion.org> wrote:
>>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>>> >
>>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>>> >> ConnectX capability to perform IP version identification and matching in
>>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>>> >> encapsulation type, thus perform RSS in MPLS automatically without
>>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>>> >> classification for inner GRE RSS.
>>> >
>>> > I don't think this is legal at all or did I misunderstood something?
>>> >
>>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>>
>>> It seems you misunderstood the cover letter.  The HW will still
>>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>>> steering rules rather than adding new specific rules with  {MPLS
>>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>>> functionality a better and general way to approach it.
>>> Bottom line the hardware is capable of processing MPLS headers and
>>> perform RSS on the inner packet (IPv4/6) without the need of the
>>> driver to provide precise steering MPLS rules.
>>
>> Sorry, I think I am still confused.
>>
>> I just want to make sure that you don't use the first nibble after the
>> mpls bottom of stack label in any way as an indicator if that is an IPv4
>> or IPv6 packet by default. It can be anything. The forward equivalence
>> class tells the stack which protocol you see.
>>
>> If you match on the first nibble behind the MPLS bottom of stack label
>> the '4' or '6' respectively could be part of a MAC address with its
>> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
>> and uses no control world.
>>
>> I wanted to mention it, because with addition of e.g. VPLS this could
>> cause problems down the road and should at least be controllable? It is
>> probably better to use Entropy Labels in future.
>>
> Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you
> prefer) then all this protocol specific DPI for RSS just goes away ;-)

Hi Tom,

How does MPLS/UDP or GRE/UDP RSS work without protocol-specific DPI?
Unlike VXLAN, those protocols are not over UDP, and you can't just
play with the outer header's UDP source port, or can you?

Can you elaborate ?

Thanks,
Saeed.

>
> Tom
>
>> Thanks,
>> Hannes


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-03  4:11         ` Saeed Mahameed
@ 2017-09-03 15:43           ` Tom Herbert
  2017-09-03 16:17             ` Or Gerlitz
  2017-09-04 13:50             ` Hannes Frederic Sowa
  0 siblings, 2 replies; 30+ messages in thread
From: Tom Herbert @ 2017-09-03 15:43 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Hannes Frederic Sowa, Saeed Mahameed, David S. Miller, Linux Netdev List

On Sat, Sep 2, 2017 at 9:11 PM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
> On Sat, Sep 2, 2017 at 6:37 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>>> Hi Saeed,
>>>
>>> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>>>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>>>> <hannes@stressinduktion.org> wrote:
>>>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>>>> >
>>>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>>>> >> ConnectX capability to perform IP version identification and matching in
>>>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>>>> >> encapsulation type, thus perform RSS in MPLS automatically without
>>>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>>>> >> classification for inner GRE RSS.
>>>> >
>>>> > I don't think this is legal at all or did I misunderstood something?
>>>> >
>>>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>>>
>>>> It seems you misunderstood the cover letter.  The HW will still
>>>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>>>> steering rules rather than adding new specific rules with  {MPLS
>>>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>>>> functionality a better and general way to approach it.
>>>> Bottom line the hardware is capable of processing MPLS headers and
>>>> perform RSS on the inner packet (IPv4/6) without the need of the
>>>> driver to provide precise steering MPLS rules.
>>>
>>> Sorry, I think I am still confused.
>>>
>>> I just want to make sure that you don't use the first nibble after the
>>> mpls bottom of stack label in any way as an indicator if that is an IPv4
>>> or IPv6 packet by default. It can be anything. The forward equivalence
>>> class tells the stack which protocol you see.
>>>
>>> If you match on the first nibble behind the MPLS bottom of stack label
>>> the '4' or '6' respectively could be part of a MAC address with its
>>> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
>>> and uses no control world.
>>>
>>> I wanted to mention it, because with addition of e.g. VPLS this could
>>> cause problems down the road and should at least be controllable? It is
>>> probably better to use Entropy Labels in future.
>>>
>> Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you
>> prefer) then all this protocol specific DPI for RSS just goes away ;-)
>
> Hi Tom,
>
> How does MPLS/UDP or GRE/UDP RSS works without protocol specific DPI ?
> unlike vxlan those protocols are not over UDP and you can't just play
> with the outer header udp src port, or do you ?
>
> Can you elaborate ?
>
Hi Saeed,

An encapsulator sets the UDP source port to be the flow entropy of the
packet being encapsulated. So when the packet traverses the network
devices can base their hash just on the canonical 5-tuple which is
sufficient for ECMP and RSS. IPv6 flow label is even better since the
middleboxes don't even need to look at the transport header, a packet
is steered based on the 3-tuple of addresses and flow label. In the
Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
the source port. Flow label is similarly set by ip6_make_flowlabel.
Both of these functions use the skb->hash which is computed by calling
flow dissector at most once per packet (if the packet was received
with an L4 HW hash or locally originated on a connection the hash does
not need to be computed).

Please look at https://people.netfilter.org/pablo/netdev0.1/papers/UDP-Encapsulation-in-Linux.pdf
as well as Davem's "Less is More" presentation which highlights the
virtues of protocol generic HW mechanisms
(https://www.youtube.com/watch?v=6VgmazGwL_Y).

Tom


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-03 15:43           ` Tom Herbert
@ 2017-09-03 16:17             ` Or Gerlitz
  2017-09-03 16:45               ` Tom Herbert
  2017-09-04 13:50             ` Hannes Frederic Sowa
  1 sibling, 1 reply; 30+ messages in thread
From: Or Gerlitz @ 2017-09-03 16:17 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Hannes Frederic Sowa, Saeed Mahameed,
	David S. Miller, Linux Netdev List

On Sun, Sep 3, 2017 at 6:43 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Sat, Sep 2, 2017 at 9:11 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> On Sat, Sep 2, 2017 at 6:37 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
>>> <hannes@stressinduktion.org> wrote:
>>>> Hi Saeed,
>>>>
>>>> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>>>>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>>>>> <hannes@stressinduktion.org> wrote:
>>>>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>>>>> >
>>>>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>>>>> >> ConnectX capability to perform IP version identification and matching in
>>>>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>>>>> >> encapsulation type, thus perform RSS in MPLS automatically without
>>>>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>>>>> >> classification for inner GRE RSS.
>>>>> >
>>>>> > I don't think this is legal at all or did I misunderstood something?
>>>>> >
>>>>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>>>>
>>>>> It seems you misunderstood the cover letter.  The HW will still
>>>>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>>>>> steering rules rather than adding new specific rules with  {MPLS
>>>>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>>>>> functionality a better and general way to approach it.
>>>>> Bottom line the hardware is capable of processing MPLS headers and
>>>>> perform RSS on the inner packet (IPv4/6) without the need of the
>>>>> driver to provide precise steering MPLS rules.
>>>>
>>>> Sorry, I think I am still confused.
>>>>
>>>> I just want to make sure that you don't use the first nibble after the
>>>> mpls bottom of stack label in any way as an indicator if that is an IPv4
>>>> or IPv6 packet by default. It can be anything. The forward equivalence
>>>> class tells the stack which protocol you see.
>>>>
>>>> If you match on the first nibble behind the MPLS bottom of stack label
>>>> the '4' or '6' respectively could be part of a MAC address with its
>>>> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
>>>> and uses no control world.
>>>>
>>>> I wanted to mention it, because with addition of e.g. VPLS this could
>>>> cause problems down the road and should at least be controllable? It is
>>>> probably better to use Entropy Labels in future.
>>>>
>>> Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you
>>> prefer) then all this protocol specific DPI for RSS just goes away ;-)

>> How does MPLS/UDP or GRE/UDP RSS works without protocol specific DPI ?
>> unlike vxlan those protocols are not over UDP and you can't just play
>> with the outer header udp src port, or do you ?
>> Can you elaborate ?

> An encapsulator sets the UDP source port to be the flow entropy of the
> packet being encapsulated. So when the packet traverses the network
> devices can base their hash just on the canonical 5-tuple which is
> sufficient for ECMP and RSS. IPv6 flow label is even better since the
> middleboxes don't even need to look at the transport header, a packet
> is steered based on the 3-tuple of addresses and flow label. In the
> Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
> the source port. Flow label is similarly set by ip6_make_flowlabel.
> Both of these functions use the skb->hash which is computed by calling
> flow dissector at most once per packet (if the packet was received
> with an L4 HW hash or locally originated on a connection the hash does
> not need to be computed).

Hi Tom,

Re all sorts of udp encap, sure, we're all on the less-is-more thing and just
RSS-ing on the ip+udp encap header.

For GRE, I was trying to push back on RSS-ing on the inner headers,
but as Saeed commented, we didn't see something simple through which
the HW can do the spreading. To make sure I follow, you are saying
that if this is gre6 tunneling
net-next.git]# git grep -p ip6_make_flowlabel net/ include/linux/ include/net/
include/net/ipv6.h=static inline void iph_to_flow_copy_v6addrs(struct flow_keys *flow,
include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
include/net/ipv6.h=static inline void ip6_set_txhash(struct sock *sk) { }
include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
net/ipv6/ip6_gre.c=static int ip6gre_header(struct sk_buff *skb, struct net_device *dev,
net/ipv6/ip6_gre.c:                  ip6_make_flowlabel(dev_net(dev), skb,
net/ipv6/ip6_output.c=int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
net/ipv6/ip6_output.c:  ip6_flow_hdr(hdr, tclass, ip6_make_flowlabel(net, skb, fl6->flowlabel,
net/ipv6/ip6_output.c=struct sk_buff *__ip6_make_skb(struct sock *sk,
net/ipv6/ip6_output.c:               ip6_make_flowlabel(net, skb, fl6->flowlabel,
net/ipv6/ip6_tunnel.c=int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
net/ipv6/ip6_tunnel.c:               ip6_make_flowlabel(net, skb, fl6->flowlabel, true, fl6));

the sender side (ip6_tnl_xmit?) will set the IPv6 flow label in a
manner similar to what udp_flow_src_port does, and if the receiver
side hashes on the L3 IPv6 src/dst/flow label we'll get spreading? Nice!

Still, what do we do with IPv4 GRE tunnels? and what do we do with HW
which isn't capable to RSS on flow label?

> Please look at https://people.netfilter.org/pablo/netdev0.1/papers/UDP-Encapsulation-in-Linux.pdf
> as well as Davem's "Less is More" presentation which highlights the
> virtues of protocol generic HW mechanisms
> (https://www.youtube.com/watch?v=6VgmazGwL_Y).


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-03 16:17             ` Or Gerlitz
@ 2017-09-03 16:45               ` Tom Herbert
  2017-09-03 18:58                 ` Or Gerlitz
  0 siblings, 1 reply; 30+ messages in thread
From: Tom Herbert @ 2017-09-03 16:45 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Saeed Mahameed, Hannes Frederic Sowa, Saeed Mahameed,
	David S. Miller, Linux Netdev List

> Re all sorts of udp encap, sure, we're all on the less-is-more thing and just
> RSS-ing on the ip+udp encap header.
>
> For GRE, I was trying to fight back that rss-ing on inner, but as
> Saeed commented,
> we didn't see something simple through which the HW can do spreading. To make
> sure I follow, you are saying that if this is gre6 tunneling
>
It's pretty common that HW does this since GRE is in widespread use for a while.

> net-next.git]# git grep -p ip6_make_flowlabel net/ include/linux/ include/net/
> include/net/ipv6.h=static inline void iph_to_flow_copy_v6addrs(struct
> flow_keys *flow,
> include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net
> *net, struct sk_buff *skb,
> include/net/ipv6.h=static inline void ip6_set_txhash(struct sock *sk) { }
> include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net
> *net, struct sk_buff *skb,
> net/ipv6/ip6_gre.c=static int ip6gre_header(struct sk_buff *skb,
> struct net_device *dev,
> net/ipv6/ip6_gre.c:                  ip6_make_flowlabel(dev_net(dev), skb,
> net/ipv6/ip6_output.c=int ip6_xmit(const struct sock *sk, struct
> sk_buff *skb, struct flowi6 *fl6,
> net/ipv6/ip6_output.c:  ip6_flow_hdr(hdr, tclass,
> ip6_make_flowlabel(net, skb, fl6->flowlabel,
> net/ipv6/ip6_output.c=struct sk_buff *__ip6_make_skb(struct sock *sk,
> net/ipv6/ip6_output.c:               ip6_make_flowlabel(net, skb,
> fl6->flowlabel,
> net/ipv6/ip6_tunnel.c=int ip6_tnl_xmit(struct sk_buff *skb, struct
> net_device *dev, __u8 dsfield,
> net/ipv6/ip6_tunnel.c:               ip6_make_flowlabel(net, skb,
> fl6->flowlabel, true, fl6));
>
Seems right.

> the sender side (ip6_tnl_xmit?) will set the IPv6 flow label in a
> similar manner done by udp_flow_src_port? and
> if the receiver side hashes on L3 IPv6 src/dst/flow label we'll get
> spreading? nice!
>
As long as the network devices support it.

> Still, what do we do with IPv4 GRE tunnels? and what do we do with HW
> which isn't capable to RSS on flow label?
>
Throw it out and buy hardware that supports flow label! ;-)

Seriously though, flow labels are the only reasonable way that RSS can
be supported in IPv6. If a device tries to do DPI on IPv6 then it will
eventually need to be able to parse some number of extension headers,
which, unlike IPv4 options, are unbounded in size. So there are going
to be circumstances in which a device either doesn't understand an EH,
or the size of the EHs blows its parsing buffer limit, so it can't do
the DPI. IMO, telling host implementations that we're not allowed to
use extension headers because middleboxes can't support them is
unacceptable...

Btw, these same arguments apply as to why CHECKSUM_COMPLETE is the
only reasonable way to handle receive checksums in IPv6.

Tom


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-03 16:45               ` Tom Herbert
@ 2017-09-03 18:58                 ` Or Gerlitz
  0 siblings, 0 replies; 30+ messages in thread
From: Or Gerlitz @ 2017-09-03 18:58 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Hannes Frederic Sowa, Saeed Mahameed,
	David S. Miller, Linux Netdev List

On Sun, Sep 3, 2017 at 7:45 PM, Tom Herbert <tom@herbertland.com> wrote:
>> Re all sorts of udp encap, sure, we're all on the less-is-more thing and just
>> RSS-ing on the ip+udp encap header.

>> For GRE, I was trying to fight back that rss-ing on inner, but as
>> Saeed commented,
>> we didn't see something simple through which the HW can do spreading. To
>> make sure I follow, you are saying that if this is gre6 tunneling

> It's pretty common that HW does this since GRE is in widespread use for a while.

ECMP? I guess so


>> the sender side (ip6_tnl_xmit?) will set the IPv6 flow label in a
>> similar manner done by udp_flow_src_port? and
>> if the receiver side hashes on L3 IPv6 src/dst/flow label we'll get
>> spreading? nice!

> As long as the network devices support it.

yeah, hopefully upcoming devices will support the thing

>> Still, what do we do with IPv4 GRE tunnels? and what do we do with HW
>> which isn't capable to RSS on flow label?

> Throw it out and buy hardware that supports flow label! ;-)

still, we came across IPv4 GRE customer environments

> Seriously though, flow labels are the only reasonable way that RSS can
> be supported in IPv6. If a device tries to do DPI on IPv6 then they'll
> eventually need to be able to parse some number of extension
> headers which unlike IPv4 is unbounded in size. So there are going to
> be circumstances in which a device either doesn't understand an EH, or
> the size of EHs blows its parsing buffer limit so it can't do the DPI.
> IMO, telling host implementations that we're not allowed to use
> extension headers because middleboxes can't support them is
> unacceptable...

makes sense to me

> Btw, these same arguments apply as to why CHECKSUM_COMPLETE is the
> only reasonable way to handle receive checksums in IPv6.

supported on mlx5

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-03  4:07       ` Saeed Mahameed
@ 2017-09-04  9:37         ` Hannes Frederic Sowa
  0 siblings, 0 replies; 30+ messages in thread
From: Hannes Frederic Sowa @ 2017-09-04  9:37 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: Saeed Mahameed, David S. Miller, Linux Netdev List

Hello,

Saeed Mahameed <saeedm@dev.mellanox.co.il> writes:

[...]

> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Sorry, I think I am still confused.
>>
>> I just want to make sure that you don't use the first nibble after the
>> mpls bottom of stack label in any way as an indicator if that is an IPv4
>> or IPv6 packet by default. It can be anything. The forward equivalence
>> class tells the stack which protocol you see.
>>
>> If you match on the first nibble behind the MPLS bottom of stack label
>> the '4' or '6' respectively could be part of a MAC address with its
>> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
>> and uses no control world.
>>
>> I wanted to mention it, because with addition of e.g. VPLS this could
>> cause problems down the road and should at least be controllable? It is
>> probably better to use Entropy Labels in future.
>>
>
> Hi Hannes,
>
> I see your concern now, but still it has nothing to do with the
> driver, the whole change is only to simplify driver code to not push
> full blown matching steering rules into the HW, and simply replace it
> with a one bit command.

Sorry, I very much got that impression from the cover letter plus the
descriptions of the patches.

> Regarding your concern, I will need to check with the HW guys and
> review the processing algorithm that identifies IP packets over MPLS,
> and will get back to you.
>
> if there is really a problem, then yes, we might need to make it
> controllable ..

Thanks for looking into this. I do think it can be added as a feature
but I very much think such logic should be disabled by default.

Thanks,
Hannes

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-03 15:43           ` Tom Herbert
  2017-09-03 16:17             ` Or Gerlitz
@ 2017-09-04 13:50             ` Hannes Frederic Sowa
  2017-09-04 16:15               ` Tom Herbert
  1 sibling, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2017-09-04 13:50 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

Tom Herbert <tom@herbertland.com> writes:

> An encapsulator sets the UDP source port to be the flow entropy of the
> packet being encapsulated. So when the packet traverses the network
> devices can base their hash just on the canonical 5-tuple which is
> sufficient for ECMP and RSS. IPv6 flow label is even better since the
> middleboxes don't even need to look at the transport header, a packet
> is steered based on the 3-tuple of addresses and flow label. In the
> Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
> the source port. Flow label is similarly set by ip6_make_flowlabel.
> Both of these functions use the skb->hash which is computed by calling
> flow dissector at most once per packet (if the packet was received
> with an L4 HW hash or locally originated on a connection the hash does
> not need to be computed).

This would require the MPLS stack copying the flowlabel of IPv6
connections between MPLS routers over their whole lifetime in the MPLS
network. The same would hold for MPLS encapsulated inside UDP, the
source port needs to be kept constant. This is very impractical. The
hash for the flow label can often not be recomputed by interim routers,
because they might lack the knowledge of the upper layer protocol.

UDP source port entropy still has the problem that we don't respect the
source port as RSS entropy by default in network cards, because of
possible fragmentation and thus possible reordering of packets. GRE does
not have this problem and is way easier to identify by hardware.

Basically we need to tell network cards that they can use specific
destination UDP ports where we allow the source port to be used in RSS
hash calculation. I don't see how this is any easier than just using GRE
with a defined protocol field? I do like the combination of ipv6
flowlabel + GRE.

Btw. people are using the GRE Key as additional entropy without looking
into the GRE payload.

Bye,
Hannes

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-04 13:50             ` Hannes Frederic Sowa
@ 2017-09-04 16:15               ` Tom Herbert
  2017-09-04 16:52                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Tom Herbert @ 2017-09-04 16:15 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

On Mon, Sep 4, 2017 at 6:50 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Tom Herbert <tom@herbertland.com> writes:
>
>> An encapsulator sets the UDP source port to be the flow entropy of the
>> packet being encapsulated. So when the packet traverses the network
>> devices can base their hash just on the canonical 5-tuple which is
>> sufficient for ECMP and RSS. IPv6 flow label is even better since the
>> middleboxes don't even need to look at the transport header, a packet
>> is steered based on the 3-tuple of addresses and flow label. In the
>> Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
>> the source port. Flow label is similarly set by ip6_make_flowlabel.
>> Both of these functions use the skb->hash which is computed by calling
>> flow dissector at most once per packet (if the packet was received
>> with an L4 HW hash or locally originated on a connection the hash does
>> not need to be computed).
>
> This would require the MPLS stack copying the flowlabel of IPv6
> connections between MPLS routers over their whole lifetime in the MPLS
> network. The same would hold for MPLS encapsulated inside UDP, the
> source port needs to be kept constant. This is very impractical. The
> hash for the flow label can often not be recomputed by interim routers,
> because they might lack the knowledge of the upper layer protocol.
>
Hannes,

When the flow label is set the packet will traverse the network and be
ECMP routed regardless of whether the payload is MPLS or anything
else-- the important characteristic is that network devices don't need
to know how to parse MPLS (or GRE, or IPIP, or L2TP, ESP, or ...) to
provide good ECMP. At a source the flow label or UDP source port needs
to be generated. That can be based on DPI, derived from the MPLS
entropy label, use SPI in ESP, etc. I don't see anything special about
MPLS in this regard.

> UDP source port entropy still has the problem that we don't respect the
> source port as RSS entropy by default in network cards, because of
> possible fragmentation and thus possible reordering of packets. GRE does
> not have this problem and is way easier to identify by hardware.
>
> Basically we need to tell network cards that they can use specific
> destination UDP ports where we allow the source port to be used in RSS
> hash calculation. I don't see how this is any easier than just using GRE
> with a defined protocol field? I do like the combination of ipv6
> flowlabel + GRE.
>
No, we don't want any more port-specific configuration in NICs! The
NIC should just fall back to a 3-tuple hash when it sees MF or offset
set in the IPv4 header. But even if it doesn't implement this,
receiving OOO fragments is hardly the end of the world-- IP packets
are always allowed to be received OOO. If something breaks because
in-order delivery is assumed then that is the bug that needs to be
fixed. So at best, handling fragmentation in this manner is a proposed
optimization whose benefits will pale in comparison to getting good
ECMP and RSS when encapsulation is in use.

> Btw. people are using the GRE Key as additional entropy without looking
> into the GRE payload.
>
Sure, some are, but the GRE key is not defined to be flow entropy, so
it's not ubiquitously used for that: it may not give sufficient
entropy and may even be constant per flow. GRE/UDP (RFC8086) was
primarily written to allow a more consistent method (as was RFC7510
for MPLS/UDP).

Tom

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-04 16:15               ` Tom Herbert
@ 2017-09-04 16:52                 ` Hannes Frederic Sowa
  2017-09-04 17:11                   ` Tom Herbert
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2017-09-04 16:52 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

Hello Tom,

Tom Herbert <tom@herbertland.com> writes:

> On Mon, Sep 4, 2017 at 6:50 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Tom Herbert <tom@herbertland.com> writes:
>>
>>> An encapsulator sets the UDP source port to be the flow entropy of the
>>> packet being encapsulated. So when the packet traverses the network
>>> devices can base their hash just on the canonical 5-tuple which is
>>> sufficient for ECMP and RSS. IPv6 flow label is even better since the
>>> middleboxes don't even need to look at the transport header, a packet
>>> is steered based on the 3-tuple of addresses and flow label. In the
>>> Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
>>> the source port. Flow label is similarly set by ip6_make_flowlabel.
>>> Both of these functions use the skb->hash which is computed by calling
>>> flow dissector at most once per packet (if the packet was received
>>> with an L4 HW hash or locally originated on a connection the hash does
>>> not need to be computed).
>>
>> This would require the MPLS stack copying the flowlabel of IPv6
>> connections between MPLS routers over their whole lifetime in the MPLS
>> network. The same would hold for MPLS encapsulated inside UDP, the
>> source port needs to be kept constant. This is very impractical. The
>> hash for the flow label can often not be recomputed by interim routers,
>> because they might lack the knowledge of the upper layer protocol.
>>
> Hannes,
>
> When the flow label is set the packet will traverse the network and be
> ECMP routed regardless of whether the payload is MPLS or anything
> else-- the important characteristic is that network devices don't need
> to know how to parse MPLS (or GRE, or IPIP, or L2TP, ESP, or ...) to
> provide good ECMP. At a source the flow label or UDP source port needs
> to be generated. That can be based on DPI, derived from the MPLS
> entropy label, use SPI in ESP, etc. I don't see anything special about
> MPLS in this regard.

The MPLS circuit is only end to end in terms of IP processing if MPLS is
used for multitenant separation.

Normally the IP connection is done between two label switch routers,
thus is not end to end. One LSR will decapsulate the packet and throw
the IP header away, do the label processing and will reencapsulate it
with the new next hop information. To keep the assigned entropy alive it
would have to save the UDP source port or flowlabel and patch the
outgoing IP header again. This is certainly possible; it just seems
rather unnatural.

Normally every next hop does MPLS processing and thus the packet
traverses up the stack. Special purpose (entropy) MPLS labels allow the
stack to achieve RSS just based on the label stack and will be
end-to-end in a MPLS cloud.

>> UDP source port entropy still has the problem that we don't respect the
>> source port as RSS entropy by default in network cards, because of
>> possible fragmentation and thus possible reordering of packets. GRE does
>> not have this problem and is way easier to identify by hardware.
>>
>> Basically we need to tell network cards that they can use specific
>> destination UDP ports where we allow the source port to be used in RSS
>> hash calculation. I don't see how this is any easier than just using GRE
>> with a defined protocol field? I do like the combination of ipv6
>> flowlabel + GRE.
>>
> No, we don't want any more port-specific configuration in NICs! The
> NIC should just fall back to a 3-tuple hash when it sees MF or offset
> set in the IPv4 header. But even if it doesn't implement this,
> receiving OOO fragments is hardly the end of the world-- IP packets
> are always allowed to be received OOO. If something breaks because
> in-order delivery is assumed then that is the bug that needs to be
> fixed. So at best, handling fragmentation in this manner is a proposed
> optimization whose benefits will pale in comparison to getting good
> ECMP and RSS when encapsulation is in use.

The problem is that you end up having two streams, one fragmented and
one non-fragmented, but actually they belong to the same stream. It is
known to break stuff, see:

<https://patchwork.ozlabs.org/patch/59235/>

I would agree with you, but we can't break existing setups,
unfortunately.

>> Btw. people are using the GRE Key as additional entropy without looking
>> into the GRE payload.
>>
> Sure, some are, but the GRE key is not defined to be flow entropy, so
> it's not ubiquitously used for that: it may not give sufficient
> entropy and may even be constant per flow. GRE/UDP (RFC8086) was
> primarily written to allow a more consistent method (as was RFC7510
> for MPLS/UDP).

I agree with that, just wanted to mention it.

Bye,
Hannes

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-04 16:52                 ` Hannes Frederic Sowa
@ 2017-09-04 17:11                   ` Tom Herbert
  2017-09-04 17:57                     ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Tom Herbert @ 2017-09-04 17:11 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

> The problem is that you end up having two streams, one fragmented and
> one non-fragmented, but actually they belong to the same stream. It is
> known to break stuff, see:
>
> <https://patchwork.ozlabs.org/patch/59235/>
>
> I would agree with you, but we can't break existing setups,
> unfortunately.
>
I'm not sure what "existing setups" means here. If this is referring
to the Internet then out of order packets and fragments abound-- any
assumption of in order delivery for IP packets is useless there.

Btw, TCP has exactly the same fragmentation problem that UDP has. The
reason that TCP isn't considered an issue
is the assumption that TCP will do PMTU discovery and set DF (I would
bet that devices don't actually check for DF and vendors don't test
when it's not set!).

Tom

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-04 17:11                   ` Tom Herbert
@ 2017-09-04 17:57                     ` Hannes Frederic Sowa
  2017-09-04 18:56                       ` Tom Herbert
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2017-09-04 17:57 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

Hi Tom,

Tom Herbert <tom@herbertland.com> writes:

>> The problem is that you end up having two streams, one fragmented and
>> one non-fragmented, but actually they belong to the same stream. It is
>> known to break stuff, see:
>>
>> <https://patchwork.ozlabs.org/patch/59235/>
>>
>> I would agree with you, but we can't break existing setups,
>> unfortunately.
>>
> I'm not sure what "existing setups" means here. If this is referring
> to the Internet then out of order packets and fragments abound-- any
> assumption of in order delivery for IP packets is useless there.

Network cards don't know where traffic originates from, even on the same
card. Clouds nowadays try to convince everyone to virtualize existing
workloads. So I *assume* that, at least for traffic that seems to be
in one L2 domain, out-of-order delivery should not occur or things
will break.

Ethernet for a long time appeared to deliver very much in order and to
guarantee that. Thus to people it appeared that UDP packets are also
strictly in order. Encapsulating Ethernet inside UDP thus must preserve
those properties, especially if used for virtualization. Unfortunately
fragmentation happens and the stack has to deal with it somehow.

I do know about some software that uses UDP in multicast that is prone
to misbehave in case of OoO frames. It uses a small reassemble queue but
if reordering gets too high, it will simply drop data. This might be
acceptable up to a specific degree.

I guess one application you could harm is plain old simple syslog, which
often generates fragmented packets. Luckily one often has timestamps in
there to reassemble the log lines, but one has to do this manually.

L2 domains are not bound to local networks anymore, thanks to vxlan etc.

If you are in a controlled environment and you know that your software
runs perfectly well then certainly, no doubt, UDP hashing can be
enabled. I would argue we can't do that generally.

> Btw, TCP has exactly the same problem in this regard that UDP has with
> regard to fragmentation. The reason that TCP isn't considered an issue
> is the assumption that TCP will do PMTU discovery and set DF (I would
> bet that devices don't actually check for DF and vendors don't test
> when it's not set!).

I don't know.

For the application the stream will appear in order at the socket level
and OoO or fragmentation won't break anything. End systems will have a
harder time reassembling the stream and thus fast paths couldn't be
used. Is that what you are referring to?

Thanks,
Hannes

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-04 17:57                     ` Hannes Frederic Sowa
@ 2017-09-04 18:56                       ` Tom Herbert
  2017-09-05 11:14                         ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Tom Herbert @ 2017-09-04 18:56 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

On Mon, Sep 4, 2017 at 10:57 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Tom,
>
> Tom Herbert <tom@herbertland.com> writes:
>
>>> The problem is that you end up having two streams, one fragmented and
>>> one non-fragmented, but actually they belong to the same stream. It is
>>> known to break stuff, see:
>>>
>>> <https://patchwork.ozlabs.org/patch/59235/>
>>>
>>> I would agree with you, but we can't break existing setups,
>>> unfortunately.
>>>
>> I'm not sure what "existing setups" means here. If this is referring
>> to the Internet then out of order packets and fragments abound-- any
>> assumption of in order delivery for IP packets is useless there.
>
> Network cards don't know where traffic originates from, even on the same
> card. Clouds nowadays try to convince everyone to virtualize existing
> workloads. So I *assume* that at least for traffic that seems to be in
> one L2 domain out of order should not occur or things will break.
>
> Ethernet for a long time appeared to do very much in order delivery and
> guarantees that. Thus for people it appeared that UDP packets are also
> strictly in order. Encapsulating Ethernet inside UDP thus must preserve
> those properties, especially if used for virtualization. Unfortunately
> fragmentation happens and the stack has to deal with it somehow.
>
There is absolutely no requirement in IP that packets are delivered in
order-- there never has been and there never will be! If the ULP, like
Ethernet encapsulation, requires in-order delivery then it needs to
implement that itself like TCP, GRE, and other protocols ensure that
with sequence numbers and reassembly. All of these hoops we jump
through to make sure that packets always follow the same path and are
always in order are done for the benefit of middlebox devices like
stateful firewalls that have forced us to adopt their concept of
network architecture-- in the
long run this is self-defeating and kills our ability to innovate.

I'm not saying that we shouldn't consider legacy devices, but we
should scrutinize new development or solutions that perpetuate
incorrect design or bad assumptions.

Tom

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-04 18:56                       ` Tom Herbert
@ 2017-09-05 11:14                         ` Hannes Frederic Sowa
  2017-09-05 16:02                           ` Tom Herbert
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2017-09-05 11:14 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

Tom Herbert <tom@herbertland.com> writes:

> There is absolutely no requirement in IP that packets are delivered in
> order-- there never has been and there never will be! If the ULP, like
> Ethernet encapsulation, requires in-order delivery then it needs to
> implement that itself like TCP, GRE, and other protocols ensure that
> with sequence numbers and reassembly. All of these hoops we jump
> through to make sure that packets always follow the same path and are
> always in order are done for the benefit of middlebox devices like
> stateful firewalls that have forced us to adopt their concept of
> network architecture-- in the
> long run this is self-defeating and kills our ability to innovate.
>
> I'm not saying that we shouldn't consider legacy devices, but we
> should scrutinize new development or solutions that perpetuate
> incorrect design or bad assumptions.

So configure RSS per port and ensure no fragments are sent to those
ports. This is possible and rather easy to do. It solves the problem
with legacy software and it spreads out packets for your applications.

It is not perfect but it is working and solves both problems.

Bye,
Hannes

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-05 11:14                         ` Hannes Frederic Sowa
@ 2017-09-05 16:02                           ` Tom Herbert
  2017-09-05 19:20                             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Tom Herbert @ 2017-09-05 16:02 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Tom Herbert <tom@herbertland.com> writes:
>
>> There is absolutely no requirement in IP that packets are delivered in
>> order-- there never has been and there never will be! If the ULP, like
>> Ethernet encapsulation, requires in-order delivery then it needs to
>> implement that itself like TCP, GRE, and other protocols ensure that
>> with sequence numbers and reassembly. All of these hoops we jump
>> through to make sure that packets always follow the same path and
>> are always in order are done for the benefit of middlebox devices
>> like stateful firewalls that have forced us to adopt their concept
>> of network architecture-- in the
>> long run this is self-defeating and kills our ability to innovate.
>>
>> I'm not saying that we shouldn't consider legacy devices, but we
>> should scrutinize new development or solutions that perpetuate
>> incorrect design or bad assumptions.
>
>> So configure RSS per port and ensure no fragments are sent to those
> ports. This is possible and rather easy to do. It solves the problem
> with legacy software and it spreads out packets for your applications.
>
> It is not perfect but it is working and solves both problems.
>
Hannes,

I don't see how that solves anything. The purpose of RSS is to
distribute the load of protocol packets across queues. This needs to
work for UDP applications. For instance, if I were building a QUIC
server I'd want the sort of flow distribution that a TCP server would
give. You can't do that by configuring a few ports in the device.

If I were to suggest any HW change it would be to not do DPI on
fragments (MF or offset is set). This ensures that all packets of the
fragment train get hashed to the same queue and is in fact what RPS
has been doing for years without any complaints.

But even before I'd make that recommendation, I'd really like
understand what the problem actually is. The only thing I can garner
from this discussion and the Intel patch is that when fragments are
received OOO that is perceived as a problem. But the protocol
specification clearly says this is not a problem. So the questions
are: who is negatively affected by this? Is this a problem because
some artificial test that checks for everything to be in order is now
failing? Is this affecting real users? Is this an issue in the stack
or really with some implementation outside of the stack? If it is an
implementation outside of the stack, then are we just bandaid'ing over
someone else's incorrect implementation by patching the kernel (like
would have been the case if we changed the kernel to interoperate with
Facebook's switch that couldn't handle OOO in twstate).

Thanks,
Tom

> Bye,
> Hannes

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-05 16:02                           ` Tom Herbert
@ 2017-09-05 19:20                             ` Hannes Frederic Sowa
  2017-09-05 21:13                               ` Tom Herbert
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2017-09-05 19:20 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

Hi Tom,

Tom Herbert <tom@herbertland.com> writes:

> On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Tom Herbert <tom@herbertland.com> writes:
>>
>>> There is absolutely no requirement in IP that packets are delivered in
>>> order-- there never has been and there never will be! If the ULP, like
>>> Ethernet encapsulation, requires in-order delivery then it needs to
>>> implement that itself like TCP, GRE, and other protocols ensure that
>>> with sequence numbers and reassembly. All of these hoops we jump
>>> through to make sure that packets always follow the same path and
>>> are always in order are done for the benefit of middlebox devices
>>> like stateful firewalls that have forced us to adopt their concept
>>> of network architecture-- in the
>>> long run this is self-defeating and kills our ability to innovate.
>>>
>>> I'm not saying that we shouldn't consider legacy devices, but we
>>> should scrutinize new development or solutions that perpetuate
>>> incorrect design or bad assumptions.
>>
>> So configure RSS per port and ensure no fragments are sent to those
>> ports. This is possible and rather easy to do. It solves the problem
>> with legacy software and it spreads out packets for your applications.
>>
>> It is not perfect but it is working and solves both problems.
>>
> Hannes,
>
> I don't see how that solves anything. The purpose of RSS is to
> distribute the load of protocol packets across queues. This needs to
> work for UDP applications. For instance, if I were building a QUIC
> server I'd want the sort of flow distribution that a TCP server would
> give. You can't do that by configuring a few ports in the device.

I seriously am with you and I think you are on the right track. I would
also much rather like to see fragmentation *not* happening and would
like to see hardware logic *not* parsing deep down into packets and
protocols. We can agree on that!

Albeit I don't understand the problem with my approach:

If you know that you are running a QUIC server and know your environment
(safe against reordering), run `ethtool -N ethX rx-flow-hash udp4 sdfn'
on the box and you would get spreading of UDP packets based on the UDP
src+dst port numbers to the rx queues. Otherwise only source and
destination addresses are being considered, which is the safe default
(maybe this is sometimes enough).

Intel network cards warn you if you actually switch the mode:

-- >8 --
drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c:                                  "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n");
drivers/net/ethernet/intel/igb/igb_ethtool.c:                           "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n");
drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c:                       e_warn(drv, "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n");
-- 8< --

I argue that we can't change the default.

Furthermore I tried to argue that in a cloud where you don't know which
applications you run within the encapsulated networks, it is very hard
to predict which protocols can be considered safe against those
effects. I agree, that most of the time, it probably should not
matter. But given the rule about kernel backwards compatibility I do
have doubts that this change would go unnoticed.

I am the devil's advocate here and argue it is on us to show that it is
safe before we change the semantics.

> If I were to suggest any HW change it would be to not do DPI on
> fragments (MF or offset is set). This ensures that all packets of the
> fragment train get hashed to the same queue and is in fact what RPS
> has been doing for years without any complaints.

The same UDP flow could consist of both fragments and non-fragmented
packets. Even if hardware would act according to what you wrote,
reordering is very much possible within the same UDP flow.

In case of QUIC, as far as I know, it uses destination ports 80 and
443. Like we already notify NICs about vxlan and geneve port numbers, we
could selectively enable RSS for UDP ports on those destination port
numbers to tell the NIC it won't receive fragments on those ports and
thus do RSS. The same way the NIC could be notified if reordering on
this port is possible. This could be done for example via a lightweight
setsockopt.

The situation with encapsulation is even more complicated:

We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
constellation. If we do the fragmentation inside the vxlan tunnel and
carry over the skb hash to all resulting UDP/vxlan packets' source ports,
we are fine and reordering on the receiver NIC won't happen in this
case. If the fragmentation happens on the outer UDP header, this will
result in reordering of the inner L2 flow. Unfortunately this depends on
how the vxlan tunnel was set up, how other devices do that, and (I
believe) on the kernel version.

Given the fact that we tell the NICs the receiving port of vxlan tunnels
anyway (and I can only assume) they are doing RSS on that, we probably
have reordering anyway on those tunnels. So maybe you are right and it
doesn't matter in the end for encapsulated packets, by accident.

> But even before I'd make that recommendation, I'd really like
> understand what the problem actually is. The only thing I can garner
> from this discussion and the Intel patch is that when fragments are
> received OOO that is perceived as a problem. But the by the protocol
> specification clearly says this is not a problem. So the questions
> are: who is negatively affected by this? Is this a problem because
> some artificial test that checks for everything to be in order is now
> failing? Is this affecting real users? Is this an issue in the stack
> or really with some implementation outside of the stack? If it is an
> implementation outside of the stack, then are we just bandaid'ing over
> someone else's incorrect implementation by patching the kernel (like
> would have be the case if we change the kernel to interoperate with
> Facebook's switch that couldn't handle OOO in twstate).

I agree with that and I would also love to hear more feedback on
this. Unfortunately I don't know.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-05 19:20                             ` Hannes Frederic Sowa
@ 2017-09-05 21:13                               ` Tom Herbert
  2017-09-06  3:06                                 ` Alexander Duyck
  0 siblings, 1 reply; 30+ messages in thread
From: Tom Herbert @ 2017-09-05 21:13 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller, Linux Netdev List

> The situation with encapsulation is even more complicated:
>
> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
> constellation. If we do the fragmentation inside the vxlan tunnel and
> carry over the skb hash to all resulting UDP/vxlan packets source ports,
> we are fine and reordering on the receiver NIC won't happen in this
> case. If the fragmentation happens on the outer UDP header, this will
> result in reordering of the inner L2 flow. Unfortunately this depends on
> how the vxlan tunnel was set up, how other devices do that and (I
> believe so) on the kernel version.
>
This really isn't that complicated. The assumption that an IP network
always delivers packets in order is simply wrong. The inventors of
VXLAN must have known full well that when you use IP, packets can and
eventually will be delivered out of order. This isn't just because of
fragmentation; there are many other reasons that packets can be
delivered OOO. The same must have been known when IP/GRE and any
other protocol that carries L2 over IP was invented. If OOO is an
issue for these protocols then they need to be fixed-- this is not a
concern with the IP protocol nor with the stack.

Tom


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-05 21:13                               ` Tom Herbert
@ 2017-09-06  3:06                                 ` Alexander Duyck
  2017-09-06 16:17                                   ` Tom Herbert
  0 siblings, 1 reply; 30+ messages in thread
From: Alexander Duyck @ 2017-09-06  3:06 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Hannes Frederic Sowa, Saeed Mahameed, Saeed Mahameed,
	David S. Miller, Linux Netdev List

On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote:
>> The situation with encapsulation is even more complicated:
>>
>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>> constellation. If we do the fragmentation inside the vxlan tunnel and
>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>> we are fine and reordering on the receiver NIC won't happen in this
>> case. If the fragmentation happens on the outer UDP header, this will
>> result in reordering of the inner L2 flow. Unfortunately this depends on
>> how the vxlan tunnel was set up, how other devices do that and (I
>> believe so) on the kernel version.
>>
> This really isn't that complicated. The assumption that an IP network
> always delivers packets in order is simply wrong. The inventors of
> VXLAN must have know full well that when you use IP, packets can and
> eventually will be delivered out of order. This isn't just because of
> fragmentation, there are many other reasons that packets can be
> delivered OOO. This also must have been known when IP/GRE and any
> other protocol that carries L2 over IP was invented. If OOO is an
> issue for these protocols then they need to be fixed-- this is not a
> concern with IP protocol nor the stack.
>
> Tom

As for a little background on the original patch: I believe the issue
it fixed was a video streaming application that was sending and
receiving a mix of fragmented and non-fragmented packets. Receiving
them out of order due to the fragmentation was causing stutters in the
video, so we ended up disabling UDP RSS by default in the NICs listed.
We decided to go that way because UDP RSS was viewed as a performance
optimization, while the out-of-order problems were viewed as a
functionality issue.

The default for i40e is to have UDP RSS hashing enabled, if I recall
correctly. Basically, as we move away from enterprise toward cloud I
suspect that is going to be the case more and more, since all the UDP
tunnels require either port recognition or UDP RSS to be enabled. For
now we carry the ability to enable UDP RSS if desired in the legacy
drivers, and I believe we have some white papers somewhere that
suggest enabling it if you are planning to use UDP-based tunnels.
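
For what it's worth, on drivers that expose it, the per-protocol RX hash fields can be inspected and changed with ethtool (the device name here is just an example):

```shell
# Show which header fields are currently hashed for UDP-over-IPv4 RX flows
ethtool -n eth0 rx-flow-hash udp4

# Opt in to 4-tuple hashing for UDP: s/d = src/dst IP, f/n = src/dst port
ethtool -N eth0 rx-flow-hash udp4 sdfn
```

This is the knob a user would flip to re-enable UDP RSS on drivers that default it off.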

- Alex


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-06  3:06                                 ` Alexander Duyck
@ 2017-09-06 16:17                                   ` Tom Herbert
  2017-09-06 17:43                                     ` Alexander Duyck
  0 siblings, 1 reply; 30+ messages in thread
From: Tom Herbert @ 2017-09-06 16:17 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Hannes Frederic Sowa, Saeed Mahameed, Saeed Mahameed,
	David S. Miller, Linux Netdev List

On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> The situation with encapsulation is even more complicated:
>>>
>>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>>> constellation. If we do the fragmentation inside the vxlan tunnel and
>>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>>> we are fine and reordering on the receiver NIC won't happen in this
>>> case. If the fragmentation happens on the outer UDP header, this will
>>> result in reordering of the inner L2 flow. Unfortunately this depends on
>>> how the vxlan tunnel was set up, how other devices do that and (I
>>> believe so) on the kernel version.
>>>
>> This really isn't that complicated. The assumption that an IP network
>> always delivers packets in order is simply wrong. The inventors of
>> VXLAN must have know full well that when you use IP, packets can and
>> eventually will be delivered out of order. This isn't just because of
>> fragmentation, there are many other reasons that packets can be
>> delivered OOO. This also must have been known when IP/GRE and any
>> other protocol that carries L2 over IP was invented. If OOO is an
>> issue for these protocols then they need to be fixed-- this is not a
>> concern with IP protocol nor the stack.
>>
>> Tom
>
> As far as a little background on the original patch I believe the
> issue that was fixed by the patch was a video streaming application
> that was sending/receiving a mix of fragmented and non-fragmented
> packets. Receiving them out of order due to the fragmentation was
> causing issues with stutters in the video and so we ended up disabling
> UDP by default in the NICs listed. We decided to go that way as UDP
> RSS was viewed as a performance optimization, while the out-of-order
> problems were viewed as a functionality issue.
>
Hi Alex,

Thanks for the details! Were you able to find the root cause for this?
In particular, it would be interesting to know if it is the kernel or
device that introduced the jitter, or if it's the application that
doesn't handle OOO well...

Tom


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-06 16:17                                   ` Tom Herbert
@ 2017-09-06 17:43                                     ` Alexander Duyck
  2017-09-06 19:01                                       ` Tom Herbert
  0 siblings, 1 reply; 30+ messages in thread
From: Alexander Duyck @ 2017-09-06 17:43 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Hannes Frederic Sowa, Saeed Mahameed, Saeed Mahameed,
	David S. Miller, Linux Netdev List

On Wed, Sep 6, 2017 at 9:17 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>> The situation with encapsulation is even more complicated:
>>>>
>>>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>>>> constellation. If we do the fragmentation inside the vxlan tunnel and
>>>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>>>> we are fine and reordering on the receiver NIC won't happen in this
>>>> case. If the fragmentation happens on the outer UDP header, this will
>>>> result in reordering of the inner L2 flow. Unfortunately this depends on
>>>> how the vxlan tunnel was set up, how other devices do that and (I
>>>> believe so) on the kernel version.
>>>>
>>> This really isn't that complicated. The assumption that an IP network
>>> always delivers packets in order is simply wrong. The inventors of
>>> VXLAN must have know full well that when you use IP, packets can and
>>> eventually will be delivered out of order. This isn't just because of
>>> fragmentation, there are many other reasons that packets can be
>>> delivered OOO. This also must have been known when IP/GRE and any
>>> other protocol that carries L2 over IP was invented. If OOO is an
>>> issue for these protocols then they need to be fixed-- this is not a
>>> concern with IP protocol nor the stack.
>>>
>>> Tom
>>
>> As far as a little background on the original patch I believe the
>> issue that was fixed by the patch was a video streaming application
>> that was sending/receiving a mix of fragmented and non-fragmented
>> packets. Receiving them out of order due to the fragmentation was
>> causing issues with stutters in the video and so we ended up disabling
>> UDP by default in the NICs listed. We decided to go that way as UDP
>> RSS was viewed as a performance optimization, while the out-of-order
>> problems were viewed as a functionality issue.
>>
> Hi Alex,
>
> Thanks for the details! Were you able to find the root cause for this?
> In particular, it would be interesting to know if it is the kernel or
> device that introduced the jitter, or if it's the application that
> doesn't handle OOO well...
>
> Tom

It is hard to say, since my memory of the events from 7 years ago is
pretty vague at this point, but I'm pretty sure it was the
application. Basically, getting the frames out of order was forcing it
to drop video data, if I recall correctly.

- Alex


* Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
  2017-09-06 17:43                                     ` Alexander Duyck
@ 2017-09-06 19:01                                       ` Tom Herbert
  0 siblings, 0 replies; 30+ messages in thread
From: Tom Herbert @ 2017-09-06 19:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Hannes Frederic Sowa, Saeed Mahameed, Saeed Mahameed,
	David S. Miller, Linux Netdev List

On Wed, Sep 6, 2017 at 10:43 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Sep 6, 2017 at 9:17 AM, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>>> The situation with encapsulation is even more complicated:
>>>>>
>>>>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>>>>> constellation. If we do the fragmentation inside the vxlan tunnel and
>>>>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>>>>> we are fine and reordering on the receiver NIC won't happen in this
>>>>> case. If the fragmentation happens on the outer UDP header, this will
>>>>> result in reordering of the inner L2 flow. Unfortunately this depends on
>>>>> how the vxlan tunnel was set up, how other devices do that and (I
>>>>> believe so) on the kernel version.
>>>>>
>>>> This really isn't that complicated. The assumption that an IP network
>>>> always delivers packets in order is simply wrong. The inventors of
>>>> VXLAN must have know full well that when you use IP, packets can and
>>>> eventually will be delivered out of order. This isn't just because of
>>>> fragmentation, there are many other reasons that packets can be
>>>> delivered OOO. This also must have been known when IP/GRE and any
>>>> other protocol that carries L2 over IP was invented. If OOO is an
>>>> issue for these protocols then they need to be fixed-- this is not a
>>>> concern with IP protocol nor the stack.
>>>>
>>>> Tom
>>>
>>> As far as a little background on the original patch I believe the
>>> issue that was fixed by the patch was a video streaming application
>>> that was sending/receiving a mix of fragmented and non-fragmented
>>> packets. Receiving them out of order due to the fragmentation was
>>> causing issues with stutters in the video and so we ended up disabling
>>> UDP by default in the NICs listed. We decided to go that way as UDP
>>> RSS was viewed as a performance optimization, while the out-of-order
>>> problems were viewed as a functionality issue.
>>>
>> Hi Alex,
>>
>> Thanks for the details! Were you able to find the root cause for this?
>> In particular, it would be interesting to know if it is the kernel or
>> device that introduced the jitter, or if it's the application that
>> doesn't handle OOO well...
>>
>> Tom
>
> It is hard to say since my memory of the events from 7 years ago is
> pretty vague at this point, but I'm pretty sure it was the
> application. Basically getting the frames out of order was causing
> them to have to drop video data if I recall correctly.
>
Oh, I didn't notice that patch was from 2010. Maybe the application
has been fixed by now! :-)

Perhaps it's time to try turning UDP hashing on again by default?
Even if NICs aren't doing this, there are network devices already
looking at UDP ports for ECMP, and that doesn't seem to be causing
widespread problems currently...

Tom


end of thread, other threads:[~2017-09-06 19:01 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-30 23:04 [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads Saeed Mahameed
2017-08-30 23:04 ` [net-next 1/3] net/mlx5e: Use IP version matching to classify IP traffic Saeed Mahameed
2017-08-30 23:04 ` [net-next 2/3] net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels Saeed Mahameed
2017-08-30 23:04 ` [net-next 3/3] net/mlx5e: Support RSS for GRE tunneled packets Saeed Mahameed
2017-08-31  5:15 ` [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads David Miller
2017-08-31 13:51 ` Hannes Frederic Sowa
2017-09-02 23:01   ` Saeed Mahameed
2017-09-03  1:32     ` Hannes Frederic Sowa
2017-09-03  1:37       ` Tom Herbert
2017-09-03  4:11         ` Saeed Mahameed
2017-09-03 15:43           ` Tom Herbert
2017-09-03 16:17             ` Or Gerlitz
2017-09-03 16:45               ` Tom Herbert
2017-09-03 18:58                 ` Or Gerlitz
2017-09-04 13:50             ` Hannes Frederic Sowa
2017-09-04 16:15               ` Tom Herbert
2017-09-04 16:52                 ` Hannes Frederic Sowa
2017-09-04 17:11                   ` Tom Herbert
2017-09-04 17:57                     ` Hannes Frederic Sowa
2017-09-04 18:56                       ` Tom Herbert
2017-09-05 11:14                         ` Hannes Frederic Sowa
2017-09-05 16:02                           ` Tom Herbert
2017-09-05 19:20                             ` Hannes Frederic Sowa
2017-09-05 21:13                               ` Tom Herbert
2017-09-06  3:06                                 ` Alexander Duyck
2017-09-06 16:17                                   ` Tom Herbert
2017-09-06 17:43                                     ` Alexander Duyck
2017-09-06 19:01                                       ` Tom Herbert
2017-09-03  4:07       ` Saeed Mahameed
2017-09-04  9:37         ` Hannes Frederic Sowa
