All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/5]  qed/qede: Tunnel hardware GRO support
@ 2016-06-22  8:25 Manish Chopra
  2016-06-22  8:25 ` [PATCH net-next 1/5] net: export udp and gre gro_complete() APIs Manish Chopra
                   ` (5 more replies)
  0 siblings, 6 replies; 36+ messages in thread
From: Manish Chopra @ 2016-06-22  8:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz

Hi David,

This series adds driver support for the processing of tunnel
[specifically vxlan/geneve/gre tunnels] packets which are
aggregated [GROed] by the hardware before driver passes
such packets upto the stack.

Patch 1 - General infrastructure change

Exported UDP and GRE gro_complete APIs to be revealed by driver
to complete/process hardware assisted GRO for encapsulated packets,
as gro_complete() APIs are partially exposed.
[sufficient for TCP, but not enough for udp/gre tunnels].

Patch 2/3/4 -

These patches add support for handling tunnel packets
[which are GROed by the adapter] in the driver receive data path.

Patch 5 -

This actually enables GRO functionality for tunnel packets on the adapter.

Please consider applying this series to "net-next"

Thanks,
Manish

Manish Chopra (5):
  net: export udp and gre gro_complete() APIs
  qede: Add support to handle VXLAN hardware GRO packets
  qede: Add support to handle GENEVE hardware GRO packets
  qede: Add support to handle GRE hardware GRO packets
  qed: Enable hardware GRO feature for encapsulated packets

 drivers/net/ethernet/qlogic/qed/qed_l2.c     |   2 +
 drivers/net/ethernet/qlogic/qede/qede.h      |   1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c | 163 ++++++++++++++++++++++++---
 include/net/gre.h                            |   3 +
 include/net/udp.h                            |   6 +
 net/ipv4/gre_offload.c                       |   3 +-
 net/ipv4/udp_offload.c                       |   3 +-
 net/ipv6/udp_offload.c                       |   3 +-
 8 files changed, 163 insertions(+), 21 deletions(-)

-- 
2.7.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH net-next 1/5] net: export udp and gre gro_complete() APIs
  2016-06-22  8:25 [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Manish Chopra
@ 2016-06-22  8:25 ` Manish Chopra
  2016-06-22  8:25 ` [PATCH net-next 2/5] qede: Add support to handle VXLAN hardware GRO packets Manish Chopra
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Manish Chopra @ 2016-06-22  8:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz

This patch exports relevant APIs needed to be used by this driver
to handle hardware assisted encapsulated GRO packets.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
---
 include/net/gre.h      | 3 +++
 include/net/udp.h      | 6 ++++++
 net/ipv4/gre_offload.c | 3 ++-
 net/ipv4/udp_offload.c | 3 ++-
 net/ipv6/udp_offload.c | 3 ++-
 5 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/net/gre.h b/include/net/gre.h
index 5dce30a..a5fe689 100644
--- a/include/net/gre.h
+++ b/include/net/gre.h
@@ -23,6 +23,9 @@ struct gre_protocol {
 int gre_add_protocol(const struct gre_protocol *proto, u8 version);
 int gre_del_protocol(const struct gre_protocol *proto, u8 version);
 
+/* gre_gro_complete exported for drivers implementing hardware GRO */
+int gre_gro_complete(struct sk_buff *skb, int nhoff);
+
 struct net_device *gretap_fb_dev_create(struct net *net, const char *name,
 				       u8 name_assign_type);
 int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
diff --git a/include/net/udp.h b/include/net/udp.h
index 8894d71..ac6b19b 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -174,6 +174,12 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
 				 struct udphdr *uh, udp_lookup_t lookup);
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
 
+/* udp4_gro_complete exported for drivers implementing hardware GRO */
+int udp4_gro_complete(struct sk_buff *skb, int nhoff);
+
+/* udp6_gro_complete exported for drivers implementing hardware GRO */
+int udp6_gro_complete(struct sk_buff *skb, int nhoff);
+
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
 	struct udphdr *uh;
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index ecd1e09..c09c7e9 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -238,7 +238,7 @@ out:
 	return pp;
 }
 
-static int gre_gro_complete(struct sk_buff *skb, int nhoff)
+int gre_gro_complete(struct sk_buff *skb, int nhoff)
 {
 	struct gre_base_hdr *greh = (struct gre_base_hdr *)(skb->data + nhoff);
 	struct packet_offload *ptype;
@@ -267,6 +267,7 @@ static int gre_gro_complete(struct sk_buff *skb, int nhoff)
 
 	return err;
 }
+EXPORT_SYMBOL(gre_gro_complete);
 
 static const struct net_offload gre_offload = {
 	.callbacks = {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 81f253b..b4899bb 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -359,7 +359,7 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
 }
 EXPORT_SYMBOL(udp_gro_complete);
 
-static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
+int udp4_gro_complete(struct sk_buff *skb, int nhoff)
 {
 	const struct iphdr *iph = ip_hdr(skb);
 	struct udphdr *uh = (struct udphdr *)(skb->data + nhoff);
@@ -374,6 +374,7 @@ static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
 
 	return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
+EXPORT_SYMBOL(udp4_gro_complete);
 
 static const struct net_offload udpv4_offload = {
 	.callbacks = {
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index ac858c4..827cd20 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -147,7 +147,7 @@ flush:
 	return NULL;
 }
 
-static int udp6_gro_complete(struct sk_buff *skb, int nhoff)
+int udp6_gro_complete(struct sk_buff *skb, int nhoff)
 {
 	const struct ipv6hdr *ipv6h = ipv6_hdr(skb);
 	struct udphdr *uh = (struct udphdr *)(skb->data + nhoff);
@@ -162,6 +162,7 @@ static int udp6_gro_complete(struct sk_buff *skb, int nhoff)
 
 	return udp_gro_complete(skb, nhoff, udp6_lib_lookup_skb);
 }
+EXPORT_SYMBOL(udp6_gro_complete);
 
 static const struct net_offload udpv6_offload = {
 	.callbacks = {
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next 2/5] qede: Add support to handle VXLAN hardware GRO packets
  2016-06-22  8:25 [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Manish Chopra
  2016-06-22  8:25 ` [PATCH net-next 1/5] net: export udp and gre gro_complete() APIs Manish Chopra
@ 2016-06-22  8:25 ` Manish Chopra
  2016-06-22  8:25 ` [PATCH net-next 3/5] qede: Add support to handle GENEVE " Manish Chopra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Manish Chopra @ 2016-06-22  8:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz

This patch adds support to do necessary processing
for hardware assisted VXLAN tunnel GRO packets before
driver delivers them upto the stack.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qede/qede.h      |  1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c | 79 ++++++++++++++++++++++++----
 2 files changed, 70 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h b/drivers/net/ethernet/qlogic/qede/qede.h
index 1441c8f..d8ec269 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -213,6 +213,7 @@ struct qede_agg_info {
 	struct sk_buff *skb;
 	int frag_id;
 	u16 vlan_tag;
+	u8 tunnel_type;
 };
 
 struct qede_rx_queue {
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 2972742..f94ea16 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -36,6 +36,7 @@
 #include <linux/random.h>
 #include <net/ip6_checksum.h>
 #include <linux/bitops.h>
+#include <net/vxlan.h>
 
 #include "qede.h"
 
@@ -1048,6 +1049,12 @@ out:
 	return -ENOMEM;
 }
 
+static bool qede_tunn_exist(u16 flag)
+{
+	return !!(flag & (PARSING_AND_ERR_FLAGS_TUNNELEXIST_MASK <<
+			  PARSING_AND_ERR_FLAGS_TUNNELEXIST_SHIFT));
+}
+
 static void qede_tpa_start(struct qede_dev *edev,
 			   struct qede_rx_queue *rxq,
 			   struct eth_fast_path_rx_tpa_start_cqe *cqe)
@@ -1065,6 +1072,14 @@ static void qede_tpa_start(struct qede_dev *edev,
 	sw_rx_data_cons = &rxq->sw_rx_ring[rxq->sw_rx_cons & NUM_RX_BDS_MAX];
 	sw_rx_data_prod = &rxq->sw_rx_ring[rxq->sw_rx_prod & NUM_RX_BDS_MAX];
 
+	if (qede_tunn_exist(le16_to_cpu(cqe->pars_flags.flags))) {
+		u8 flags = cqe->tunnel_pars_flags.flags, shift;
+
+		shift = ETH_TUNNEL_PARSING_FLAGS_TYPE_SHIFT;
+		tpa_info->tunnel_type = (flags >> shift) &
+					 ETH_TUNNEL_PARSING_FLAGS_TYPE_MASK;
+	}
+
 	/* Use pre-allocated replacement buffer - we can't release the agg.
 	 * start until its over and we don't want to risk allocation failing
 	 * here, so re-allocate when aggregation will be over.
@@ -1159,12 +1174,55 @@ static void qede_gro_ipv6_csum(struct sk_buff *skb)
 				  &iph->saddr, &iph->daddr, 0);
 	tcp_gro_complete(skb);
 }
+
+static void qede_set_nh_th_offset(struct sk_buff *skb, int off)
+{
+	skb_set_network_header(skb, off);
+
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) {
+		off += sizeof(struct iphdr);
+		skb_set_transport_header(skb, off);
+	} else {
+		off += sizeof(struct ipv6hdr);
+		skb_set_transport_header(skb, off);
+	}
+}
+
+static void qede_handle_vxlan_tunnel_gro(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		qede_set_nh_th_offset(skb, VXLAN_HEADROOM);
+		udp4_gro_complete(skb, sizeof(struct iphdr));
+		break;
+	case htons(ETH_P_IPV6):
+		qede_set_nh_th_offset(skb, VXLAN6_HEADROOM);
+		udp6_gro_complete(skb, sizeof(struct ipv6hdr));
+		break;
+	default:
+		WARN_ONCE(1, "Unsupported VXLAN tunnel GRO proto=0x%x\n",
+			  skb->protocol);
+	}
+}
+
+static void qede_handle_tunnel_gro(struct qede_dev *edev,
+				   struct sk_buff *skb, u8 tunnel_type)
+{
+	switch (tunnel_type) {
+	case ETH_RX_TUNN_VXLAN:
+		qede_handle_vxlan_tunnel_gro(skb);
+		break;
+	default:
+		WARN_ONCE(1, "Unsupported tunnel GRO, tunnel type=0x%x\n",
+			  tunnel_type);
+	}
+}
 #endif
 
 static void qede_gro_receive(struct qede_dev *edev,
 			     struct qede_fastpath *fp,
 			     struct sk_buff *skb,
-			     u16 vlan_tag)
+			     struct qede_agg_info *tpa_info)
 {
 	/* FW can send a single MTU sized packet from gro flow
 	 * due to aggregation timeout/last segment etc. which
@@ -1179,6 +1237,12 @@ static void qede_gro_receive(struct qede_dev *edev,
 
 #ifdef CONFIG_INET
 	if (skb_shinfo(skb)->gso_size) {
+		if (tpa_info->tunnel_type) {
+			qede_handle_tunnel_gro(edev, skb,
+					       tpa_info->tunnel_type);
+			goto send_skb;
+		}
+
 		skb_set_network_header(skb, 0);
 
 		switch (skb->protocol) {
@@ -1198,7 +1262,7 @@ static void qede_gro_receive(struct qede_dev *edev,
 
 send_skb:
 	skb_record_rx_queue(skb, fp->rss_id);
-	qede_skb_receive(edev, fp, skb, vlan_tag);
+	qede_skb_receive(edev, fp, skb, tpa_info->vlan_tag);
 }
 
 static inline void qede_tpa_cont(struct qede_dev *edev,
@@ -1267,10 +1331,10 @@ static void qede_tpa_end(struct qede_dev *edev,
 	 */
 	NAPI_GRO_CB(skb)->count = le16_to_cpu(cqe->num_of_coalesced_segs);
 
-	qede_gro_receive(edev, fp, skb, tpa_info->vlan_tag);
+	qede_gro_receive(edev, fp, skb, tpa_info);
 
 	tpa_info->agg_state = QEDE_AGG_STATE_NONE;
-
+	tpa_info->tunnel_type = 0;
 	return;
 err:
 	/* The BD starting the aggregation is still mapped; Re-use it for
@@ -1283,12 +1347,7 @@ err:
 	tpa_info->agg_state = QEDE_AGG_STATE_NONE;
 	dev_kfree_skb_any(tpa_info->skb);
 	tpa_info->skb = NULL;
-}
-
-static bool qede_tunn_exist(u16 flag)
-{
-	return !!(flag & (PARSING_AND_ERR_FLAGS_TUNNELEXIST_MASK <<
-			  PARSING_AND_ERR_FLAGS_TUNNELEXIST_SHIFT));
+	tpa_info->tunnel_type = 0;
 }
 
 static u8 qede_check_tunn_csum(u16 flag)
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next 3/5] qede: Add support to handle GENEVE hardware GRO packets
  2016-06-22  8:25 [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Manish Chopra
  2016-06-22  8:25 ` [PATCH net-next 1/5] net: export udp and gre gro_complete() APIs Manish Chopra
  2016-06-22  8:25 ` [PATCH net-next 2/5] qede: Add support to handle VXLAN hardware GRO packets Manish Chopra
@ 2016-06-22  8:25 ` Manish Chopra
  2016-06-22  8:25 ` [PATCH net-next 4/5] qede: Add support to handle GRE " Manish Chopra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Manish Chopra @ 2016-06-22  8:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz

This patch adds support to do necessary processing
for hardware assisted GENEVE tunnel GRO packets before
driver delivers them upto the stack.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 30 ++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index f94ea16..a4b445f 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -37,6 +37,7 @@
 #include <net/ip6_checksum.h>
 #include <linux/bitops.h>
 #include <net/vxlan.h>
+#include <net/geneve.h>
 
 #include "qede.h"
 
@@ -1205,6 +1206,32 @@ static void qede_handle_vxlan_tunnel_gro(struct sk_buff *skb)
 	}
 }
 
+static inline struct genevehdr *
+qede_geneve_hdr(struct sk_buff *skb, int off) {
+	return (struct genevehdr *)(skb->data + off + sizeof(struct udphdr));
+}
+
+static void qede_handle_geneve_tunnel_gro(struct sk_buff *skb)
+{
+	struct genevehdr *gh;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		gh = qede_geneve_hdr(skb, sizeof(struct iphdr));
+		qede_set_nh_th_offset(skb, VXLAN_HEADROOM + (gh->opt_len * 4));
+		udp4_gro_complete(skb, sizeof(struct iphdr));
+		break;
+	case htons(ETH_P_IPV6):
+		gh = qede_geneve_hdr(skb, sizeof(struct ipv6hdr));
+		qede_set_nh_th_offset(skb, VXLAN6_HEADROOM + (gh->opt_len * 4));
+		udp6_gro_complete(skb, sizeof(struct ipv6hdr));
+		break;
+	default:
+		WARN_ONCE(1, "Unsupported GENEVE tunnel GRO proto=0x%x\n",
+			  skb->protocol);
+	}
+}
+
 static void qede_handle_tunnel_gro(struct qede_dev *edev,
 				   struct sk_buff *skb, u8 tunnel_type)
 {
@@ -1212,6 +1239,9 @@ static void qede_handle_tunnel_gro(struct qede_dev *edev,
 	case ETH_RX_TUNN_VXLAN:
 		qede_handle_vxlan_tunnel_gro(skb);
 		break;
+	case ETH_RX_TUNN_GENEVE:
+		qede_handle_geneve_tunnel_gro(skb);
+		break;
 	default:
 		WARN_ONCE(1, "Unsupported tunnel GRO, tunnel type=0x%x\n",
 			  tunnel_type);
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next 4/5] qede: Add support to handle GRE hardware GRO packets
  2016-06-22  8:25 [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Manish Chopra
                   ` (2 preceding siblings ...)
  2016-06-22  8:25 ` [PATCH net-next 3/5] qede: Add support to handle GENEVE " Manish Chopra
@ 2016-06-22  8:25 ` Manish Chopra
  2016-06-22  8:25 ` [PATCH net-next 5/5] qed: Enable hardware GRO feature for encapsulated packets Manish Chopra
  2016-06-22 16:27 ` [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Alexander Duyck
  5 siblings, 0 replies; 36+ messages in thread
From: Manish Chopra @ 2016-06-22  8:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz

This patch adds support to do necessary processing
for hardware assisted GRE tunnel GRO packets before
driver delivers them upto the stack.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 54 +++++++++++++++++++++++-----
 1 file changed, 46 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index a4b445f..787aef0 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -38,6 +38,7 @@
 #include <linux/bitops.h>
 #include <net/vxlan.h>
 #include <net/geneve.h>
+#include <net/gre.h>
 
 #include "qede.h"
 
@@ -1152,10 +1153,7 @@ cons_buf: /* We still need to handle bd_len_list to consume buffers */
 static void qede_gro_ip_csum(struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
-	struct tcphdr *th;
-
-	skb_set_transport_header(skb, sizeof(struct iphdr));
-	th = tcp_hdr(skb);
+	struct tcphdr *th = tcp_hdr(skb);
 
 	th->check = ~tcp_v4_check(skb->len - skb_transport_offset(skb),
 				  iph->saddr, iph->daddr, 0);
@@ -1166,10 +1164,7 @@ static void qede_gro_ip_csum(struct sk_buff *skb)
 static void qede_gro_ipv6_csum(struct sk_buff *skb)
 {
 	struct ipv6hdr *iph = ipv6_hdr(skb);
-	struct tcphdr *th;
-
-	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
-	th = tcp_hdr(skb);
+	struct tcphdr *th = tcp_hdr(skb);
 
 	th->check = ~tcp_v6_check(skb->len - skb_transport_offset(skb),
 				  &iph->saddr, &iph->daddr, 0);
@@ -1232,6 +1227,44 @@ static void qede_handle_geneve_tunnel_gro(struct sk_buff *skb)
 	}
 }
 
+static void qede_handle_gre_tunnel_gro(struct sk_buff *skb)
+{
+	unsigned int grehlen, gre_headroom;
+	struct gre_base_hdr *greh;
+	int nhoff = 0;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		nhoff = sizeof(struct iphdr);
+		break;
+	case htons(ETH_P_IPV6):
+		nhoff = sizeof(struct ipv6hdr);
+		break;
+	default:
+		WARN_ONCE(1, "Unsupported GRE tunnel GRO, proto=0x%x\n",
+			  skb->protocol);
+	}
+
+	greh = (struct gre_base_hdr *)(skb->data + nhoff);
+	grehlen = sizeof(*greh);
+
+	if (greh->flags & GRE_KEY)
+		grehlen += GRE_HEADER_SECTION;
+
+	if (greh->flags & GRE_CSUM)
+		grehlen += GRE_HEADER_SECTION;
+
+	gre_headroom = nhoff + grehlen;
+
+	/* L2 GRE */
+	if (greh->protocol == htons(ETH_P_TEB))
+		gre_headroom += ETH_HLEN;
+
+	qede_set_nh_th_offset(skb, gre_headroom);
+
+	gre_gro_complete(skb, nhoff);
+}
+
 static void qede_handle_tunnel_gro(struct qede_dev *edev,
 				   struct sk_buff *skb, u8 tunnel_type)
 {
@@ -1242,6 +1275,9 @@ static void qede_handle_tunnel_gro(struct qede_dev *edev,
 	case ETH_RX_TUNN_GENEVE:
 		qede_handle_geneve_tunnel_gro(skb);
 		break;
+	case ETH_RX_TUNN_GRE:
+		qede_handle_gre_tunnel_gro(skb);
+		break;
 	default:
 		WARN_ONCE(1, "Unsupported tunnel GRO, tunnel type=0x%x\n",
 			  tunnel_type);
@@ -1277,9 +1313,11 @@ static void qede_gro_receive(struct qede_dev *edev,
 
 		switch (skb->protocol) {
 		case htons(ETH_P_IP):
+			skb_set_transport_header(skb, sizeof(struct iphdr));
 			qede_gro_ip_csum(skb);
 			break;
 		case htons(ETH_P_IPV6):
+			skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 			qede_gro_ipv6_csum(skb);
 			break;
 		default:
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next 5/5] qed: Enable hardware GRO feature for encapsulated packets
  2016-06-22  8:25 [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Manish Chopra
                   ` (3 preceding siblings ...)
  2016-06-22  8:25 ` [PATCH net-next 4/5] qede: Add support to handle GRE " Manish Chopra
@ 2016-06-22  8:25 ` Manish Chopra
  2016-06-22 16:27 ` [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Alexander Duyck
  5 siblings, 0 replies; 36+ messages in thread
From: Manish Chopra @ 2016-06-22  8:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz

This patch actually enables the hardware to perform GRO over
encapsulated Ipv4/Ipv6 packets.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qed/qed_l2.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_l2.c b/drivers/net/ethernet/qlogic/qed/qed_l2.c
index d121a8b..e56981d 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_l2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_l2.c
@@ -92,6 +92,8 @@ int qed_sp_eth_vport_start(struct qed_hwfn *p_hwfn,
 		p_ramrod->tpa_param.tpa_min_size_to_start = p_params->mtu / 2;
 		p_ramrod->tpa_param.tpa_ipv4_en_flg = 1;
 		p_ramrod->tpa_param.tpa_ipv6_en_flg = 1;
+		p_ramrod->tpa_param.tpa_ipv4_tunn_en_flg = 1;
+		p_ramrod->tpa_param.tpa_ipv6_tunn_en_flg = 1;
 		p_ramrod->tpa_param.tpa_pkt_split_flg = 1;
 		p_ramrod->tpa_param.tpa_gro_consistent_flg = 1;
 		break;
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22  8:25 [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Manish Chopra
                   ` (4 preceding siblings ...)
  2016-06-22  8:25 ` [PATCH net-next 5/5] qed: Enable hardware GRO feature for encapsulated packets Manish Chopra
@ 2016-06-22 16:27 ` Alexander Duyck
  2016-06-22 17:16   ` Yuval Mintz
  5 siblings, 1 reply; 36+ messages in thread
From: Alexander Duyck @ 2016-06-22 16:27 UTC (permalink / raw)
  To: Manish Chopra
  Cc: David Miller, Netdev, Ariel Elior, Yuval Mintz, Tom Herbert,
	Hannes Frederic Sowa

On Wed, Jun 22, 2016 at 1:25 AM, Manish Chopra <manish.chopra@qlogic.com> wrote:
> Hi David,
>
> This series adds driver support for the processing of tunnel
> [specifically vxlan/geneve/gre tunnels] packets which are
> aggregated [GROed] by the hardware before driver passes
> such packets upto the stack.

First off I am pretty sure this isn't GRO.  This is LRO.  The
distinction is that LRO is performed in hardware and/or the driver
while GRO is performed by the kernel.  Think of this as the same
distinction between GSO and TSO.  The reason why we want to make
certain to keep them separate is that LRO has a bad habit of mangling
frames in ways that can be counter productive.  It also is not
extensible since it is implemented in hardware.  The best way to think
of it is that LRO is a subset of GRO.  If done correctly the LRO can
be as good as GRO, but if not there are usually some pieces of the
frame that end up being mangled which make it difficult to undo later.
As such I would recommend first going through and updating your "GRO"
feature to use the correct "LRO" feature flag if you haven't already.
The drivers should not really be messing with the GRO bit and should
be using the LRO bit to indicate this type of feature.

Also I don't know if you have been paying attention to recent
discussions on the mailing list but the fact is GRO over UDP tunnels
is still a subject for debate.  This patch takes things in the
opposite direction of where we are currently talking about going with
GRO.  I've added Hannes and Tom to this discussion just to make sure I
have the proper understanding of all this as my protocol knowledge is
somewhat lacking.

Ideally we need to be able to identify that a given packet terminates
on a local socket in our namespace before we could begin to perform
any sort of mangling on the local packets.  It is always possible that
we could be looking at a frame that uses the same UDP port but is not
the tunnel protocol if we are performing bridging or routing.  The
current GRO implementation fails in that regard and there are
discussions between several of us on how to deal with that.  It is
likely that we would be forcing GRO for tunnels to move a bit further
up the stack if bridging or routing so that we could verify that the
frame is not being routed back out before we could aggregate it.

Also I would be interested in knowing how your hardware handles
tunnels with outer checksums.  Is it just ignoring the frames in such
a case, ignoring the checksum, or is it actually validating the frames
and then merging the resulting checksum?

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 16:27 ` [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Alexander Duyck
@ 2016-06-22 17:16   ` Yuval Mintz
  2016-06-22 17:45     ` Alexander Duyck
  0 siblings, 1 reply; 36+ messages in thread
From: Yuval Mintz @ 2016-06-22 17:16 UTC (permalink / raw)
  To: Alexander Duyck, Manish Chopra
  Cc: David Miller, netdev, Ariel Elior, Tom Herbert, Hannes Frederic Sowa

>> This series adds driver support for the processing of tunnel
>> [specifically vxlan/geneve/gre tunnels] packets which are
>> aggregated [GROed] by the hardware before driver passes
>> such packets upto the stack.

> First off I am pretty sure this isn't GRO.  This is LRO. 
Nopes. This is GRO - each MSS-sized frame will arrive on its
own frag, whereas the headers [both  internal & external]
would arrive on the linear part of the SKB.

> Also I don't know if you have been paying attention to recent
> discussions on the mailing list but the fact is GRO over UDP tunnels
> is still a subject for debate.  This patch takes things in the
> opposite direction of where we are currently talking about going with
> GRO.  I've added Hannes and Tom to this discussion just to make sure I
> have the proper understanding of all this as my protocol knowledge is
> somewhat lacking.
I guess we're on the exact opposite side of the discussion - I.e., we're
the vendor that tries pushing offload capabilities to the device.
Do notice however that we're not planning on pushing anything new
feature-wise, just like we haven't done anything for regular TCP GRO -
All we do is allow our HW/FW to aggregate packets instead of stack.

> Ideally we need to be able to identify that a given packet terminates
> on a local socket in our namespace before we could begin to perform
> any sort of mangling on the local packets.  It is always possible that
> we could be looking at a frame that uses the same UDP port but is not
> the tunnel protocol if we are performing bridging or routing.  The
> current GRO implementation fails in that regard and there are
> discussions between several of us on how to deal with that.  It is
> likely that we would be forcing GRO for tunnels to move a bit further
> up the stack if bridging or routing so that we could verify that the
> frame is not being routed back out before we could aggregate it.
I'm aware of the on-going discussion, but I'm not sure this should
bother us greatly - the aggregation is done based on the
inner TCP connection; I.e., it's guaranteed all the frames belong to
the same TCP connection. While HW requires the UDP port for the
initial classification effort, it doesn't take decision solely on that.

> Also I would be interested in knowing how your hardware handles
> tunnels with outer checksums.  Is it just ignoring the frames in such
> a case, ignoring the checksum, or is it actually validating the frames
> and then merging the resulting checksum?
HW is validating both inner and outer csum; if it wouldn't be able to
due to some limitation, it would not try to pass the packet as a GRO
aggregation but rather as regular seperated packets.
But I don't believe it merges the checksum, only validate each
independently [CSUM_UNNECESSARY].

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 17:16   ` Yuval Mintz
@ 2016-06-22 17:45     ` Alexander Duyck
  2016-06-22 18:22       ` Yuval Mintz
  0 siblings, 1 reply; 36+ messages in thread
From: Alexander Duyck @ 2016-06-22 17:45 UTC (permalink / raw)
  To: Yuval Mintz
  Cc: Manish Chopra, David Miller, netdev, Ariel Elior, Tom Herbert,
	Hannes Frederic Sowa

On Wed, Jun 22, 2016 at 10:16 AM, Yuval Mintz <Yuval.Mintz@qlogic.com> wrote:
>>> This series adds driver support for the processing of tunnel
>>> [specifically vxlan/geneve/gre tunnels] packets which are
>>> aggregated [GROed] by the hardware before driver passes
>>> such packets upto the stack.
>
>> First off I am pretty sure this isn't GRO.  This is LRO.
> Nopes. This is GRO - each MSS-sized frame will arrive on its
> own frag, whereas the headers [both  internal & external]
> would arrive on the linear part of the SKB.

No it is LRO, it just very closely mimics GRO.  If it is in hardware
it is LRO.  GRO is extensible in software and bugs can be fixed in
kernel, LRO is not extensible and any bugs in it are found in hardware
or the driver and would have to be fixed there.  It all comes down to
bug isolation.  If we find that disabling LRO makes the bug go away we
know the hardware aggregation has an issue.  Both features should not
operate on the same bit.  Otherwise when some bug is found in your
implementation it will be blamed on GRO when the bug is actually in
LRO.

I realize this causes some pain when routing or bridging as LRO is
disabled but that is kind of the point.  We don't want the hardware to
be mangling frames when we are attempting to route packets between any
two given devices.

>> Also I don't know if you have been paying attention to recent
>> discussions on the mailing list but the fact is GRO over UDP tunnels
>> is still a subject for debate.  This patch takes things in the
>> opposite direction of where we are currently talking about going with
>> GRO.  I've added Hannes and Tom to this discussion just to make sure I
>> have the proper understanding of all this as my protocol knowledge is
>> somewhat lacking.
> I guess we're on the exact opposite side of the discussion - I.e., we're
> the vendor that tries pushing offload capabilities to the device.
> Do notice however that we're not planning on pushing anything new
> feature-wise, just like we haven't done anything for regular TCP GRO -
> All we do is allow our HW/FW to aggregate packets instead of stack.

If you have been following the list then arguably you didn't fully
understand what has been going on.  I just went through and cleaned up
all the VXLAN, GENEVE, and VXLAN-GPE mess that we have been creating
and tried to get this consolidated so that we could start to have
conversations about this without things being outright rejected.  I
feel like you guys have just prooven the argument for the other side
that said as soon as we start support any of it, the vendors were
going to go nuts and try to stuff everything and the kitchen sink into
the NICs.  The fact is I am starting to think they were right.

>> Ideally we need to be able to identify that a given packet terminates
>> on a local socket in our namespace before we could begin to perform
>> any sort of mangling on the local packets.  It is always possible that
>> we could be looking at a frame that uses the same UDP port but is not
>> the tunnel protocol if we are performing bridging or routing.  The
>> current GRO implementation fails in that regard and there are
>> discussions between several of us on how to deal with that.  It is
>> likely that we would be forcing GRO for tunnels to move a bit further
>> up the stack if bridging or routing so that we could verify that the
>> frame is not being routed back out before we could aggregate it.

> I'm aware of the on-going discussion, but I'm not sure this should
> bother us greatly - the aggregation is done based on the
> inner TCP connection; I.e., it's guaranteed all the frames belong to
> the same TCP connection. While HW requires the UDP port for the
> initial classification effort, it doesn't take decision solely on that.

I realize that UDP port is only one piece of this.  Exactly what
fields are being taken into account.  That information would be useful
as we could then go though and determine what the probability is of us
having a false match between any two packets in a given UDP flow.

Also I would still argue this is LRO.  If we are doing routing it
should be disabled for us to later re-enable if we feel it is safe.
Having a feature like this enabled by default with GRO is a really bad
idea.

>> Also I would be interested in knowing how your hardware handles
>> tunnels with outer checksums.  Is it just ignoring the frames in such
>> a case, ignoring the checksum, or is it actually validating the frames
>> and then merging the resulting checksum?

> HW is validating both inner and outer csum; if it wouldn't be able to
> due to some limitation, it would not try to pass the packet as a GRO
> aggregation but rather as regular seperated packets.
> But I don't believe it merges the checksum, only validate each
> independently [CSUM_UNNECESSARY].

So it is mangling the packets then.  If this were flagged as LRO at
least if bridging or routing is enabled you won't be mangling the
frames without the user having to specifically go back through and
re-enable LRO.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 17:45     ` Alexander Duyck
@ 2016-06-22 18:22       ` Yuval Mintz
  2016-06-22 21:32         ` Alexander Duyck
  2016-06-22 21:52         ` Rick Jones
  0 siblings, 2 replies; 36+ messages in thread
From: Yuval Mintz @ 2016-06-22 18:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Manish Chopra, David Miller, netdev, Ariel Elior, Tom Herbert,
	Hannes Frederic Sowa

>>>> This series adds driver support for the processing of tunnel
>>>> [specifically vxlan/geneve/gre tunnels] packets which are
>>>> aggregated [GROed] by the hardware before driver passes
>>>> such packets upto the stack.
>>> First off I am pretty sure this isn't GRO.  This is LRO.
>> Nopes. This is GRO - each MSS-sized frame will arrive on its
>> own frag, whereas the headers [both  internal & external]
>> would arrive on the linear part of the SKB.

> No it is LRO, it just very closely mimics GRO.  If it is in hardware
> it is LRO.  GRO is extensible in software and bugs can be fixed in
> kernel, LRO is not extensible and any bugs in it are found in hardware
> or the driver and would have to be fixed there.  It all comes down to
> bug isolation.  If we find that disabling LRO makes the bug go away we
> know the hardware aggregation has an issue.  Both features should not
> operate on the same bit.  Otherwise when some bug is found in your
> implementation it will be blamed on GRO when the bug is actually in
> LRO.
I'm not aware of a definition stating GRO *has* to be extensible in SW;
AFAIK the LRO/GRO distinction revolves around multiple areas [criteria
for aggregation, method of aggregation, etc.].

> I realize this causes some pain when routing or bridging as LRO is
> disabled but that is kind of the point.  We don't want the hardware to
> be mangling frames when we are attempting to route packets between any
> two given devices.
Actually, while I might disagree on whether this is LRO/GRO, I don't think
there's any problem for us to base this on the LRO feature - I.e., when we
started working on qede we didn't bother implementing LRO as we understood
it was deprecated, but this encapsulated aggregation is configurable; If it
makes life easier for everyone if we make the configuration based on the LRO
configuration, so that when LRO is disabled we won't have this turned on
it can be easily done.

>>> Also I don't know if you have been paying attention to recent
>>> discussions on the mailing list but the fact is GRO over UDP tunnels
>>> is still a subject for debate.  This patch takes things in the
>>> opposite direction of where we are currently talking about going with
>>> GRO.  I've added Hannes and Tom to this discussion just to make sure I
>>> have the proper understanding of all this as my protocol knowledge is
>>> somewhat lacking.
>> I guess we're on the exact opposite side of the discussion - I.e., we're
>> the vendor that tries pushing offload capabilities to the device.
>> Do notice however that we're not planning on pushing anything new
>> feature-wise, just like we haven't done anything for regular TCP GRO -
>> All we do is allow our HW/FW to aggregate packets instead of stack.

> If you have been following the list then arguably you didn't fully
> understand what has been going on.  I just went through and cleaned up
> all the VXLAN, GENEVE, and VXLAN-GPE mess that we have been creating
> and tried to get this consolidated so that we could start to have
> conversations about this without things being outright rejected.  I
> feel like you guys have just prooven the argument for the other side
> that said as soon as we start support any of it, the vendors were
> going to go nuts and try to stuff everything and the kitchen sink into
> the NICs.  The fact is I am starting to think they were right.
You hurt my feeling; I started going nuts ages ago ;-)
But seriously, this isn't really anything new but rather a step forward in
the direction we've already taken - bnx2x/qede are already performing
the same for non-encapsulated TCP.
And while I understand why you're being suspicious of such an addition,
I don't entirely see how it affects any of that discussion - I already yielded
that we can make this configurable, so if any routing decision would be
added that result in packets NOT supposed to be aggregated, the feature
can be turned off [at worst, at the expanse of it not having its benefit
on other connections].

>>> Ideally we need to be able to identify that a given packet terminates
>>> on a local socket in our namespace before we could begin to perform
>>> any sort of mangling on the local packets.  It is always possible that
>>> we could be looking at a frame that uses the same UDP port but is not
>>> the tunnel protocol if we are performing bridging or routing.  The
>>> current GRO implementation fails in that regard and there are
>>> discussions between several of us on how to deal with that.  It is
>>> likely that we would be forcing GRO for tunnels to move a bit further
>>>> up the stack if bridging or routing so that we could verify that the
>>> frame is not being routed back out before we could aggregate it.

>> I'm aware of the on-going discussion, but I'm not sure this should
>> bother us greatly - the aggregation is done based on the
>> inner TCP connection; I.e., it's guaranteed all the frames belong to
>> the same TCP connection. While HW requires the UDP port for the
>> initial classification effort, it doesn't take decision solely on that.

> I realize that UDP port is only one piece of this.  Exactly what
> fields are being taken into account.  That information would be useful
> as we could then go though and determine what the probability is of us
> having a false match between any two packets in a given UDP flow.
While we can surely do that, I think that's really beside the point;
I.e., I don't expect any of you to debug issues arising from a bad HW/FW
implementation. But if you REALLY want, I can go and ask for exact
details.

> Also I would still argue this is LRO.  If we are doing routing it
> should be disabled for us to later re-enable if we feel it is safe.
> Having a feature like this enabled by default with GRO is a really bad
> idea.
Which is exactly what happens in qede/bnx2x for regular TCP.
[I understand this area is a much hotter potato at the moment then
simple GRO when we've submitted that content for both drivers;
But I won't mention it again - the sins of the fathers are not excuses
for any wrong-doings today].

>>> Also I would be interested in knowing how your hardware handles
>>> tunnels with outer checksums.  Is it just ignoring the frames in such
>>> a case, ignoring the checksum, or is it actually validating the frames
>>> and then merging the resulting checksum?

>> HW is validating both inner and outer csum; if it wouldn't be able to
>> due to some limitation, it would not try to pass the packet as a GRO
>> aggregation but rather as regular seperated packets.
>> But I don't believe it merges the checksum, only validate each
>> independently [CSUM_UNNECESSARY].

>So it is mangling the packets then. 
Obviously - you can't aggregate things without touching them.

> If this were flagged as LRO at
> least if bridging or routing is enabled you won't be mangling the
> frames without the user having to specifically go back through and
> re-enable LRO.

- Alex
    

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 18:22       ` Yuval Mintz
@ 2016-06-22 21:32         ` Alexander Duyck
  2016-06-22 22:32           ` Hannes Frederic Sowa
  2016-06-22 23:42           ` Eric Dumazet
  2016-06-22 21:52         ` Rick Jones
  1 sibling, 2 replies; 36+ messages in thread
From: Alexander Duyck @ 2016-06-22 21:32 UTC (permalink / raw)
  To: Yuval Mintz
  Cc: Manish Chopra, David Miller, netdev, Ariel Elior, Tom Herbert,
	Hannes Frederic Sowa

On Wed, Jun 22, 2016 at 11:22 AM, Yuval Mintz <Yuval.Mintz@qlogic.com> wrote:
>>>>> This series adds driver support for the processing of tunnel
>>>>> [specifically vxlan/geneve/gre tunnels] packets which are
>>>>> aggregated [GROed] by the hardware before driver passes
>>>>> such packets upto the stack.
>>>> First off I am pretty sure this isn't GRO.  This is LRO.
>>> Nopes. This is GRO - each MSS-sized frame will arrive on its
>>> own frag, whereas the headers [both  internal & external]
>>> would arrive on the linear part of the SKB.
>
>> No it is LRO, it just very closely mimics GRO.  If it is in hardware
>> it is LRO.  GRO is extensible in software and bugs can be fixed in
>> kernel, LRO is not extensible and any bugs in it are found in hardware
>> or the driver and would have to be fixed there.  It all comes down to
>> bug isolation.  If we find that disabling LRO makes the bug go away we
>> know the hardware aggregation has an issue.  Both features should not
>> operate on the same bit.  Otherwise when some bug is found in your
>> implementation it will be blamed on GRO when the bug is actually in
>> LRO.
> I'm not aware of a definition stating GRO *has* to be extensible in SW;
> AFAIK the LRO/GRO distinction revolves around multiple areas [criteria
> for aggregation, method of aggregation, etc.].

The idea behind GRO was to make it so that we had a generic way to
handle this in software.  For the most part drivers doing LRO in
software were doing the same thing that the GRO was doing.  The only
reason it was deprecated is because GRO was capable of doing more than
LRO could since we add one parser and suddenly all devices saw the
benefit instead of just one specific device.  It is best to keep those
two distinct solutions and then let the user sort out if they want to
have the aggregation done by the device or the kernel.

>> I realize this causes some pain when routing or bridging as LRO is
>> disabled but that is kind of the point.  We don't want the hardware to
>> be mangling frames when we are attempting to route packets between any
>> two given devices.

> Actually, while I might disagree on whether this is LRO/GRO, I don't think
> there's any problem for us to base this on the LRO feature - I.e., when we
> started working on qede we didn't bother implementing LRO as we understood
> it was deprecated, but this encapsulated aggregation is configurable; If it
> makes life easier for everyone if we make the configuration based on the LRO
> configuration, so that when LRO is disabled we won't have this turned on
> it can be easily done.

If you can move this over to an LRO feature I think it would go a long
way towards cleaning up many of my complaints with this.  LRO really
isn't acceptable for routing or bridging scenarios whereas GRO
sometimes is.  If we place your code under the same limitations as the
LRO feature bit then we can probably look at allowing the tunnel
aggregation to be performed since we can be more certain that the
tunnel will be terminating at the local endpoint.

>>>> Also I don't know if you have been paying attention to recent
>>>> discussions on the mailing list but the fact is GRO over UDP tunnels
>>>> is still a subject for debate.  This patch takes things in the
>>>> opposite direction of where we are currently talking about going with
>>>> GRO.  I've added Hannes and Tom to this discussion just to make sure I
>>>> have the proper understanding of all this as my protocol knowledge is
>>>> somewhat lacking.
>>> I guess we're on the exact opposite side of the discussion - I.e., we're
>>> the vendor that tries pushing offload capabilities to the device.
>>> Do notice however that we're not planning on pushing anything new
>>> feature-wise, just like we haven't done anything for regular TCP GRO -
>>> All we do is allow our HW/FW to aggregate packets instead of stack.
>
>> If you have been following the list then arguably you didn't fully
>> understand what has been going on.  I just went through and cleaned up
>> all the VXLAN, GENEVE, and VXLAN-GPE mess that we have been creating
>> and tried to get this consolidated so that we could start to have
>> conversations about this without things being outright rejected.  I
>> feel like you guys have just prooven the argument for the other side
>> that said as soon as we start support any of it, the vendors were
>> going to go nuts and try to stuff everything and the kitchen sink into
>> the NICs.  The fact is I am starting to think they were right.

> You hurt my feeling; I started going nuts ages ago ;-)

I'm not saying anything about going nuts, I'm just saying you kind of
decided to sprint out into a mine-field where I have been treading the
last week or so.  I'm just wanting to avoid having to see all this
code getting ripped out.

> But seriously, this isn't really anything new but rather a step forward in
> the direction we've already taken - bnx2x/qede are already performing
> the same for non-encapsulated TCP.

Right.  That goes on for other NICs as well.  Most of them use the LRO
feature flag for it tough.  Odds are bnx2x and qede should probably be
updated to do so as well.  I didn't realize we had NICs that were
re-purposing the GRO bit this way.  It would be a recipe for us to run
into issues from the software side as well because if we change
something in the requirements for GRO there is no guarantee that we
could support it in your hardware so that would be another argument
for forking your feature off into LRO on the hardware.

> And while I understand why you're being suspicious of such an addition,
> I don't entirely see how it affects any of that discussion - I already yielded
> that we can make this configurable, so if any routing decision would be
> added that result in packets NOT supposed to be aggregated, the feature
> can be turned off [at worst, at the expanse of it not having its benefit
> on other connections].

It basically just increased the scope of all the VXLAN, GENEVE, and
VXLAN-GPE offloads.  Most NICs were only using this for Rx checksum,
RSS, and a couple unfortunately to identify Tx flows for offload.  We
want to try and keep the scope of the offloads enabled via port
identification to the bare minimum.  Some of the community are still
of the opinion that we shouldn't support them at all and should just
rip them out.

>>>> Ideally we need to be able to identify that a given packet terminates
>>>> on a local socket in our namespace before we could begin to perform
>>>> any sort of mangling on the local packets.  It is always possible that
>>>> we could be looking at a frame that uses the same UDP port but is not
>>>> the tunnel protocol if we are performing bridging or routing.  The
>>>> current GRO implementation fails in that regard and there are
>>>> discussions between several of us on how to deal with that.  It is
>>>> likely that we would be forcing GRO for tunnels to move a bit further
>>>>> up the stack if bridging or routing so that we could verify that the
>>>> frame is not being routed back out before we could aggregate it.
>
>>> I'm aware of the on-going discussion, but I'm not sure this should
>>> bother us greatly - the aggregation is done based on the
>>> inner TCP connection; I.e., it's guaranteed all the frames belong to
>>> the same TCP connection. While HW requires the UDP port for the
>>> initial classification effort, it doesn't take decision solely on that.
>
>> I realize that UDP port is only one piece of this.  Exactly what
>> fields are being taken into account.  That information would be useful
>> as we could then go though and determine what the probability is of us
>> having a false match between any two packets in a given UDP flow.

> While we can surely do that, I think that's really beside the point;
> I.e., I don't expect any of you to debug issues arising from a bad HW/FW
> implementation. But if you REALLY want, I can go and ask for exact
> details.

I don't need the exact details but it is something that you should
probably look into from your side.  I can tell you that I have seen a
number of firmware bugs from various NICs over the last year or two.
Debugging them can be a real pain when features are running in the
background that you aren't aware of.  That is another reason for
wanting LRO and GRO separated so that I know I am dealing with
hardware versus software without having to know the details of your
NIC.

>> Also I would still argue this is LRO.  If we are doing routing it
>> should be disabled for us to later re-enable if we feel it is safe.
>> Having a feature like this enabled by default with GRO is a really bad
>> idea.

> Which is exactly what happens in qede/bnx2x for regular TCP.

I'm not sure I follow you here.  Are you saying this feature is
disabled by default or that it is supposed to be disabled when routing
or bridging?  From what I can tell it looks like what you are saying
is true for the bnx2x as it appears to be using the NETIF_F_LRO bit
based on a quick grep, but I don't see this bit used anywhere in qede.

> [I understand this area is a much hotter potato at the moment then
> simple GRO when we've submitted that content for both drivers;
> But I won't mention it again - the sins of the fathers are not excuses
> for any wrong-doings today].

Yeah, I don't really care about if this is how it was before.
Basically I just want it to be configured in such a way that somebody
doesn't have to have proprietary knowledge about the driver in order
to debug it if we end up finding a firmware bug.  I should be able to
clear the LRO bit and have hardware aggregation stop and hand
everything off to the kernel for aggregation so I can determine if the
issue is the kernel or the device/driver.

>>>> Also I would be interested in knowing how your hardware handles
>>>> tunnels with outer checksums.  Is it just ignoring the frames in such
>>>> a case, ignoring the checksum, or is it actually validating the frames
>>>> and then merging the resulting checksum?
>
>>> HW is validating both inner and outer csum; if it wouldn't be able to
>>> due to some limitation, it would not try to pass the packet as a GRO
>>> aggregation but rather as regular seperated packets.
>>> But I don't believe it merges the checksum, only validate each
>>> independently [CSUM_UNNECESSARY].
>
>>So it is mangling the packets then.
> Obviously - you can't aggregate things without touching them.

Right.  Just making the point for those that might not be following
the conversation since we are on a public list.

Still I would have probably preferred to have the checksum repopulated
after aggregation, but the fact that it isn't is not really all that
big a deal.  It just is painful for those what would prefer that we
were using CHECKSUM_COMPLETE for all these tunnels.

I hope this helps to clarify where I stand on this.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 18:22       ` Yuval Mintz
  2016-06-22 21:32         ` Alexander Duyck
@ 2016-06-22 21:52         ` Rick Jones
  2016-06-22 22:47           ` Eric Dumazet
  2016-06-26 19:53           ` David Miller
  1 sibling, 2 replies; 36+ messages in thread
From: Rick Jones @ 2016-06-22 21:52 UTC (permalink / raw)
  To: Yuval Mintz, Alexander Duyck
  Cc: Manish Chopra, David Miller, netdev, Ariel Elior, Tom Herbert,
	Hannes Frederic Sowa

On 06/22/2016 11:22 AM, Yuval Mintz wrote:
> But seriously, this isn't really anything new but rather a step forward in
> the direction we've already taken - bnx2x/qede are already performing
> the same for non-encapsulated TCP.

Since you mention bnx2x...   I would argue that the NIC firmware on 
those NICs driven by bnx2x is doing it badly.  Not so much from a 
functional standpoint I suppose, but from a performance one.  The 
NIC-firmware GRO done there has this rather unfortunate assumption about 
"all MSSes will be directly driven by my own physical MTU" and when it 
sees segments of a size other than would be suggested by the physical 
MTU, will coalesce only two segments together.  They then do not get 
further coalesced in the stack.

Suffice it to say this does not do well from a performance standpoint.

One can disable LRO via ethtool for these NICs, but what that does is 
disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one 
must also get the bnx2x module loaded with "disable-tpa=1" so the Linux 
stack GRO gets used instead.

Had the bnx2x-driven NICs' firmware not had that rather unfortunate 
assumption about MSSes I probably would never have noticed.

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 21:32         ` Alexander Duyck
@ 2016-06-22 22:32           ` Hannes Frederic Sowa
  2016-06-22 23:42           ` Eric Dumazet
  1 sibling, 0 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2016-06-22 22:32 UTC (permalink / raw)
  To: Alexander Duyck, Yuval Mintz
  Cc: Manish Chopra, David Miller, netdev, Ariel Elior, Tom Herbert,
	Hannes Frederic Sowa

On 22.06.2016 14:32, Alexander Duyck wrote:
> On Wed, Jun 22, 2016 at 11:22 AM, Yuval Mintz <Yuval.Mintz@qlogic.com> wrote:
>>>>>> This series adds driver support for the processing of tunnel
>>>>>> [specifically vxlan/geneve/gre tunnels] packets which are
>>>>>> aggregated [GROed] by the hardware before driver passes
>>>>>> such packets upto the stack.
>>>>> First off I am pretty sure this isn't GRO.  This is LRO.
>>>> Nopes. This is GRO - each MSS-sized frame will arrive on its
>>>> own frag, whereas the headers [both  internal & external]
>>>> would arrive on the linear part of the SKB.
>>
>>> No it is LRO, it just very closely mimics GRO.  If it is in hardware
>>> it is LRO.  GRO is extensible in software and bugs can be fixed in
>>> kernel, LRO is not extensible and any bugs in it are found in hardware
>>> or the driver and would have to be fixed there.  It all comes down to
>>> bug isolation.  If we find that disabling LRO makes the bug go away we
>>> know the hardware aggregation has an issue.  Both features should not
>>> operate on the same bit.  Otherwise when some bug is found in your
>>> implementation it will be blamed on GRO when the bug is actually in
>>> LRO.
>> I'm not aware of a definition stating GRO *has* to be extensible in SW;
>> AFAIK the LRO/GRO distinction revolves around multiple areas [criteria
>> for aggregation, method of aggregation, etc.].
> 
> The idea behind GRO was to make it so that we had a generic way to
> handle this in software.  For the most part drivers doing LRO in
> software were doing the same thing that the GRO was doing.  The only
> reason it was deprecated is because GRO was capable of doing more than
> LRO could since we add one parser and suddenly all devices saw the
> benefit instead of just one specific device.  It is best to keep those
> two distinct solutions and then let the user sort out if they want to
> have the aggregation done by the device or the kernel.

The most important point while talking about GRO is that it is
symmetrical. The GSO functions need to be able to completely undo the
effect GRO did.

LRO is, for example, completely disabled for all devices in the
namespace as soon as forwarding is enabled because segmentation isn't
available for LRO. This lead to the definition of GRO and GSO as far as
I was told. Probably I would also personally take this as the definition
of GRO. Things like maximum segment size, flags, extensions and
everything must be aggregated and metadata be stored, so that GSO can
exactly build the original fragments (or some reasonable of the originals).

Again, LRO does not provide this, thus we have to disable it on all
devices that participate in forwarding. AFAIK one problem we constantly
saw is that the MSS was not correctly stored in the meta data and thus
the segmentation didn't create same-sized packets.

There are other small rules enforced nowadays. If we upgrade GSO side we
need to be able to modify GRO as well, vice versa.

The problem with a hardware implementation is that we might not be able
to modify core kernel code anymore, because hardware got into this
contract on how it implements GRO, so I agree with Alex, it would better
be kept a software only implementation.

>>> I realize this causes some pain when routing or bridging as LRO is
>>> disabled but that is kind of the point.  We don't want the hardware to
>>> be mangling frames when we are attempting to route packets between any
>>> two given devices.
> 
>> Actually, while I might disagree on whether this is LRO/GRO, I don't think
>> there's any problem for us to base this on the LRO feature - I.e., when we
>> started working on qede we didn't bother implementing LRO as we understood
>> it was deprecated, but this encapsulated aggregation is configurable; If it
>> makes life easier for everyone if we make the configuration based on the LRO
>> configuration, so that when LRO is disabled we won't have this turned on
>> it can be easily done.
> 
> If you can move this over to an LRO feature I think it would go a long
> way towards cleaning up many of my complaints with this.  LRO really
> isn't acceptable for routing or bridging scenarios whereas GRO
> sometimes is.  If we place your code under the same limitations as the
> LRO feature bit then we can probably look at allowing the tunnel
> aggregation to be performed since we can be more certain that the
> tunnel will be terminating at the local endpoint.
> 
>>>>> Also I don't know if you have been paying attention to recent
>>>>> discussions on the mailing list but the fact is GRO over UDP tunnels
>>>>> is still a subject for debate.  This patch takes things in the
>>>>> opposite direction of where we are currently talking about going with
>>>>> GRO.  I've added Hannes and Tom to this discussion just to make sure I
>>>>> have the proper understanding of all this as my protocol knowledge is
>>>>> somewhat lacking.
>>>> I guess we're on the exact opposite side of the discussion - I.e., we're
>>>> the vendor that tries pushing offload capabilities to the device.
>>>> Do notice however that we're not planning on pushing anything new
>>>> feature-wise, just like we haven't done anything for regular TCP GRO -
>>>> All we do is allow our HW/FW to aggregate packets instead of stack.
>>
>>> If you have been following the list then arguably you didn't fully
>>> understand what has been going on.  I just went through and cleaned up
>>> all the VXLAN, GENEVE, and VXLAN-GPE mess that we have been creating
>>> and tried to get this consolidated so that we could start to have
>>> conversations about this without things being outright rejected.  I
>>> feel like you guys have just prooven the argument for the other side
>>> that said as soon as we start support any of it, the vendors were
>>> going to go nuts and try to stuff everything and the kitchen sink into
>>> the NICs.  The fact is I am starting to think they were right.
> 
>> You hurt my feeling; I started going nuts ages ago ;-)
> 
> I'm not saying anything about going nuts, I'm just saying you kind of
> decided to sprint out into a mine-field where I have been treading the
> last week or so.  I'm just wanting to avoid having to see all this
> code getting ripped out.
> 
>> But seriously, this isn't really anything new but rather a step forward in
>> the direction we've already taken - bnx2x/qede are already performing
>> the same for non-encapsulated TCP.
> 
> Right.  That goes on for other NICs as well.  Most of them use the LRO
> feature flag for it tough.  Odds are bnx2x and qede should probably be
> updated to do so as well.  I didn't realize we had NICs that were
> re-purposing the GRO bit this way.  It would be a recipe for us to run
> into issues from the software side as well because if we change
> something in the requirements for GRO there is no guarantee that we
> could support it in your hardware so that would be another argument
> for forking your feature off into LRO on the hardware.

Yes, I am surprised to see this as well. My intuition was that this is
only a in-kernel feature.

>> And while I understand why you're being suspicious of such an addition,
>> I don't entirely see how it affects any of that discussion - I already yielded
>> that we can make this configurable, so if any routing decision would be
>> added that result in packets NOT supposed to be aggregated, the feature
>> can be turned off [at worst, at the expanse of it not having its benefit
>> on other connections].
> 
> It basically just increased the scope of all the VXLAN, GENEVE, and
> VXLAN-GPE offloads.  Most NICs were only using this for Rx checksum,
> RSS, and a couple unfortunately to identify Tx flows for offload.  We
> want to try and keep the scope of the offloads enabled via port
> identification to the bare minimum.  Some of the community are still
> of the opinion that we shouldn't support them at all and should just
> rip them out.
> 
>>>>> Ideally we need to be able to identify that a given packet terminates
>>>>> on a local socket in our namespace before we could begin to perform
>>>>> any sort of mangling on the local packets.  It is always possible that
>>>>> we could be looking at a frame that uses the same UDP port but is not
>>>>> the tunnel protocol if we are performing bridging or routing.  The
>>>>> current GRO implementation fails in that regard and there are
>>>>> discussions between several of us on how to deal with that.  It is
>>>>> likely that we would be forcing GRO for tunnels to move a bit further
>>>>>> up the stack if bridging or routing so that we could verify that the
>>>>> frame is not being routed back out before we could aggregate it.
>>
>>>> I'm aware of the on-going discussion, but I'm not sure this should
>>>> bother us greatly - the aggregation is done based on the
>>>> inner TCP connection; I.e., it's guaranteed all the frames belong to
>>>> the same TCP connection. While HW requires the UDP port for the
>>>> initial classification effort, it doesn't take decision solely on that.

Is it? If you just have a UDP port number you can also interpret some
random data as TCP header. The chances are pretty small that you find
seq numbers that match quite nicely but you are simply not allowed to
assume that. The contract what data is inside a UDP packet is also based
on the machines talking with each other.

I don't think you get the destination address for the tunnel endpoint
from the kernel right now, because Alex' last series doesn't support
that, yet.

>>> I realize that UDP port is only one piece of this.  Exactly what
>>> fields are being taken into account.  That information would be useful
>>> as we could then go though and determine what the probability is of us
>>> having a false match between any two packets in a given UDP flow.
> 
>> While we can surely do that, I think that's really beside the point;
>> I.e., I don't expect any of you to debug issues arising from a bad HW/FW
>> implementation. But if you REALLY want, I can go and ask for exact
>> details.
> 
> I don't need the exact details but it is something that you should
> probably look into from your side.  I can tell you that I have seen a
> number of firmware bugs from various NICs over the last year or two.
> Debugging them can be a real pain when features are running in the
> background that you aren't aware of.  That is another reason for
> wanting LRO and GRO separated so that I know I am dealing with
> hardware versus software without having to know the details of your
> NIC.
> 
>>> Also I would still argue this is LRO.  If we are doing routing it
>>> should be disabled for us to later re-enable if we feel it is safe.
>>> Having a feature like this enabled by default with GRO is a really bad
>>> idea.
> 
>> Which is exactly what happens in qede/bnx2x for regular TCP.
> 
> I'm not sure I follow you here.  Are you saying this feature is
> disabled by default or that it is supposed to be disabled when routing
> or bridging?  From what I can tell it looks like what you are saying
> is true for the bnx2x as it appears to be using the NETIF_F_LRO bit
> based on a quick grep, but I don't see this bit used anywhere in qede.

If qede/bnx2x use LRO feature bit already it will automatically be
disabled as soon as the device enters a bridge/ovs constellation or if
forwarding, bonding is enabled. This is absolutely necessary to keep the
GRO<->GSO contract in place.

>> [I understand this area is a much hotter potato at the moment then
>> simple GRO when we've submitted that content for both drivers;
>> But I won't mention it again - the sins of the fathers are not excuses
>> for any wrong-doings today].
> 
> Yeah, I don't really care about if this is how it was before.
> Basically I just want it to be configured in such a way that somebody
> doesn't have to have proprietary knowledge about the driver in order
> to debug it if we end up finding a firmware bug.  I should be able to
> clear the LRO bit and have hardware aggregation stop and hand
> everything off to the kernel for aggregation so I can determine if the
> issue is the kernel or the device/driver.

Also ack!

>>>>> Also I would be interested in knowing how your hardware handles
>>>>> tunnels with outer checksums.  Is it just ignoring the frames in such
>>>>> a case, ignoring the checksum, or is it actually validating the frames
>>>>> and then merging the resulting checksum?
>>
>>>> HW is validating both inner and outer csum; if it wouldn't be able to
>>>> due to some limitation, it would not try to pass the packet as a GRO
>>>> aggregation but rather as regular seperated packets.
>>>> But I don't believe it merges the checksum, only validate each
>>>> independently [CSUM_UNNECESSARY].
>>
>>> So it is mangling the packets then.
>> Obviously - you can't aggregate things without touching them.

We pretty much try to keep the differences in meta data, so we can
restore everything to the previous situation. This sometimes doesn't
work completely, like e.g. different size mss or DF bits. But it is in
kernel control.

> Right.  Just making the point for those that might not be following
> the conversation since we are on a public list.
>
> Still I would have probably preferred to have the checksum repopulated
> after aggregation, but the fact that it isn't is not really all that
> big a deal.  It just is painful for those what would prefer that we
> were using CHECKSUM_COMPLETE for all these tunnels.
> 
> I hope this helps to clarify where I stand on this.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 21:52         ` Rick Jones
@ 2016-06-22 22:47           ` Eric Dumazet
  2016-06-22 22:56             ` Alexander Duyck
  2016-06-22 23:10             ` Rick Jones
  2016-06-26 19:53           ` David Miller
  1 sibling, 2 replies; 36+ messages in thread
From: Eric Dumazet @ 2016-06-22 22:47 UTC (permalink / raw)
  To: Rick Jones
  Cc: Yuval Mintz, Alexander Duyck, Manish Chopra, David Miller,
	netdev, Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:
> On 06/22/2016 11:22 AM, Yuval Mintz wrote:
> > But seriously, this isn't really anything new but rather a step forward in
> > the direction we've already taken - bnx2x/qede are already performing
> > the same for non-encapsulated TCP.
> 
> Since you mention bnx2x...   I would argue that the NIC firmware on 
> those NICs driven by bnx2x is doing it badly.  Not so much from a 
> functional standpoint I suppose, but from a performance one.  The 
> NIC-firmware GRO done there has this rather unfortunate assumption about 
> "all MSSes will be directly driven by my own physical MTU" and when it 
> sees segments of a size other than would be suggested by the physical 
> MTU, will coalesce only two segments together.  They then do not get 
> further coalesced in the stack.
> 
> Suffice it to say this does not do well from a performance standpoint.
> 
> One can disable LRO via ethtool for these NICs, but what that does is 
> disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one 
> must also get the bnx2x module loaded with "disable-tpa=1" so the Linux 
> stack GRO gets used instead.
> 
> Had the bnx2x-driven NICs' firmware not had that rather unfortunate 
> assumption about MSSes I probably would never have noticed.

I do not see this behavior on my bnx2x nics ?

ip ro add 10.246.11.52 via 10.246.11.254 dev eth0 mtu 1000
lpk51:~# ./netperf -H 10.246.11.52 -l 1000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
10.246.11.52 () port 0 AF_INET


On receiver :

15:46:08.296241 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 303360, win 8192, options [nop,nop,TS val 1245217243 ecr
1245306446], length 0
15:46:08.296430 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 303360:327060, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296441 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 327060, win 8192, options [nop,nop,TS val 1245217243 ecr
1245306446], length 0
15:46:08.296644 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 327060:350760, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296655 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 350760, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.296854 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 350760:374460, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296897 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 374460, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.297054 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 374460:398160, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.297099 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 398160, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.297258 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 398160:420912, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 22752
15:46:08.297301 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 420912, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 22:47           ` Eric Dumazet
@ 2016-06-22 22:56             ` Alexander Duyck
  2016-06-22 23:31               ` Eric Dumazet
  2016-06-22 23:52               ` Rick Jones
  2016-06-22 23:10             ` Rick Jones
  1 sibling, 2 replies; 36+ messages in thread
From: Alexander Duyck @ 2016-06-22 22:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, Yuval Mintz, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On Wed, Jun 22, 2016 at 3:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:
>> On 06/22/2016 11:22 AM, Yuval Mintz wrote:
>> > But seriously, this isn't really anything new but rather a step forward in
>> > the direction we've already taken - bnx2x/qede are already performing
>> > the same for non-encapsulated TCP.
>>
>> Since you mention bnx2x...   I would argue that the NIC firmware on
>> those NICs driven by bnx2x is doing it badly.  Not so much from a
>> functional standpoint I suppose, but from a performance one.  The
>> NIC-firmware GRO done there has this rather unfortunate assumption about
>> "all MSSes will be directly driven by my own physical MTU" and when it
>> sees segments of a size other than would be suggested by the physical
>> MTU, will coalesce only two segments together.  They then do not get
>> further coalesced in the stack.
>>
>> Suffice it to say this does not do well from a performance standpoint.
>>
>> One can disable LRO via ethtool for these NICs, but what that does is
>> disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one
>> must also get the bnx2x module loaded with "disable-tpa=1" so the Linux
>> stack GRO gets used instead.
>>
>> Had the bnx2x-driven NICs' firmware not had that rather unfortunate
>> assumption about MSSes I probably would never have noticed.
>
> I do not see this behavior on my bnx2x nics ?

It could be that you and Rick are running different firmware. I
believe you can expose that via "ethtool -i".  This is the ugly bit
about all this.  We are offloading GRO into the firmware of these
devices with no idea how any of it works and by linking GRO to LRO on
the same device you are stuck having to accept either the firmware
offload or nothing at all.  That is kind of the point Rick was trying
to get at.

The preferred setup would be to have any aggregation offload on the
device be flagged as LRO and then the offload performed by the kernel
be GRO.  By linking the two features as has apparently happened with
bnx2x and qede it limits the users options and kind of forces them
into an all or nothing situation unless they have insight into
proprietary driver options to disable these features.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 22:47           ` Eric Dumazet
  2016-06-22 22:56             ` Alexander Duyck
@ 2016-06-22 23:10             ` Rick Jones
  2016-06-23  0:48               ` Rick Jones
  1 sibling, 1 reply; 36+ messages in thread
From: Rick Jones @ 2016-06-22 23:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yuval Mintz, Alexander Duyck, Manish Chopra, David Miller,
	netdev, Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On 06/22/2016 03:47 PM, Eric Dumazet wrote:
> On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:
>> On 06/22/2016 11:22 AM, Yuval Mintz wrote:
>>> But seriously, this isn't really anything new but rather a step forward in
>>> the direction we've already taken - bnx2x/qede are already performing
>>> the same for non-encapsulated TCP.
>>
>> Since you mention bnx2x...   I would argue that the NIC firmware on
>> those NICs driven by bnx2x is doing it badly.  Not so much from a
>> functional standpoint I suppose, but from a performance one.  The
>> NIC-firmware GRO done there has this rather unfortunate assumption about
>> "all MSSes will be directly driven by my own physical MTU" and when it
>> sees segments of a size other than would be suggested by the physical
>> MTU, will coalesce only two segments together.  They then do not get
>> further coalesced in the stack.
>>
>> Suffice it to say this does not do well from a performance standpoint.
>>
>> One can disable LRO via ethtool for these NICs, but what that does is
>> disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one
>> must also get the bnx2x module loaded with "disable-tpa=1" so the Linux
>> stack GRO gets used instead.
>>
>> Had the bnx2x-driven NICs' firmware not had that rather unfortunate
>> assumption about MSSes I probably would never have noticed.
>
> I do not see this behavior on my bnx2x nics ?
>
> ip ro add 10.246.11.52 via 10.246.11.254 dev eth0 mtu 1000
> lpk51:~# ./netperf -H 10.246.11.52 -l 1000
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 10.246.11.52 () port 0 AF_INET

I first saw this with VMs which themselves had 1400 byte MTUs on their 
vNICs, speaking though bnx2x-driven NICs with a 1500 byte MTU, but I did 
later reproduce it by tweaking the MTU of my sending side NIC to 
something like 1400 bytes and running a "bare iron" netperf.  I believe 
you may be able to achieve the same thing by having netperf set a 
smaller MSS via the test-specific -G option.

My systems are presently in the midst of an install but I should be able 
to demonstrate it in the morning (US Pacific time, modulo the shuttle 
service of a car repair place)

> On receiver :

Paranoid question, but is LRO disabled on the receiver?  I don't know 
that LRO exhibits the behaviour, just GRO-in-the-NIC.

rick

>
> 15:46:08.296241 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
> ack 303360, win 8192, options [nop,nop,TS val 1245217243 ecr
> 1245306446], length 0
> 15:46:08.296430 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
> seq 303360:327060, ack 1, win 229, options [nop,nop,TS val 1245306446
> ecr 1245217242], length 23700
> 15:46:08.296441 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
> ack 327060, win 8192, options [nop,nop,TS val 1245217243 ecr
> 1245306446], length 0
> 15:46:08.296644 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
> seq 327060:350760, ack 1, win 229, options [nop,nop,TS val 1245306446
> ecr 1245217242], length 23700
> 15:46:08.296655 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
> ack 350760, win 8192, options [nop,nop,TS val 1245217244 ecr
> 1245306446], length 0
> 15:46:08.296854 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
> seq 350760:374460, ack 1, win 229, options [nop,nop,TS val 1245306446
> ecr 1245217242], length 23700
> 15:46:08.296897 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
> ack 374460, win 8192, options [nop,nop,TS val 1245217244 ecr
> 1245306446], length 0
> 15:46:08.297054 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
> seq 374460:398160, ack 1, win 229, options [nop,nop,TS val 1245306446
> ecr 1245217242], length 23700
> 15:46:08.297099 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
> ack 398160, win 8192, options [nop,nop,TS val 1245217244 ecr
> 1245306446], length 0
> 15:46:08.297258 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
> seq 398160:420912, ack 1, win 229, options [nop,nop,TS val 1245306446
> ecr 1245217242], length 22752
> 15:46:08.297301 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
> ack 420912, win 8192, options [nop,nop,TS val 1245217244 ecr
> 1245306446], length 0
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 22:56             ` Alexander Duyck
@ 2016-06-22 23:31               ` Eric Dumazet
  2016-06-22 23:59                 ` Tom Herbert
  2016-06-23  0:11                 ` Alexander Duyck
  2016-06-22 23:52               ` Rick Jones
  1 sibling, 2 replies; 36+ messages in thread
From: Eric Dumazet @ 2016-06-22 23:31 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Rick Jones, Yuval Mintz, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On Wed, 2016-06-22 at 15:56 -0700, Alexander Duyck wrote:

> It could be that you and Rick are running different firmware. I
> believe you can expose that via "ethtool -i".  This is the ugly bit
> about all this.  We are offloading GRO into the firmware of these
> devices with no idea how any of it works and by linking GRO to LRO on
> the same device you are stuck having to accept either the firmware
> offload or nothing at all.  That is kind of the point Rick was trying
> to get at.


Well, we offload TSO to the NIC. Or checksums as much as we can.

At one point we have to trust the NIC we plug in our hosts, right ?

Why RX is so different than TX exactly ?

Yes, LRO is hard to implement properly compared to TSO,
but not something that is _fundamentally_ broken.

Claiming that hardware assist GRO is not possible is a plain mantra.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 21:32         ` Alexander Duyck
  2016-06-22 22:32           ` Hannes Frederic Sowa
@ 2016-06-22 23:42           ` Eric Dumazet
  1 sibling, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2016-06-22 23:42 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yuval Mintz, Manish Chopra, David Miller, netdev, Ariel Elior,
	Tom Herbert, Hannes Frederic Sowa

On Wed, 2016-06-22 at 14:32 -0700, Alexander Duyck wrote:
	
> The idea behind GRO was to make it so that we had a generic way to
> handle this in software.  For the most part drivers doing LRO in
> software were doing the same thing that the GRO was doing.  The only
> reason it was deprecated is because GRO was capable of doing more than
> LRO could since we add one parser and suddenly all devices saw the
> benefit instead of just one specific device.  It is best to keep those
> two distinct solutions and then let the user sort out if they want to
> have the aggregation done by the device or the kernel.

Presumably we could add features flags to selectively enable part of LRO
(really simply GRO offloading) for NIC that partially match GRO
requirements.

Patch 5/5 seems to enable the hardware feature(s) :

+               p_ramrod->tpa_param.tpa_ipv4_tunn_en_flg = 1;
+               p_ramrod->tpa_param.tpa_ipv6_tunn_en_flg = 1;

So this NIC seems to have a way to control its LRO engine.

If some horrible bug happens for GRE+IPv6+TCP, you could disable LRO
coping with GRE encapsulation, instead of disabling whole LRO

We have software fallback. Nice, with quite heavy cpu cost.

If we can use offload without breaking rules, use it.

Some NIC have terrible LRO performance (I wont give details here),
but others are ok.

Some NIC have terrible TSO performance for small number of segments.
(gso_segs < 4)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 22:56             ` Alexander Duyck
  2016-06-22 23:31               ` Eric Dumazet
@ 2016-06-22 23:52               ` Rick Jones
  2016-06-23  0:18                 ` Alexander Duyck
  1 sibling, 1 reply; 36+ messages in thread
From: Rick Jones @ 2016-06-22 23:52 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet
  Cc: Yuval Mintz, Manish Chopra, David Miller, netdev, Ariel Elior,
	Tom Herbert, Hannes Frederic Sowa

On 06/22/2016 03:56 PM, Alexander Duyck wrote:
> On Wed, Jun 22, 2016 at 3:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:
>>> Had the bnx2x-driven NICs' firmware not had that rather unfortunate
>>> assumption about MSSes I probably would never have noticed.


> It could be that you and Rick are running different firmware. I
> believe you can expose that via "ethtool -i".  This is the ugly bit
> about all this.  We are offloading GRO into the firmware of these
> devices with no idea how any of it works and by linking GRO to LRO on
> the same device you are stuck having to accept either the firmware
> offload or nothing at all.  That is kind of the point Rick was trying
> to get at.

I think you are typing a bit too far ahead into my keyboard with that 
last sentence.  And I may not have been sufficiently complete in what I 
wrote.  If the bnx2x-driven NICs' firmware had been coalescing more than 
two segments together, not only would I probably not have noticed, I 
probably would not have been upset to learn it was NIC-firmware GRO 
rather than stack.

My complaint is the specific bug of coalescing only two segments when 
their size is unexpected, and the difficulty present in disabling the 
bnx2x-driven NICs' firmware GRO.  I don't have a problem necessarily 
with the existence of NIC-firmware GRO in general.  I just want to be 
able to enable/disable it easily.

rick jones

Of course, what I really want are much, Much, MUCH larger MTUs.  It 
isn't for nothing that I used to refer to TSO as "Poor man's Jumbo 
Frames" :)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 23:31               ` Eric Dumazet
@ 2016-06-22 23:59                 ` Tom Herbert
  2016-06-23  0:11                 ` Alexander Duyck
  1 sibling, 0 replies; 36+ messages in thread
From: Tom Herbert @ 2016-06-22 23:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Rick Jones, Yuval Mintz, Manish Chopra,
	David Miller, netdev, Ariel Elior, Hannes Frederic Sowa

On Wed, Jun 22, 2016 at 4:31 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-06-22 at 15:56 -0700, Alexander Duyck wrote:
>
>> It could be that you and Rick are running different firmware. I
>> believe you can expose that via "ethtool -i".  This is the ugly bit
>> about all this.  We are offloading GRO into the firmware of these
>> devices with no idea how any of it works and by linking GRO to LRO on
>> the same device you are stuck having to accept either the firmware
>> offload or nothing at all.  That is kind of the point Rick was trying
>> to get at.
>
>
> Well, we offload TSO to the NIC. Or checksums as much as we can.
>
> At one point we have to trust the NIC we plug in our hosts, right ?
>
> Why RX is so different than TX exactly ?
>
LRO requires parsing the packet and hence DPI for things like foo/UDP.
TSO is much more straightforward and does not rely on parsing (e.g.
LRO would require NIC to parse EH, TSO doesn't). Also, TSO is
work-conserving, whereas LRO is not. I don't believe there is any
standard for how NICs are supposed to determined when to set packets
to host. All of this makes LRO much less deterministic and means
there's a lot of variance between NIC implementations. Plus there's
also the question of whether a vendor is testing IPv6. I do think all
this can be "fixed" once we have programmable NICs and we can program
our own implementation of LRO, but until then I think there is
inherent risk in using it. GRO rocks though!

Tom

> Yes, LRO is hard to implement properly compared to TSO,
> but not something that is _fundamentally_ broken.
>
> Claiming that hardware assist GRO is not possible is a plain mantra.
>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 23:31               ` Eric Dumazet
  2016-06-22 23:59                 ` Tom Herbert
@ 2016-06-23  0:11                 ` Alexander Duyck
  2016-06-23  4:10                   ` Yuval Mintz
  1 sibling, 1 reply; 36+ messages in thread
From: Alexander Duyck @ 2016-06-23  0:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, Yuval Mintz, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On Wed, Jun 22, 2016 at 4:31 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-06-22 at 15:56 -0700, Alexander Duyck wrote:
>
>> It could be that you and Rick are running different firmware. I
>> believe you can expose that via "ethtool -i".  This is the ugly bit
>> about all this.  We are offloading GRO into the firmware of these
>> devices with no idea how any of it works and by linking GRO to LRO on
>> the same device you are stuck having to accept either the firmware
>> offload or nothing at all.  That is kind of the point Rick was trying
>> to get at.
>
>
> Well, we offload TSO to the NIC. Or checksums as much as we can.
>
> At one point we have to trust the NIC we plug in our hosts, right ?
>
> Why RX is so different than TX exactly ?

Exactly my point.  The problem I have with the bnx2x and qede drivers
is that the NIC driver is using the GRO flag to enable LRO on the
device. It is like arguing that the NIC can do segmentation based on
the GSO flag instead of the TSO flag.  We should be keeping TSO/LRO as
the device features while GSO/GRO are kernel features that don't
directly change any settings on the device.

> Yes, LRO is hard to implement properly compared to TSO,
> but not something that is _fundamentally_ broken.

Yes and no.  More often then not there are issues that pop up as
protocols change.  I'm not assuming they are fundamentally broken.
The bit that I consider broken is that I cannot disable the NIC LRO
functionality without having to disable the stack from doing GRO.  All
I am asking for is LRO controls all of the NIC aggregation features
and GRO be kept a software only feature instead of being tied to a
hardware feature.

> Claiming that hardware assist GRO is not possible is a plain mantra.

I have no issue claiming hardware assist GRO is possible.  My problem
is saying that the GRO feature flag can be used to enable it.  I would
argue that any packet aggregation at the device or driver level is LRO
regardless of how close it comes to GRO feature wise.  GRO should only
be occurring in the receive path after calling the appropriate GRO
function.  Otherwise it becomes really difficult to work around any
possible issues that the hardware assisted GRO introduces without
paying a huge penalty.  We need to keep these feature flags separate.
I thought that was the whole reason why we have the distinction
between LRO and GRO in the first place.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 23:52               ` Rick Jones
@ 2016-06-23  0:18                 ` Alexander Duyck
  0 siblings, 0 replies; 36+ messages in thread
From: Alexander Duyck @ 2016-06-23  0:18 UTC (permalink / raw)
  To: Rick Jones
  Cc: Eric Dumazet, Yuval Mintz, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On Wed, Jun 22, 2016 at 4:52 PM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 06/22/2016 03:56 PM, Alexander Duyck wrote:
>>
>> On Wed, Jun 22, 2016 at 3:47 PM, Eric Dumazet <eric.dumazet@gmail.com>
>> wrote:
>>>
>>> On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:
>>>>
>>>> Had the bnx2x-driven NICs' firmware not had that rather unfortunate
>>>> assumption about MSSes I probably would never have noticed.
>
>
>
>> It could be that you and Rick are running different firmware. I
>> believe you can expose that via "ethtool -i".  This is the ugly bit
>> about all this.  We are offloading GRO into the firmware of these
>> devices with no idea how any of it works and by linking GRO to LRO on
>> the same device you are stuck having to accept either the firmware
>> offload or nothing at all.  That is kind of the point Rick was trying
>> to get at.
>
>
> I think you are typing a bit too far ahead into my keyboard with that last
> sentence.  And I may not have been sufficiently complete in what I wrote.
> If the bnx2x-driven NICs' firmware had been coalescing more than two
> segments together, not only would I probably not have noticed, I probably
> would not have been upset to learn it was NIC-firmware GRO rather than
> stack.
>
> My complaint is the specific bug of coalescing only two segments when their
> size is unexpected, and the difficulty present in disabling the bnx2x-driven
> NICs' firmware GRO.  I don't have a problem necessarily with the existence
> of NIC-firmware GRO in general.  I just want to be able to enable/disable it
> easily.

Right.  Which is why I thought we were supposed to be using the LRO
flag when a NIC is performing the GRO instead of it being done by the
kernel.

I have no problem with us doing hardware GRO.  The only issue I have
is with us identifying it as GRO.

In my opinion the way I would have preferred to have seen the bnx2x
handle all this is to make it that the NETIF_F_LRO flag controlled if
the NIC performed aggregation or not and the module parameter
determine if the LRO conforms to the GRO type layout for the frame.

As it is now I can easily see this causing issues if someone updates
how GRO/GSO handles frames without realizing they now need to make
sure the bnx2x and qede drivers need to generate the same frame layout
as well.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 23:10             ` Rick Jones
@ 2016-06-23  0:48               ` Rick Jones
  2016-06-23  9:03                 ` Yuval Mintz
  0 siblings, 1 reply; 36+ messages in thread
From: Rick Jones @ 2016-06-23  0:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yuval Mintz, Alexander Duyck, Manish Chopra, David Miller,
	netdev, Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On 06/22/2016 04:10 PM, Rick Jones wrote:
> My systems are presently in the midst of an install but I should be able
> to demonstrate it in the morning (US Pacific time, modulo the shuttle
> service of a car repair place)

The installs finished sooner than I thought.  So, receiver:


root@np-cp1-comp0001-mgmt:/home/stack# uname -a
Linux np-cp1-comp0001-mgmt 4.4.11-2-amd64-hpelinux #hpelinux1 SMP Mon 
May 23 15:39:22 UTC 2016 x86_64 GNU/Linux
root@np-cp1-comp0001-mgmt:/home/stack# ethtool -i hed2
driver: bnx2x
version: 1.712.30-0
firmware-version: bc 7.10.10
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

the hed2 interface is a port of an HPE 630M NIC, based on the BCM57840:

05:00.0 Ethernet controller: Broadcom Corporation BCM57840 NetXtreme II 
10/20-Gigabit Ethernet (rev 11)
	Subsystem: Hewlett-Packard Company HP FlexFabric 20Gb 2-port 630M Adapter

(The pci.ids entry being from before that 10 GbE IP was purchased from 
Broadcom by QLogic...)

Verify that LRO is disabled (IIRC it is enabled by default):

root@np-cp1-comp0001-mgmt:/home/stack# ethtool -k hed2 | grep large
large-receive-offload: off

Verify that disable_tpa is not set:

root@np-cp1-comp0001-mgmt:/home/stack# cat 
/sys/module/bnx2x/parameters/disable_tpa
0

So this means we will see NIC-firmware GRO.

Start a tcpdump on the receiver:
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -s 96 -c 2000000 -i hed2 
-w foo.pcap port 12867
tcpdump: listening on hed2, link-type EN10MB (Ethernet), capture size 96 
bytes

Start a netperf test targeting that system, specifying a smaller MSS:

stack@np-cp1-comp0002-mgmt:~$ ./netperf -H np-cp1-comp0001-guest -- -G 
1400 -P 12867 -O throughput,transport_mss
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-comp0001-guest () port 12867 AF_INET : demo
Throughput Transport
            MSS
            bytes

3372.82    1388

Come back to the receiver and post-process the tcpdump capture to get 
the average segment size for the data segments:

2000000 packets captured
2000916 packets received by filter
0 packets dropped by kernel
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -n -r foo.pcap | fgrep -v 
"length 0" | awk '{sum += $NF}END{print "Average:",sum/NR}'
reading from file foo.pcap, link-type EN10MB (Ethernet)
Average: 2741.93

and finally a snippet of the capture:

00:37:47.333414 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [S], seq 
1236484791, win 28000, options [mss 1400,sackOK,TS val 1491134 ecr 
0,nop,wscale 7], length 0
00:37:47.333488 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [S.], 
seq 134167501, ack 1236484792, win 28960, options [mss 1460,sackOK,TS 
val 1499053 ecr 1491134,nop,wscale 7], length 0
00:37:47.333731 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], ack 
1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 0
00:37:47.333788 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
1:2777, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333815 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
2777, win 270, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333822 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
2777:5553, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333837 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
5553, win 313, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333842 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
5553:8329, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333856 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
8329:11105, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333869 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
8329, win 357, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333879 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
11105:13881, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333891 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
11105, win 400, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333911 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
13881, win 444, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333964 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
13881:16657, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333982 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
16657:19433, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333989 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
19433:22209, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333994 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
22209:24985, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.334011 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
24985:27761, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.334018 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
27761:30537, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.334025 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
30537:33313, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.334031 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
33313:36089, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776

2776 being twice 1388.

happy benchmarking,

rick jones

root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -n -r foo.pcap | fgrep -v 
-e "length 0" | fgrep length | awk '{print $NF}' | sort | uniq -c | sort -nr
reading from file foo.pcap, link-type EN10MB (Ethernet)
1400584 2776
    5930 1388
      17 2544
      15 1620
      13 2560
      13 2000
      12 2604
      12 2504
      12 2456
      12 1708
      ...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-23  0:11                 ` Alexander Duyck
@ 2016-06-23  4:10                   ` Yuval Mintz
  2016-06-23  4:17                     ` Yuval Mintz
  0 siblings, 1 reply; 36+ messages in thread
From: Yuval Mintz @ 2016-06-23  4:10 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet
  Cc: Rick Jones, Manish Chopra, David Miller, netdev, Ariel Elior,
	Tom Herbert, Hannes Frederic Sowa

>> Claiming that hardware assist GRO is not possible is a plain mantra.

> I have no issue claiming hardware assist GRO is possible.  My problem
> is saying that the GRO feature flag can be used to enable it.  I would
> argue that any packet aggregation at the device or driver level is LRO
> regardless of how close it comes to GRO feature wise.  GRO should only
> be occurring in the receive path after calling the appropriate GRO
> function.  Otherwise it becomes really difficult to work around any
> possible issues that the hardware assisted GRO introduces without
> paying a huge penalty.  We need to keep these feature flags separate.
> I thought that was the whole reason why we have the distinction
> between LRO and GRO in the first place.

Then again, if you're basically saying every HW-assisted offload on
receive should be done under LRO flag, what would be the use case
where a GRO-assisted offload would help?

I.e., afaik LRO is superior to GRO in `brute force' -
it creates better packed packets and utilizes memory better
[with all the obvious cons such as inability for defragmentation].
So if you'd have the choice of having an adpater perform 'classic'
LRO aggregation or something that resembles a GRO packet,
what would be the gain from doing the latter?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-23  4:10                   ` Yuval Mintz
@ 2016-06-23  4:17                     ` Yuval Mintz
  2016-06-23 17:07                       ` Alexander Duyck
  0 siblings, 1 reply; 36+ messages in thread
From: Yuval Mintz @ 2016-06-23  4:17 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet
  Cc: Rick Jones, Manish Chopra, David Miller, netdev, Ariel Elior,
	Tom Herbert, Hannes Frederic Sowa

> Then again, if you're basically saying every HW-assisted offload on
> receive should be done under LRO flag, what would be the use case
> where a GRO-assisted offload would help?

> I.e., afaik LRO is superior to GRO in `brute force' -
> it creates better packed packets and utilizes memory better
> [with all the obvious cons such as inability for defragmentation].
> So if you'd have the choice of having an adpater perform 'classic'
> LRO aggregation or something that resembles a GRO packet,
> what would be the gain from doing the latter?

Just to relate to bnx2x/qede differences in current implementation -
when this GRO hw-offload was added to bnx2x, it has already
supported classical LRO, and due to above statement whenever LRO
was set driver aggregated incoming traffic as classic LRO.
I agree that in hindsight the lack of distinction between sw/hw GRO
was hurting us.

qede isn't implementing LRO, so we could easily mark this feature
under LRO there - but question is, given that the adapter can support
LRO, if we're going to suffer from all the shotrages that arise from
putting this feature under LRO, why should we bother?

You can argue that we might need a new feature bit for control
over such a feature; If we don't do that, is there any gain in all of this?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-23  0:48               ` Rick Jones
@ 2016-06-23  9:03                 ` Yuval Mintz
  0 siblings, 0 replies; 36+ messages in thread
From: Yuval Mintz @ 2016-06-23  9:03 UTC (permalink / raw)
  To: Rick Jones, Eric Dumazet
  Cc: Alexander Duyck, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

> > My systems are presently in the midst of an install but I should be
> > able to demonstrate it in the morning (US Pacific time, modulo the
> > shuttle service of a car repair place)
> 
> stack@np-cp1-comp0002-mgmt:~$ ./netperf -H np-cp1-comp0001-guest -- -G
> 1400 -P 12867 -O throughput,transport_mss MIGRATED TCP STREAM TEST from
> 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-comp0001-guest () port 12867
> AF_INET : demo Throughput Transport
>             MSS
>             bytes
> 
> 3372.82    1388
...
> root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -n -r foo.pcap | fgrep -v
> "length 0" | awk '{sum += $NF}END{print "Average:",sum/NR}'
> reading from file foo.pcap, link-type EN10MB (Ethernet)
> Average: 2741.93

Yes, it's a known FW limitation - if MSS is smaller than configured MTU,
aggregation will be closed on 2nd packet; It might get changed in some future
FW version. But I agree it reinforces the need of having some kind of
user-knob for controlling this offloaded feature.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-23  4:17                     ` Yuval Mintz
@ 2016-06-23 17:07                       ` Alexander Duyck
  2016-06-23 21:06                         ` Yuval Mintz
  2016-06-24 13:09                         ` Edward Cree
  0 siblings, 2 replies; 36+ messages in thread
From: Alexander Duyck @ 2016-06-23 17:07 UTC (permalink / raw)
  To: Yuval Mintz
  Cc: Eric Dumazet, Rick Jones, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On Wed, Jun 22, 2016 at 9:17 PM, Yuval Mintz <Yuval.Mintz@qlogic.com> wrote:
>> Then again, if you're basically saying every HW-assisted offload on
>> receive should be done under LRO flag, what would be the use case
>> where a GRO-assisted offload would help?
>
>> I.e., afaik LRO is superior to GRO in `brute force' -
>> it creates better packed packets and utilizes memory better
>> [with all the obvious cons such as inability for defragmentation].
>> So if you'd have the choice of having an adpater perform 'classic'
>> LRO aggregation or something that resembles a GRO packet,
>> what would be the gain from doing the latter?

LRO and GRO shouldn't really differ in packing or anything like that.
The big difference between the two is that LRO is destructive while
GRO is not.  Specifically in the case of GRO you should be able to
take the resultant frame, feed it through GSO, and get the original
stream of frames back out.  So you can pack the frames however you
want the only key is that you must capture all the correct offsets and
set the gso_size correct for the flow.

> Just to relate to bnx2x/qede differences in current implementation -
> when this GRO hw-offload was added to bnx2x, it has already
> supported classical LRO, and due to above statement whenever LRO
> was set driver aggregated incoming traffic as classic LRO.
> I agree that in hindsight the lack of distinction between sw/hw GRO
> was hurting us.

In the case of bnx2x it sounds like you have issues that are
significantly hurting the performance versus classic software GRO.  If
that is the case you might want to simply flip the logic for the
module parameter that Rick mentioned and just disable the hardware
assisted GRO unless it is specifically requested.

> qede isn't implementing LRO, so we could easily mark this feature
> under LRO there - but question is, given that the adapter can support
> LRO, if we're going to suffer from all the shotrages that arise from
> putting this feature under LRO, why should we bother?

The idea is to address feature isolation.  The fact is the hardware
exists outside of kernel control.  If you end up linking an internal
kernel feature to your device like this you are essentially stripping
the option of using the kernel feature.

I would prefer to see us extend LRO to support "close enough GRO"
instead of have us extend GRO to also include LRO.  That way when we
encounter issues like the FW limitation that Rick encountered he can
just go in and disable the LRO and have true GRO kick in which would
be significantly better than having to poke around through
documentation to find a module parameter that can force the feature
off.  Really the fact that you have to use a module parameter is
frowned upon as well as most drivers aren't supposed to be using those
in the netdev tree.

> You can argue that we might need a new feature bit for control
> over such a feature; If we don't do that, is there any gain in all of this?

I would argue that yes there are many cases where we will be able to
show gain.  The fact is there is a strong likelihood of the GRO on
your parts having some differences either now, or at some point in the
future as the code evolves.  As I mentioned there was already some
talk about possibly needing to push the UDP tunnel aggregation out of
GRO and perhaps handling it sometime after IP look up had verified
that the destination was in fact a local address in the namespace.  In
addition it makes the changes to include the tunnel encapsulation much
more acceptable as LRO is already naturally dropped in the routing and
bridging cases if I recall correctly.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-23 17:07                       ` Alexander Duyck
@ 2016-06-23 21:06                         ` Yuval Mintz
  2016-06-23 23:20                           ` Alexander Duyck
  2016-06-24 13:09                         ` Edward Cree
  1 sibling, 1 reply; 36+ messages in thread
From: Yuval Mintz @ 2016-06-23 21:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Rick Jones, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

>>> Then again, if you're basically saying every HW-assisted offload on
>>> receive should be done under LRO flag, what would be the use case
>>> where a GRO-assisted offload would help?
>>> I.e., afaik LRO is superior to GRO in `brute force' -
>>> it creates better packed packets and utilizes memory better
>>> [with all the obvious cons such as inability for defragmentation].
>>> So if you'd have the choice of having an adpater perform 'classic'
>>> LRO aggregation or something that resembles a GRO packet,
>>> what would be the gain from doing the latter?

> LRO and GRO shouldn't really differ in packing or anything like that.
> The big difference between the two is that LRO is destructive while
> GRO is not.  Specifically in the case of GRO you should be able to
> take the resultant frame, feed it through GSO, and get the original
> stream of frames back out.  So you can pack the frames however you
> want the only key is that you must capture all the correct offsets and
> set the gso_size correct for the flow.

While the implementation might lack in things [such as issues with
future implementation], following your logic it is GRO - I.e., forwarding
scenarios work fine with HW assisted GRO.

>> Just to relate to bnx2x/qede differences in current implementation -
>> when this GRO hw-offload was added to bnx2x, it has already
>> supported classical LRO, and due to above statement whenever LRO
>> was set driver aggregated incoming traffic as classic LRO.
>> I agree that in hindsight the lack of distinction between sw/hw GRO
>> was hurting us.

> In the case of bnx2x it sounds like you have issues that are
> significantly hurting the performance versus classic software GRO.  If
> that is the case you might want to simply flip the logic for the
> module parameter that Rick mentioned and just disable the hardware
> assisted GRO unless it is specifically requested.

A bit hard to flip; The module parameter also disables LRO support.
And given that module parameters is mostly a thing of the past, I
don't think we should strive fixing things through additional changes
in that area.

> > qede isn't implementing LRO, so we could easily mark this feature
> > under LRO there - but question is, given that the adapter can support
> > LRO, if we're going to suffer from all the shotrages that arise from
> > putting this feature under LRO, why should we bother?

> The idea is to address feature isolation.  The fact is the hardware
> exists outside of kernel control.  If you end up linking an internal
> kernel feature to your device like this you are essentially stripping
> the option of using the kernel feature.

> I would prefer to see us extend LRO to support "close enough GRO"
> instead of have us extend GRO to also include LRO. 

Again - why? What's the benefit of HW doing LRO and trying to
control imitate GRO, if it's still carrying all the LRO baggage
[specifically, disabling it on forwarding] as opposed to simply
doing classic LRO?

> > You can argue that we might need a new feature bit for control
> > over such a feature; If we don't do that, is there any gain in all of this?

> I would argue that yes there are many cases where we will be able to
> show gain.  The fact is there is a strong likelihood of the GRO on
> your parts having some differences either now, or at some point in the
> future as the code evolves.  As I mentioned there was already some
> talk about possibly needing to push the UDP tunnel aggregation out of
> GRO and perhaps handling it sometime after IP look up had verified
> that the destination was in fact a local address in the namespace.  In
> addition it makes the changes to include the tunnel encapsulation much
> more acceptable as LRO is already naturally dropped in the routing and
> bridging cases if I recall correctly.

I think it all boils down to the question of "do we actually want to have
HW-assisted GRO?". If we do [and not necessarily for the UDP-tunnel
scenario] then we need to have it distinct from LRO, otherwise there's
very little gain. If we believe tGRO should remain SW-only, then
I think the discussion is mott; We need to stop trying this, and offload
only LRO - in which case we can aggregate it in whichever 'destructive'
[correct] format we like, without trying to have it resemble GRO.
     

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-23 21:06                         ` Yuval Mintz
@ 2016-06-23 23:20                           ` Alexander Duyck
  2016-06-24  5:20                             ` Yuval Mintz
  0 siblings, 1 reply; 36+ messages in thread
From: Alexander Duyck @ 2016-06-23 23:20 UTC (permalink / raw)
  To: Yuval Mintz
  Cc: Eric Dumazet, Rick Jones, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On Thu, Jun 23, 2016 at 2:06 PM, Yuval Mintz <Yuval.Mintz@qlogic.com> wrote:
>>>> Then again, if you're basically saying every HW-assisted offload on
>>>> receive should be done under LRO flag, what would be the use case
>>>> where a GRO-assisted offload would help?
>>>> I.e., afaik LRO is superior to GRO in `brute force' -
>>>> it creates better packed packets and utilizes memory better
>>>> [with all the obvious cons such as inability for defragmentation].
>>>> So if you'd have the choice of having an adpater perform 'classic'
>>>> LRO aggregation or something that resembles a GRO packet,
>>>> what would be the gain from doing the latter?
>
>> LRO and GRO shouldn't really differ in packing or anything like that.
>> The big difference between the two is that LRO is destructive while
>> GRO is not.  Specifically in the case of GRO you should be able to
>> take the resultant frame, feed it through GSO, and get the original
>> stream of frames back out.  So you can pack the frames however you
>> want the only key is that you must capture all the correct offsets and
>> set the gso_size correct for the flow.
>
> While the implementation might lack in things [such as issues with
> future implementation], following your logic it is GRO - I.e., forwarding
> scenarios work fine with HW assisted GRO.
>
>>> Just to relate to bnx2x/qede differences in current implementation -
>>> when this GRO hw-offload was added to bnx2x, it has already
>>> supported classical LRO, and due to above statement whenever LRO
>>> was set driver aggregated incoming traffic as classic LRO.
>>> I agree that in hindsight the lack of distinction between sw/hw GRO
>>> was hurting us.
>
>> In the case of bnx2x it sounds like you have issues that are
>> significantly hurting the performance versus classic software GRO.  If
>> that is the case you might want to simply flip the logic for the
>> module parameter that Rick mentioned and just disable the hardware
>> assisted GRO unless it is specifically requested.
>
> A bit hard to flip; The module parameter also disables LRO support.
> And given that module parameters is mostly a thing of the past, I
> don't think we should strive fixing things through additional changes
> in that area.
>
>> > qede isn't implementing LRO, so we could easily mark this feature
>> > under LRO there - but question is, given that the adapter can support
>> > LRO, if we're going to suffer from all the shotrages that arise from
>> > putting this feature under LRO, why should we bother?
>
>> The idea is to address feature isolation.  The fact is the hardware
>> exists outside of kernel control.  If you end up linking an internal
>> kernel feature to your device like this you are essentially stripping
>> the option of using the kernel feature.
>
>> I would prefer to see us extend LRO to support "close enough GRO"
>> instead of have us extend GRO to also include LRO.
>
> Again - why? What's the benefit of HW doing LRO and trying to
> control imitate GRO, if it's still carrying all the LRO baggage
> [specifically, disabling it on forwarding] as opposed to simply
> doing classic LRO?

In most cases that is all LRO is.  LRO is trying to do the best it can
to emulate GRO.  The reason why it was pushed off into a separate
feature bit is that it never quite works out that way and most vendors
end up with something that comes close but is always missing a few
items.

The "hardware assisted GRO" is really nothing more than a marketing
term and a justification to ignore the fact that we should probably be
disabling it when routing or bridging is enabled.  What you are doing
is LRO but using the wrong bit to test for the feature.  We already
know of one firmware bug you guys have which makes it clear that the
bnx2x is not doing hardware assisted GRO it is doing something else
since it performs much worse than GRO if the MSS is less than what it
would be based on the MTU.

>> > You can argue that we might need a new feature bit for control
>> > over such a feature; If we don't do that, is there any gain in all of this?
>
>> I would argue that yes there are many cases where we will be able to
>> show gain.  The fact is there is a strong likelihood of the GRO on
>> your parts having some differences either now, or at some point in the
>> future as the code evolves.  As I mentioned there was already some
>> talk about possibly needing to push the UDP tunnel aggregation out of
>> GRO and perhaps handling it sometime after IP look up had verified
>> that the destination was in fact a local address in the namespace.  In
>> addition it makes the changes to include the tunnel encapsulation much
>> more acceptable as LRO is already naturally dropped in the routing and
>> bridging cases if I recall correctly.
>
> I think it all boils down to the question of "do we actually want to have
> HW-assisted GRO?". If we do [and not necessarily for the UDP-tunnel
> scenario] then we need to have it distinct from LRO, otherwise there's
> very little gain. If we believe tGRO should remain SW-only, then
> I think the discussion is mott; We need to stop trying this, and offload
> only LRO - in which case we can aggregate it in whichever 'destructive'
> [correct] format we like, without trying to have it resemble GRO.

I believe GRO should be software only.  The LRO feature is meant to
represent hardware assisted aggregation.  If you want to take steps to
define it further or reduce the limitations so that it is not disabled
when routing or bridging is enabled for a device I would be fine with
that.

One of the reasons why LRO is disabled by default for routing and
bridging is because the feature has always been somewhat poorly
defined and resulted in us getting all sort of frames, some that were
route-able and some that weren't.  Arguably your hardware is on the
better end of this but it is still not without its issues. Perhaps
this might be an opportunity to go through and redefine what LRO is so
that we can make better use of it like we did with the ntuple/nfc
stuff a few years ago.  Then it would be easier to go through and make
it programmable so we can do things like turn on and off tunnel
offloads, maybe IPv4/IPv6 support, and perhaps have a flag for
identifying GRO like LRO versus everything else.  That way the end
user has more control over the configuration and can pick and choose
what they want.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-23 23:20                           ` Alexander Duyck
@ 2016-06-24  5:20                             ` Yuval Mintz
  2016-06-24 16:44                               ` Alexander Duyck
  0 siblings, 1 reply; 36+ messages in thread
From: Yuval Mintz @ 2016-06-24  5:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Rick Jones, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

>We already know of one firmware bug you guys have which makes 
>it clear that the bnx2x is not doing hardware assisted GRO it is doing 
>something else since it performs much worse than GRO if the MSS is 
>less than what it would be based on the MTU.

It's a bit nitpicky, isn't it? Claiming this flaw means it's not GRO.
I.e., you obviously wouldn't have claimed it beacme GRO if it
was fixed.

Not saying it makes a lot of difference, though.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-23 17:07                       ` Alexander Duyck
  2016-06-23 21:06                         ` Yuval Mintz
@ 2016-06-24 13:09                         ` Edward Cree
  2016-06-24 16:31                           ` Tom Herbert
  1 sibling, 1 reply; 36+ messages in thread
From: Edward Cree @ 2016-06-24 13:09 UTC (permalink / raw)
  To: Alexander Duyck, Yuval Mintz
  Cc: Eric Dumazet, Rick Jones, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa, Bert Kenward

On 23/06/16 18:07, Alexander Duyck wrote:
> I would prefer to see us extend LRO to support "close enough GRO"
> instead of have us extend GRO to also include LRO.
This reminds me of something I've been meaning to bring up (sorry for
slightly OT, but it might turn out relevant after all).
In sfc we have an (out-of-tree and likely staying that way) LRO that's
entirely in software.  The only reason it exists is for users who want
the 'permissive' merging behaviour of LRO, i.e. they don't need the
guarantees of reversibility and by merging more stuff they can get
slightly higher performance.
I wonder if it would be a good idea for the GRO implementation to have
some knobs to allow setting it to behave in this way.
That would imply a scheme to define various GRO/SSR semantics, which
then would also be a convenient interface for a driver to report the
semantics of its hardware LRO if it has any.
And it would make crystal clear that the difference between GRO and
LRO is kernel vs hardware, rather than reversible vs not.

-Ed

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-24 13:09                         ` Edward Cree
@ 2016-06-24 16:31                           ` Tom Herbert
  2016-06-24 17:21                             ` Edward Cree
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Herbert @ 2016-06-24 16:31 UTC (permalink / raw)
  To: Edward Cree
  Cc: Alexander Duyck, Yuval Mintz, Eric Dumazet, Rick Jones,
	Manish Chopra, David Miller, netdev, Ariel Elior,
	Hannes Frederic Sowa, Bert Kenward

On Fri, Jun 24, 2016 at 6:09 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 23/06/16 18:07, Alexander Duyck wrote:
>> I would prefer to see us extend LRO to support "close enough GRO"
>> instead of have us extend GRO to also include LRO.
> This reminds me of something I've been meaning to bring up (sorry for
> slightly OT, but it might turn out relevant after all).
> In sfc we have an (out-of-tree and likely staying that way) LRO that's
> entirely in software.  The only reason it exists is for users who want
> the 'permissive' merging behaviour of LRO, i.e. they don't need the
> guarantees of reversibility and by merging more stuff they can get
> slightly higher performance.
> I wonder if it would be a good idea for the GRO implementation to have
> some knobs to allow setting it to behave in this way.
> That would imply a scheme to define various GRO/SSR semantics, which
> then would also be a convenient interface for a driver to report the
> semantics of its hardware LRO if it has any.
> And it would make crystal clear that the difference between GRO and
> LRO is kernel vs hardware, rather than reversible vs not.
>
Ed,

Because you took this OT... ;-)

LRO/GRO is the one of the five common offloads that has no generic
analogue and requires protocol specific logic. For instance each
IP-over-foo encapsulation needs kernel code for GRO, device/driver
code for LRO. I think the answer here is to make both GRO and LRO to
be user programmable via BPF. That is, instead of needing to add code
or buy a new device to support every new protocol, we really just want
to write a program for it that runs in any environment. In the case of
LRO this becomes especially important since it resolves the "black
box" nature of LRO in devices, so design goals like ensuring LRO is
"close enough to GRO" become something we (the stack) have some
control over.

We've already moved GRO for to be a attribute of a UDP sockets, it is
not much of a stretch now to allow applications to define their own
GRO for their protocols (I'm thinking protocols like QUIC or TOU could
use this).

For programmable LRO I think the solution is to use XDP. For instance
protocol specific parsing would done by the BPF program to identify
the flows, and the infrastructure would provide the backend handling.
The advantage of XDP for this is that it is not platform specific, so
programmable LRO could be implemented in the driver (probably
leveraging existing LRO solution like sfc), or it could be implemented
implemented in HW using the exact same program (with HW support for
BPF/XDP). Since such a program allows arbitrary parsing and flow
lookup, we can match on specific n-tuples as needed to resolve the
UDP-encapsulation identification problem.

Tom

> -Ed

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-24  5:20                             ` Yuval Mintz
@ 2016-06-24 16:44                               ` Alexander Duyck
  0 siblings, 0 replies; 36+ messages in thread
From: Alexander Duyck @ 2016-06-24 16:44 UTC (permalink / raw)
  To: Yuval Mintz
  Cc: Eric Dumazet, Rick Jones, Manish Chopra, David Miller, netdev,
	Ariel Elior, Tom Herbert, Hannes Frederic Sowa

On Thu, Jun 23, 2016 at 10:20 PM, Yuval Mintz <Yuval.Mintz@qlogic.com> wrote:
>>We already know of one firmware bug you guys have which makes
>>it clear that the bnx2x is not doing hardware assisted GRO it is doing
>>something else since it performs much worse than GRO if the MSS is
>>less than what it would be based on the MTU.
>
> It's a bit nitpicky, isn't it? Claiming this flaw means it's not GRO.
> I.e., you obviously wouldn't have claimed it beacme GRO if it
> was fixed.
>
> Not saying it makes a lot of difference, though.

The fact is LRO and GRO are two separate things.  Even without the bug
in the firmware I would still be saying the are two very different
things.  Your GRO implementation only supports a subset of the
features that GRO has in the hardware.  The way the kernel has
implemented things we should keep GRO and GSO symmetric if at all
possible.  There aren't currently any GSO hardware offloads so there
probably shouldn't be any GRO hardware offloads.  On the other hand
devices do support TSO hardware offloads so it would make sense to
match that up and support LRO as the hardware equivalent of GRO.

Anyway that is my opinion.  It may be nitpicky but I don't fee that we
should be re-purposing feature bits that were meant to be software
features to represent hardware ones.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-24 16:31                           ` Tom Herbert
@ 2016-06-24 17:21                             ` Edward Cree
  2016-06-26  6:09                               ` Yuval Mintz
  0 siblings, 1 reply; 36+ messages in thread
From: Edward Cree @ 2016-06-24 17:21 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexander Duyck, Yuval Mintz, Eric Dumazet, Rick Jones,
	Manish Chopra, David Miller, netdev, Ariel Elior,
	Hannes Frederic Sowa, Bert Kenward

On 24/06/16 17:31, Tom Herbert wrote:
> Ed,
> Because you took this OT... ;-)
>
> LRO/GRO is the one of the five common offloads that has no generic
> analogue and requires protocol specific logic. For instance each
> IP-over-foo encapsulation needs kernel code for GRO, device/driver
> code for LRO. I think the answer here is to make both GRO and LRO to
> be user programmable via BPF.
I agree that the only way to make LRO generic is to go for hardware
BPF.  However, I think that's likely to cause a _lot_ of headaches to
implement and my hope is that we can instead get acceptable receive
performance from GRO, RSS, and maybe things like the skb bundling I
posted a while back.
For instance, if your 'source port hack' were to mix in the TNI as
well as the inner flow fields it already uses, I think that could
improve hash spreading and thus GRO would perform better.
Fundamentally I believe that robust, responsive hardware LRO is not
workable as the hardware would have to decide to hold onto packets in
the hope of merge candidates arriving soon after.  Whereas in the
software layer (GRO, bundling...), the packets are already coming in
bursts thanks to the way napi polling behaves.
But I'd love to be proved wrong :)  The 'hybrid' approach of using
bpf in hw to identify flows for sw to gro does seem plausible, maybe
having bpf to compute the rxhash is the answer?

-Ed

(disclaimer: definitely not speaking for my employer here, these are
my personal views only.)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-24 17:21                             ` Edward Cree
@ 2016-06-26  6:09                               ` Yuval Mintz
  0 siblings, 0 replies; 36+ messages in thread
From: Yuval Mintz @ 2016-06-26  6:09 UTC (permalink / raw)
  To: Edward Cree, Tom Herbert
  Cc: Alexander Duyck, Eric Dumazet, Rick Jones, Manish Chopra,
	David Miller, netdev, Ariel Elior, Hannes Frederic Sowa,
	Bert Kenward

> Fundamentally I believe that robust, responsive hardware LRO is not workable as
> the hardware would have to decide to hold onto packets in the hope of merge
> candidates arriving soon after.  Whereas in the software layer (GRO,
> bundling...), the packets are already coming in bursts thanks to the way napi
> polling behaves.

Sounds like an additional coalescing configuration to me [assuming HW supports it].

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
  2016-06-22 21:52         ` Rick Jones
  2016-06-22 22:47           ` Eric Dumazet
@ 2016-06-26 19:53           ` David Miller
  1 sibling, 0 replies; 36+ messages in thread
From: David Miller @ 2016-06-26 19:53 UTC (permalink / raw)
  To: rick.jones2
  Cc: Yuval.Mintz, alexander.duyck, manish.chopra, netdev, Ariel.Elior,
	tom, hannes

From: Rick Jones <rick.jones2@hpe.com>
Date: Wed, 22 Jun 2016 14:52:34 -0700

> To get that disabled, one must also get the bnx2x module loaded with
> "disable-tpa=1" so the Linux stack GRO gets used instead.

And people wonder why I'm difficult about module parameters in
drivers....

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2016-06-26 19:53 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-22  8:25 [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Manish Chopra
2016-06-22  8:25 ` [PATCH net-next 1/5] net: export udp and gre gro_complete() APIs Manish Chopra
2016-06-22  8:25 ` [PATCH net-next 2/5] qede: Add support to handle VXLAN hardware GRO packets Manish Chopra
2016-06-22  8:25 ` [PATCH net-next 3/5] qede: Add support to handle GENEVE " Manish Chopra
2016-06-22  8:25 ` [PATCH net-next 4/5] qede: Add support to handle GRE " Manish Chopra
2016-06-22  8:25 ` [PATCH net-next 5/5] qed: Enable hardware GRO feature for encapsulated packets Manish Chopra
2016-06-22 16:27 ` [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support Alexander Duyck
2016-06-22 17:16   ` Yuval Mintz
2016-06-22 17:45     ` Alexander Duyck
2016-06-22 18:22       ` Yuval Mintz
2016-06-22 21:32         ` Alexander Duyck
2016-06-22 22:32           ` Hannes Frederic Sowa
2016-06-22 23:42           ` Eric Dumazet
2016-06-22 21:52         ` Rick Jones
2016-06-22 22:47           ` Eric Dumazet
2016-06-22 22:56             ` Alexander Duyck
2016-06-22 23:31               ` Eric Dumazet
2016-06-22 23:59                 ` Tom Herbert
2016-06-23  0:11                 ` Alexander Duyck
2016-06-23  4:10                   ` Yuval Mintz
2016-06-23  4:17                     ` Yuval Mintz
2016-06-23 17:07                       ` Alexander Duyck
2016-06-23 21:06                         ` Yuval Mintz
2016-06-23 23:20                           ` Alexander Duyck
2016-06-24  5:20                             ` Yuval Mintz
2016-06-24 16:44                               ` Alexander Duyck
2016-06-24 13:09                         ` Edward Cree
2016-06-24 16:31                           ` Tom Herbert
2016-06-24 17:21                             ` Edward Cree
2016-06-26  6:09                               ` Yuval Mintz
2016-06-22 23:52               ` Rick Jones
2016-06-23  0:18                 ` Alexander Duyck
2016-06-22 23:10             ` Rick Jones
2016-06-23  0:48               ` Rick Jones
2016-06-23  9:03                 ` Yuval Mintz
2016-06-26 19:53           ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.