* [net PATCH v2 0/2] Fixes for GRO and GRE tunnels
@ 2016-04-04 16:27 Alexander Duyck
  2016-04-04 16:28 ` [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU Alexander Duyck
  2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
  0 siblings, 2 replies; 25+ messages in thread
From: Alexander Duyck @ 2016-04-04 16:27 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem

This pair of patches addresses a few issues I have discovered over the last
week or so concerning GRO and GRE tunnels.

The first patch addresses an item I called out as an issue with FOU/GUE
encapsulating GRE.  I finally had a chance to test it and verify that the
code concerned was indeed broken, so I took the opportunity to fix it so
that we can no longer generate a FOU/GUE frame that encapsulates a GRE
tunnel with checksum while requesting TSO/GSO for the frame.

The second patch addresses an issue I realized could occur if we feed a
tunnel through GRO and back out through GSO.  Specifically, it was possible
for GRO to generate overlapping IPv4 ID ranges because the outer IP IDs
were being ignored for tunnels.  Ignoring the IP IDs like this is only
valid if the DF bit is set.  That is normally the case for IPIP, SIT, and
GRE tunnels, but not for UDP tunnels.  In the case that the DF bit is not
set we store off the fact that there was a delta from what we were
expecting, and when we hit the inner-most header we validate the value so
as to avoid generating a frame that could lead to an IP ID collision on
packets that might eventually be fragmented.  A side effect is that the
inner-most IP ID test is relaxed as well, but the worst-case scenario is
that we GRO a frame with a throw-away ID sequence anyway, so segmenting
such a frame with the wrong IP IDs should have no negative effects.

v2: Updated RFC 6864 patch so that we only support one of two modes: either
    the IP ID is incrementing, or it is fixed at a specific value with the
    DF bit set.

---

Alexander Duyck (2):
      GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
      ipv4/GRO: Make GRO conform to RFC 6864


 include/linux/netdevice.h |    5 ++++-
 net/core/dev.c            |    2 ++
 net/ipv4/af_inet.c        |   25 ++++++++++++++++++-------
 net/ipv4/fou.c            |    6 ++++++
 net/ipv4/gre_offload.c    |    8 ++++++++
 net/ipv4/ip_gre.c         |   13 ++++++++++---
 net/ipv6/ip6_offload.c    |    3 ---
 7 files changed, 48 insertions(+), 14 deletions(-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
  2016-04-04 16:27 [net PATCH v2 0/2] Fixes for GRO and GRE tunnels Alexander Duyck
@ 2016-04-04 16:28 ` Alexander Duyck
  2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
  1 sibling, 0 replies; 25+ messages in thread
From: Alexander Duyck @ 2016-04-04 16:28 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem

This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE.  Without this patch I was barely able to get 1 Gb/s of throughput;
with it applied I am now getting at least around 6 Gb/s.

The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software, as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header.  As such, we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.

Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 include/linux/netdevice.h |    5 ++++-
 net/core/dev.c            |    1 +
 net/ipv4/fou.c            |    6 ++++++
 net/ipv4/gre_offload.c    |    8 ++++++++
 net/ipv4/ip_gre.c         |   13 ++++++++++---
 5 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb0d5d09c2e4..8395308a2445 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2120,7 +2120,10 @@ struct napi_gro_cb {
 	/* Used in foo-over-udp, set in udp[46]_gro_receive */
 	u8	is_ipv6:1;
 
-	/* 7 bit hole */
+	/* Used in GRE, set in fou/gue_gro_receive */
+	u8	is_fou:1;
+
+	/* 6 bit hole */
 
 	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
 	__wsum	csum;
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe77d913..77a71cd68535 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4439,6 +4439,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 		NAPI_GRO_CB(skb)->flush = 0;
 		NAPI_GRO_CB(skb)->free = 0;
 		NAPI_GRO_CB(skb)->encap_mark = 0;
+		NAPI_GRO_CB(skb)->is_fou = 0;
 		NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
 
 		/* Setup for GRO checksum validation */
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 5a94aea280d3..a39068b4a4d9 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -203,6 +203,9 @@ static struct sk_buff **fou_gro_receive(struct sk_buff **head,
 	 */
 	NAPI_GRO_CB(skb)->encap_mark = 0;
 
+	/* Flag this frame as already having an outer encap header */
+	NAPI_GRO_CB(skb)->is_fou = 1;
+
 	rcu_read_lock();
 	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[proto]);
@@ -368,6 +371,9 @@ static struct sk_buff **gue_gro_receive(struct sk_buff **head,
 	 */
 	NAPI_GRO_CB(skb)->encap_mark = 0;
 
+	/* Flag this frame as already having an outer encap header */
+	NAPI_GRO_CB(skb)->is_fou = 1;
+
 	rcu_read_lock();
 	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[guehdr->proto_ctype]);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index c47539d04b88..6a5bd4317866 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -150,6 +150,14 @@ static struct sk_buff **gre_gro_receive(struct sk_buff **head,
 	if ((greh->flags & ~(GRE_KEY|GRE_CSUM)) != 0)
 		goto out;
 
+	/* We can only support GRE_CSUM if we can track the location of
+	 * the GRE header.  In the case of FOU/GUE we cannot because the
+	 * outer UDP header displaces the GRE header leaving us in a state
+	 * of limbo.
+	 */
+	if ((greh->flags & GRE_CSUM) && NAPI_GRO_CB(skb)->is_fou)
+		goto out;
+
 	type = greh->protocol;
 
 	rcu_read_lock();
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 31936d387cfd..af5d1f38217f 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -862,9 +862,16 @@ static void __gre_tunnel_init(struct net_device *dev)
 	dev->hw_features	|= GRE_FEATURES;
 
 	if (!(tunnel->parms.o_flags & TUNNEL_SEQ)) {
-		/* TCP offload with GRE SEQ is not supported. */
-		dev->features    |= NETIF_F_GSO_SOFTWARE;
-		dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+		/* TCP offload with GRE SEQ is not supported, nor
+		 * can we support 2 levels of outer headers requiring
+		 * an update.
+		 */
+		if (!(tunnel->parms.o_flags & TUNNEL_CSUM) ||
+		    (tunnel->encap.type == TUNNEL_ENCAP_NONE)) {
+			dev->features    |= NETIF_F_GSO_SOFTWARE;
+			dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+		}
+
 		/* Can use a lockless transmit, unless we generate
 		 * output sequences
 		 */


* [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-04 16:27 [net PATCH v2 0/2] Fixes for GRO and GRE tunnels Alexander Duyck
  2016-04-04 16:28 ` [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU Alexander Duyck
@ 2016-04-04 16:31 ` Alexander Duyck
  2016-04-05  0:38   ` subashab
  2016-04-05  3:44   ` Herbert Xu
  1 sibling, 2 replies; 25+ messages in thread
From: Alexander Duyck @ 2016-04-04 16:31 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem

RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
than fragmentation and reassembly.  Currently we are looking at this field
as a way of identifying which frames can be aggregated for GRO and which
cannot.  While this is valid for frames that do not have DF set, it is
invalid to do so if the bit is set.

In addition we were generating IPv4 ID collisions when 2 or more flows were
interleaved over the same tunnel.  To prevent that we store the result of
all IP ID checks via a "|=" instead of overwriting previous values.

With this patch we support two different approaches for the IP ID field.
The first is a non-incrementing IP ID with DF bit set.  In such a case we
simply won't write to the flush_id field in the GRO context block.  The
other option is the legacy option in which the IP ID must increment by 1
for every packet we aggregate.

In the case of the non-incrementing IP ID we will end up losing the data
that the IP ID is fixed.  However as per RFC 6864 we should be able to
write any value into the IP ID when the DF bit is set so this should cause
minimal harm.
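
The two supported modes can be sketched in plain C (a simplified stand-in
for the inet_gro_receive() hunk in this patch; the struct and field names
here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

#define IP_DF 0x4000	/* don't-fragment flag, host order in this sketch */

/* Illustrative stand-ins for the relevant NAPI_GRO_CB() fields. */
struct gro_ctx {
	uint16_t flush_id;	/* non-zero => segments must not be merged */
	uint16_t count;		/* packets already held for this flow */
};

/* Either the ID is fixed with DF set (delta stays 0 and we skip the
 * write entirely), or the ID must have advanced by exactly `count`
 * since the first held packet (delta ^ count == 0).  Anything else
 * accumulates a non-zero flush_id via |= so a later matching packet
 * cannot clear an earlier mismatch. */
static void ip_id_check(struct gro_ctx *ctx, uint16_t held_id,
			uint16_t new_id, uint16_t frag_off)
{
	uint16_t flush_id = (uint16_t)(new_id - held_id);

	if (flush_id || !(frag_off & IP_DF))
		ctx->flush_id |= flush_id ^ ctx->count;
}
```

A fixed ID with DF set leaves flush_id untouched; an ID that increments by
one per held packet cancels against count; everything else flags a flush.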

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---

v2: Updated patch so that we now only support one of two options.  Either
    the IP ID is fixed with DF bit set, or the IP ID is incrementing.  That
    allows us to support the fixed ID case as occurs with IPv6 to IPv4
    header translation and what is likely already out there for some
    devices with tunnel headers.

 net/core/dev.c         |    1 +
 net/ipv4/af_inet.c     |   25 ++++++++++++++++++-------
 net/ipv6/ip6_offload.c |    3 ---
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 77a71cd68535..3429632398a4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
 		unsigned long diffs;
 
 		NAPI_GRO_CB(p)->flush = 0;
+		NAPI_GRO_CB(p)->flush_id = 0;
 
 		if (hash != skb_get_hash_raw(p)) {
 			NAPI_GRO_CB(p)->same_flow = 0;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbae..33f6335448a2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1324,6 +1324,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 
 	for (p = *head; p; p = p->next) {
 		struct iphdr *iph2;
+		u16 flush_id;
 
 		if (!NAPI_GRO_CB(p)->same_flow)
 			continue;
@@ -1347,14 +1348,24 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 			(iph->tos ^ iph2->tos) |
 			((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
 
-		/* Save the IP ID check to be included later when we get to
-		 * the transport layer so only the inner most IP ID is checked.
-		 * This is because some GSO/TSO implementations do not
-		 * correctly increment the IP ID for the outer hdrs.
-		 */
-		NAPI_GRO_CB(p)->flush_id =
-			    ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
 		NAPI_GRO_CB(p)->flush |= flush;
+
+		/* We must save the offset as it is possible to have multiple
+		 * flows using the same protocol and address pairs so we
+		 * need to wait until we can validate this is part of the
+		 * same flow with a 5-tuple or better to avoid unnecessary
+		 * collisions between flows.  We can support one of two
+		 * possible scenarios, either a fixed value with DF bit set
+		 * or an incrementing value with DF either set or unset.
+		 * In the case of a fixed value we will end up losing the
+		 * data that the IP ID was a fixed value, however per RFC
+		 * 6864 in such a case the actual value of the IP ID is
+		 * meant to be ignored anyway.
+		 */
+		flush_id = (u16)(id - ntohs(iph2->id));
+		if (flush_id || !(iph2->frag_off & htons(IP_DF)))
+			NAPI_GRO_CB(p)->flush_id |= flush_id ^
+						    NAPI_GRO_CB(p)->count;
 	}
 
 	NAPI_GRO_CB(skb)->flush |= flush;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 82e9f3076028..9aa53f64dffd 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head,
 		/* flush if Traffic Class fields are different */
 		NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
 		NAPI_GRO_CB(p)->flush |= flush;
-
-		/* Clear flush_id, there's really no concept of ID in IPv6. */
-		NAPI_GRO_CB(p)->flush_id = 0;
 	}
 
 	NAPI_GRO_CB(skb)->flush |= flush;


* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
@ 2016-04-05  0:38   ` subashab
  2016-04-05  3:44   ` Herbert Xu
  1 sibling, 0 replies; 25+ messages in thread
From: subashab @ 2016-04-05  0:38 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem,
	netdev-owner

On 2016-04-04 10:31, Alexander Duyck wrote:
> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes 
> other
> than fragmentation and reassembly.  Currently we are looking at this 
> field
> as a way of identifying what frames can be aggregated and  which cannot 
> for
> GRO.  While this is valid for frames that do not have DF set, it is 
> invalid
> to do so if the bit is set.
> 
> In addition we were generating IPv4 ID collisions when 2 or more flows 
> were
> interleaved over the same tunnel.  To prevent that we store the result 
> of
> all IP ID checks via a "|=" instead of overwriting previous values.
> 
> With this patch we support two different approaches for the IP ID 
> field.
> The first is a non-incrementing IP ID with DF bit set.  In such a case 
> we
> simply won't write to the flush_id field in the GRO context block.  The
> other option is the legacy option in which the IP ID must increment by 
> 1
> for every packet we aggregate.
> 
> In the case of the non-incrementing IP ID we will end up losing the 
> data
> that the IP ID is fixed.  However as per RFC 6864 we should be able to
> write any value into the IP ID when the DF bit is set so this should 
> cause
> minimal harm.
> 
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
> 
> v2: Updated patch so that we now only support one of two options.  
> Either
>     the IP ID is fixed with DF bit set, or the IP ID is incrementing.  
> That
>     allows us to support the fixed ID case as occurs with IPv6 to IPv4
>     header translation and what is likely already out there for some
>     devices with tunnel headers.
> 
>  net/core/dev.c         |    1 +
>  net/ipv4/af_inet.c     |   25 ++++++++++++++++++-------
>  net/ipv6/ip6_offload.c |    3 ---
>  3 files changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 77a71cd68535..3429632398a4 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct
> *napi, struct sk_buff *skb)
>  		unsigned long diffs;
> 
>  		NAPI_GRO_CB(p)->flush = 0;
> +		NAPI_GRO_CB(p)->flush_id = 0;
> 
>  		if (hash != skb_get_hash_raw(p)) {
>  			NAPI_GRO_CB(p)->same_flow = 0;
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 9e481992dbae..33f6335448a2 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1324,6 +1324,7 @@ static struct sk_buff **inet_gro_receive(struct
> sk_buff **head,
> 
>  	for (p = *head; p; p = p->next) {
>  		struct iphdr *iph2;
> +		u16 flush_id;
> 
>  		if (!NAPI_GRO_CB(p)->same_flow)
>  			continue;
> @@ -1347,14 +1348,24 @@ static struct sk_buff
> **inet_gro_receive(struct sk_buff **head,
>  			(iph->tos ^ iph2->tos) |
>  			((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
> 
> -		/* Save the IP ID check to be included later when we get to
> -		 * the transport layer so only the inner most IP ID is checked.
> -		 * This is because some GSO/TSO implementations do not
> -		 * correctly increment the IP ID for the outer hdrs.
> -		 */
> -		NAPI_GRO_CB(p)->flush_id =
> -			    ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
>  		NAPI_GRO_CB(p)->flush |= flush;
> +
> +		/* We must save the offset as it is possible to have multiple
> +		 * flows using the same protocol and address pairs so we
> +		 * need to wait until we can validate this is part of the
> +		 * same flow with a 5-tuple or better to avoid unnecessary
> +		 * collisions between flows.  We can support one of two
> +		 * possible scenarios, either a fixed value with DF bit set
> +		 * or an incrementing value with DF either set or unset.
> +		 * In the case of a fixed value we will end up losing the
> +		 * data that the IP ID was a fixed value, however per RFC
> +		 * 6864 in such a case the actual value of the IP ID is
> +		 * meant to be ignored anyway.
> +		 */
> +		flush_id = (u16)(id - ntohs(iph2->id));
> +		if (flush_id || !(iph2->frag_off & htons(IP_DF)))
> +			NAPI_GRO_CB(p)->flush_id |= flush_id ^
> +						    NAPI_GRO_CB(p)->count;
>  	}
> 
>  	NAPI_GRO_CB(skb)->flush |= flush;
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index 82e9f3076028..9aa53f64dffd 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct
> sk_buff **head,
>  		/* flush if Traffic Class fields are different */
>  		NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
>  		NAPI_GRO_CB(p)->flush |= flush;
> -
> -		/* Clear flush_id, there's really no concept of ID in IPv6. */
> -		NAPI_GRO_CB(p)->flush_id = 0;
>  	}
> 
>  	NAPI_GRO_CB(skb)->flush |= flush;

I can see coalescing of packets which have IP ID=0 and DF=1 originating
from an IPv6-to-IPv4 translator.

Tested-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>


* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
  2016-04-05  0:38   ` subashab
@ 2016-04-05  3:44   ` Herbert Xu
  2016-04-05  4:26     ` Alexander Duyck
  1 sibling, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2016-04-05  3:44 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: tom, jesse, alexander.duyck, edumazet, netdev, davem

On Mon, Apr 04, 2016 at 09:31:21AM -0700, Alexander Duyck wrote:
> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
> than fragmentation and reassembly.  Currently we are looking at this field
> as a way of identifying what frames can be aggregated and  which cannot for
> GRO.  While this is valid for frames that do not have DF set, it is invalid
> to do so if the bit is set.

This justification is bogus.  GRO is a completely local optimisation
that should have zero visibility to third parties.  So it makes no
sense to talk about RFC compliance of GRO.  The Linux network stack
as a whole is subject to RFC compliance, but not GRO per se.
 
> In the case of the non-incrementing IP ID we will end up losing the data
> that the IP ID is fixed.  However as per RFC 6864 we should be able to
> write any value into the IP ID when the DF bit is set so this should cause
> minimal harm.

No, we should not do that, at least not by default.  GRO was designed
to be completely lossless; that is its main advantage over the various
forms of LRO which preceded it.

If you lose that, people will start asking for it to be disabled for
routers/bridges and we'll be back in the same old mess that we
had with LRO.

So please do this properly and preserve the information in the packet.
As I said earlier all it takes is one single bit, like we do with ECN.
If you put it in the feature bit you'll also allow us to distinguish
between TSO drivers that produce fixed IDs vs. incrementing IDs.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05  3:44   ` Herbert Xu
@ 2016-04-05  4:26     ` Alexander Duyck
  2016-04-05  4:32       ` Herbert Xu
  0 siblings, 1 reply; 25+ messages in thread
From: Alexander Duyck @ 2016-04-05  4:26 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev,
	David Miller

On Mon, Apr 4, 2016 at 8:44 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, Apr 04, 2016 at 09:31:21AM -0700, Alexander Duyck wrote:
>> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
>> than fragmentation and reassembly.  Currently we are looking at this field
>> as a way of identifying what frames can be aggregated and  which cannot for
>> GRO.  While this is valid for frames that do not have DF set, it is invalid
>> to do so if the bit is set.
>
> This justification is bogus.  GRO is a completely local optimisation
> that should have zero visibility to third parties.  So it makes no
> sense to talk about RFC compliance of GRO.  The Linux network stack
> as a whole is subject to RFC compliance, but not GRO per se.

The problem is that right now we are mangling the IP ID for outer headers
on tunnels.  We end up totally ignoring the delta between the values, so
if you have two flows that get interleaved over the same tunnel, GRO
will currently mash the IP IDs for the two flows so that they end up
overlapping.

The reason why I keep referencing RFC 6864 is that it specifies that
the IP ID field must not be read if the DF bit is set, and that if
we are manipulating headers we can handle the IP ID as though we are
the transmitting station.  What this means is that if DF is not set
we must have unique values per packet, whereas if DF is set we can
ignore the values.
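
In code terms (a standalone sketch, not a kernel excerpt), that rule
reduces to a single test on the wire-format frag_off field:

```c
#include <arpa/inet.h>	/* htons() */
#include <assert.h>
#include <stdint.h>

#define IP_DF 0x4000	/* don't-fragment flag, host order */

/* Per RFC 6864: an atomic datagram (DF set) may have its ID rewritten
 * by anything regenerating headers, while a non-atomic one (DF clear)
 * must keep a unique ID per packet so reassembly still works. */
static int ip_id_must_be_unique(uint16_t frag_off_net)
{
	return !(frag_off_net & htons(IP_DF));
}
```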

>> In the case of the non-incrementing IP ID we will end up losing the data
>> that the IP ID is fixed.  However as per RFC 6864 we should be able to
>> write any value into the IP ID when the DF bit is set so this should cause
>> minimal harm.
>
> No we should not do that, at least not by default.  GRO was designed
> to be completely lossless, that is its main advantage of the various
> forms of LRO which preceded it.

Well, the problem is that it isn't lossless right now.  Instead, in the
case of tunnels it allows you to generate overlapping sequences of IP IDs.

> If you lose that people will start asking it to be disabled for
> routers/bridges and we'll be back in the same old mess that we
> had with LRO.

The question I would have is what are you really losing with increment
from 0 versus fixed 0?  From what I see it is essentially just garbage
in/garbage out.

> So please do this properly and preserve the information in the packet.
> As I said earlier all it takes is one single bit, like we do with ECN.
> If you put it in the feature bit you'll also allow us to distinguish
> between TSO drivers that produce fixed IDs vs. incrementing IDs.

Actually it will take at least 2: one for the outer headers and one
for the inner headers.  I'll also have to add tracking for each to the
GRO code so that we don't merge frames that go from a fixed ID to an
incrementing one or vice versa, since I suspect that is probably
floating around out there too, as my GSO partial code was going to end
up doing some of that.

- Alex


* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05  4:26     ` Alexander Duyck
@ 2016-04-05  4:32       ` Herbert Xu
  2016-04-05 15:07         ` Edward Cree
  2016-04-05 15:52         ` Alexander Duyck
  0 siblings, 2 replies; 25+ messages in thread
From: Herbert Xu @ 2016-04-05  4:32 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev,
	David Miller

On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
> 
> The problem is right now we are mangling the IP ID for outer headers
> on tunnels.  We end up totally ignoring the delta between the values
> so if you have two flows that get interleaved over the same tunnel GRO
> will currently mash the IP IDs for the two tunnels so that they end up
> overlapping.

Then it should be fixed.  I never reviewed those patches or I would
have objected at the time.

> The reason why I keep referencing RFC 6864 is because it specifies
> that the IP ID field must not be read if the DF bit is set, and that
> if we are manipulating headers we can handle the IP ID as though we
> are the transmitting station.  What this means is that if DF is not
> set we have to have unique values per packet, otherwise we can ignore
> the values if DF is set.

As I said GRO itself should not be visible.  The fact that it is
for tunnels is a bug.
 
> The question I would have is what are you really losing with increment
> from 0 versus fixed 0?  From what I see it is essentially just garbage
> in/garbage out.

GRO is meant to be lossless, that is, you should not be able to
detect its presence from the outside.  If you lose information then
you're breaking this rule and people will soon start asking for it
to be disabled in various situations.

I'm not against doing this per se but it should not be part of the
default configuration.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05  4:32       ` Herbert Xu
@ 2016-04-05 15:07         ` Edward Cree
  2016-04-05 15:36           ` Tom Herbert
  2016-04-05 23:45           ` David Miller
  2016-04-05 15:52         ` Alexander Duyck
  1 sibling, 2 replies; 25+ messages in thread
From: Edward Cree @ 2016-04-05 15:07 UTC (permalink / raw)
  To: Herbert Xu, Alexander Duyck
  Cc: Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev,
	David Miller

On 05/04/16 05:32, Herbert Xu wrote:
> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>> The question I would have is what are you really losing with increment
>> from 0 versus fixed 0?  From what I see it is essentially just garbage
>> in/garbage out.
> GRO is meant to be lossless, that is, you should not be able to
> detect its presence from the outside.  If you lose information then
> you're breaking this rule and people will soon start asking for it
> to be disabled in various situations.
>
> I'm not against doing this per se but it should not be part of the
> default configuration.
I'm certainly in favour of this being configurable - indeed IMHO it should
also be possible to configure GRO with the 'looser' semantics of LRO, so
that people who want that can get it without all the horrible "don't confuse
Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it
exists are (a) improved performance from the 'loose' semantics and (b) old
kernels without GRO.  We may not be able to kill (b) but we can certainly
address (a)).

But I don't agree that the default has to be totally lossless; anyone who is
caring about the ID fields in atomic datagrams is breaking the RFCs, and can
be assumed to Know What They're Doing sufficiently to configure this.

On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up
forwarding/routing workloads.  Instead we should be looking into having lists
of SKBs traverse the stack together, splitting the list whenever e.g. the
destination changes.  That seems like it ought to be much more efficient than
rewriting headers twice, once to coalesce a superframe and once to segment it
again - and it also means this worry about GRO being lossless can go away.
But until someone tries implementing skb batches, we won't know for sure if
it works (and I don't have time right now ;)

-Ed


* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 15:07         ` Edward Cree
@ 2016-04-05 15:36           ` Tom Herbert
  2016-04-05 17:06             ` Edward Cree
  2016-04-06  0:04             ` Marcelo Ricardo Leitner
  2016-04-05 23:45           ` David Miller
  1 sibling, 2 replies; 25+ messages in thread
From: Tom Herbert @ 2016-04-05 15:36 UTC (permalink / raw)
  To: Edward Cree
  Cc: Herbert Xu, Alexander Duyck, Alexander Duyck, Jesse Gross,
	Eric Dumazet, Netdev, David Miller

On Tue, Apr 5, 2016 at 12:07 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 05/04/16 05:32, Herbert Xu wrote:
>> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>>> The question I would have is what are you really losing with increment
>>> from 0 versus fixed 0?  From what I see it is essentially just garbage
>>> in/garbage out.
>> GRO is meant to be lossless, that is, you should not be able to
>> detect its presence from the outside.  If you lose information then
>> you're breaking this rule and people will soon start asking for it
>> to be disabled in various situations.
>>
>> I'm not against doing this per se but it should not be part of the
>> default configuration.
> I'm certainly in favour of this being configurable - indeed IMHO it should
> also be possible to configure GRO with the 'looser' semantics of LRO, so
> that people who want that can get it without all the horrible "don't confuse
> Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it
> exists are (a) improved performance from the 'loose' semantics and (b) old
> kernels without GRO.  We may not be able to kill (b) but we can certainly
> address (a)).
>
> But I don't agree that the default has to be totally lossless; anyone who is
> caring about the ID fields in atomic datagrams is breaking the RFCs, and can
> be assumed to Know What They're Doing sufficiently to configure this.
>
> On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up
> forwarding/routing workloads.  Instead we should be looking into having lists
> of SKBs traverse the stack together, splitting the list whenever e.g. the
> destination changes.  That seems like it ought to be much more efficient than
> rewriting headers twice, once to coalesce a superframe and once to segment it
> again - and it also means this worry about GRO being lossless can go away.
> But until someone tries implementing skb batches, we won't know for sure if
> it works (and I don't have time right now ;)
>
Ed,

I thought about that some.  It seems like we would want to do both GRO
and retain all the individual packets in the skb so that we could use
those for forwarding instead of GSO, as I think you're saying.  This
would work great in the plain forwarding case, but one problem is what
to do if the host modifies the super packet (for instance, when
forwarding over a tunnel we might add an encapsulation header).  This
should work in GSO (although we need to address the limitations around
1 encap level), but I'm not sure it is easy if we need to add a header
to each packet in a batch.

Tom



> -Ed


* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05  4:32       ` Herbert Xu
  2016-04-05 15:07         ` Edward Cree
@ 2016-04-05 15:52         ` Alexander Duyck
  2016-04-05 16:30           ` Eric Dumazet
  1 sibling, 1 reply; 25+ messages in thread
From: Alexander Duyck @ 2016-04-05 15:52 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev,
	David Miller

On Mon, Apr 4, 2016 at 9:32 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>>
>> The problem is right now we are mangling the IP ID for outer headers
>> on tunnels.  We end up totally ignoring the delta between the values
>> so if you have two flows that get interleaved over the same tunnel GRO
>> will currently mash the IP IDs for the two tunnels so that they end up
>> overlapping.
>
> Then it should be fixed.  I never reviewed those patches or I would
> have objected at the time.

The problem is that what you are proposing is a much larger change than
I am comfortable submitting as a net patch, so I will probably go back
to targeting net-next.

>> The reason I keep referencing RFC 6864 is that it specifies
>> that the IP ID field must not be read if the DF bit is set, and that
>> if we are manipulating headers we can handle the IP ID as though we
>> are the transmitting station.  What this means is that if DF is not
>> set we have to have unique values per packet, whereas if DF is set
>> we can ignore the values entirely.
>
> As I said GRO itself should not be visible.  The fact that it is
> for tunnels is a bug.

Right.  But the fact that we are even trying to get reliable data out
of IP ID is also a bug.  The fact is the IP ID isn't going to be
linear in the case of tunnels.  There is a good likelihood that the IP
ID will jump around if we are doing encapsulation using a third party
device such as a switch.  That was one of the reasons why my first
implementation just completely ignored the IP ID in the case that the
DF bit is set.  Honestly I am leaning more toward taking the approach
of going back to that implementation and adding a sysctl that would
let you disable it.  Maybe something like net.ipv4.gro.rfc6864 since
that is the RFC that spells out that we should treat IP ID as a
floating input/output if DF is set.  Basically we need to be able to
do that if the GSO partial code is going to work.

>> The question I would have is what are you really losing with increment
>> from 0 versus fixed 0?  From what I see it is essentially just garbage
>> in/garbage out.
>
> GRO is meant to be lossless, that is, you should not be able to
> detect its presence from the outside.  If you lose information then
> you're breaking this rule and people will soon start asking for it
> to be disabled in various situations.

Right.  But in this case we technically aren't losing information.
That is the thing I have been trying to point out with RFC 6864.  It
calls out that you MUST not read IP ID if the DF bit is set.

> I'm not against doing this per se but it should not be part of the
> default configuration.

I disagree; I think it will have to be part of the default
configuration.  The problem is the IP ID is quickly becoming
meaningless.  When you consider that a 40Gb/s link can wrap the IP ID
value nearly 50 times a second using a 1500 MTU, the IP ID field should
just be ignored anyway because you cannot guarantee that it will be
unique without limiting the Tx window size.  That was the whole point
of RFC 6864.  Basically the IP ID field is so small that, as we push
into the higher speeds, you cannot guarantee that the field will have
any meaning, so for any case where you don't need to use it you
shouldn't, because it will likely not provide enough useful data.

- Alex

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 15:52         ` Alexander Duyck
@ 2016-04-05 16:30           ` Eric Dumazet
  2016-04-05 16:45             ` Alexander Duyck
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2016-04-05 16:30 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Herbert Xu, Alexander Duyck, Tom Herbert, Jesse Gross,
	Eric Dumazet, Netdev, David Miller

On Tue, 2016-04-05 at 08:52 -0700, Alexander Duyck wrote:

> 
> I disagree; I think it will have to be part of the default
> configuration.  The problem is the IP ID is quickly becoming
> meaningless.  When you consider that a 40Gb/s link can wrap the IP ID
> value nearly 50 times a second using a 1500 MTU, the IP ID field should
> just be ignored anyway because you cannot guarantee that it will be
> unique without limiting the Tx window size.  That was the whole point
> of RFC 6864.  Basically the IP ID field is so small that, as we push
> into the higher speeds, you cannot guarantee that the field will have
> any meaning, so for any case where you don't need to use it you
> shouldn't, because it will likely not provide enough useful data.

Even though a few flows reach 40Gbit/s, we should remember that the vast
majority of the Internet runs on < 50Mbit/s flows.

I prefer the argument of IPv6 not having ID ;)

We should do our best to keep interoperability, this is the selling
point. 

And quite frankly, your last patch makes perfect sense to me:

The aggregation is done only if the TCP headers of consecutive packets
match.  So who cares about the IPv4 ID, really?
This is a very minor detail.  The possible gains outweigh the
theoretical 'problem'.

GRO already reorders flows; it never had a guarantee of being 'invisible'
as Herbert claims.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 16:30           ` Eric Dumazet
@ 2016-04-05 16:45             ` Alexander Duyck
  0 siblings, 0 replies; 25+ messages in thread
From: Alexander Duyck @ 2016-04-05 16:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, Alexander Duyck, Tom Herbert, Jesse Gross,
	Eric Dumazet, Netdev, David Miller

On Tue, Apr 5, 2016 at 9:30 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-04-05 at 08:52 -0700, Alexander Duyck wrote:
>
>>
>> I disagree; I think it will have to be part of the default
>> configuration.  The problem is the IP ID is quickly becoming
>> meaningless.  When you consider that a 40Gb/s link can wrap the IP ID
>> value nearly 50 times a second using a 1500 MTU, the IP ID field should
>> just be ignored anyway because you cannot guarantee that it will be
>> unique without limiting the Tx window size.  That was the whole point
>> of RFC 6864.  Basically the IP ID field is so small that, as we push
>> into the higher speeds, you cannot guarantee that the field will have
>> any meaning, so for any case where you don't need to use it you
>> shouldn't, because it will likely not provide enough useful data.
>
> Even though a few flows reach 40Gbit/s, we should remember that the vast
> majority of the Internet runs on < 50Mbit/s flows.
>
> I prefer the argument of IPv6 not having ID ;)

Okay, maybe I'll try to use that argument more often then... :-)

> We should do our best to keep interoperability, this is the selling
> point.
>
> And quite frankly, your last patch makes perfect sense to me:

Yes.  It was a compromise, though I might still have to go through and
refine it more.  It might make sense for the IP header associated with
the TCP flow, but for outer headers it actually is worse because we
end up blocking several different possibilities.  What I might need to
do is capture the state of the DF bit as we work a flow up through the
stack and once it is in the list of GRO SKBs use that DF bit as a flag
to indicate if we support an incrementing or fixed pattern for the
values.  That way tunnels can optionally ignore the IP ID if the DF
bit is set since their values may not be as clean as that of TCP.

> The aggregation is done only if the TCP headers of consecutive packets
> match.  So who cares about the IPv4 ID, really?
> This is a very minor detail.  The possible gains outweigh the
> theoretical 'problem'.
>
> GRO already reorders flows; it never had a guarantee of being 'invisible'
> as Herbert claims.

I can see what he is trying to get at.  I just think it is a bit too
strict an interpretation of which values have to be maintained.  My
plan going forward is to add a sysctl that will probably allow us some
wiggle room with regard to the IP ID for GRO and GSO, so that when it is
disabled we will neither perform GSO partial nor allow repeating IP IDs
in GRO on devices that cannot get the IP ID right.

- Alex

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 15:36           ` Tom Herbert
@ 2016-04-05 17:06             ` Edward Cree
  2016-04-05 17:38               ` Tom Herbert
  2016-04-06  0:04             ` Marcelo Ricardo Leitner
  1 sibling, 1 reply; 25+ messages in thread
From: Edward Cree @ 2016-04-05 17:06 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Herbert Xu, Alexander Duyck, Alexander Duyck, Jesse Gross,
	Eric Dumazet, Netdev, David Miller

On 05/04/16 16:36, Tom Herbert wrote:
> I thought about that some. It seems like we would want to do both GRO
> and retain all the individual packets in the skb so that we could use
> those for forwarding instead of GSO as I think you're saying.
I didn't quite mean that, I meant just pass around the skb list, don't
do GRO at all.  The receiving TCP stack ends up just getting called
several times, in quick succession, without I$ loss from network stack
traversal in between.

> This
> would work great in the plain forwarding case, but one problem
> is what to do if the host modifies the super packet (for instance when
> forwarding over a tunnel we might add encapsulation header). This
> should work in GSO (although we need to address the limitations around
> 1 encap level), not sure this is easy if we need to add a header to
> each packet in a batch.
This is indeed a problem with what I was proposing; perhaps the answer
is that as you process these SKB lists you also update something like a
GRO CB, then if you do decide to transform the packets you can coalesce
them at that point.  But doing 'the rest' of GRO might cost as much as
just transforming all the packets, in which case you only win if you want
to transform them multiple times.
And if we assume that we're going to be touching all the headers anyway,
it probably doesn't cost us much to transform all the packets in the list
since our code and data are both hot in cache.  Well, the code is cold
for the first packet in the list, but equally it's cold for the
superpacket in the current approach.

If this is as performant as GRO in the normal (non-forwarding) receive
case (and that's a *big* if, which can only be resolved by trying it), it
might make sense to just not have GRO, while TSO only gets used for
locally-generated sends, and for the case where you're forwarding between
networks with different MTUs (e.g. from a 64k VM netdevice to the wire).

What do you think?  Am I crazy?  (More than my usual?)

-Ed

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 17:06             ` Edward Cree
@ 2016-04-05 17:38               ` Tom Herbert
  0 siblings, 0 replies; 25+ messages in thread
From: Tom Herbert @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Edward Cree
  Cc: Herbert Xu, Alexander Duyck, Alexander Duyck, Jesse Gross,
	Eric Dumazet, Netdev, David Miller

On Tue, Apr 5, 2016 at 2:06 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 05/04/16 16:36, Tom Herbert wrote:
>> I thought about that some. It seems like we would want to do both GRO
>> and retain all the individual packets in the skb so that we could use
>> those for forwarding instead of GSO as I think you're saying.
> I didn't quite mean that, I meant just pass around the skb list, don't
> do GRO at all.  The receiving TCP stack ends up just getting called
> several times, in quick succession, without I$ loss from network stack
> traversal in between.
>
>> This
>> would work great in the plain forwarding case, but one problem
>> is what to do if the host modifies the super packet (for instance when
>> forwarding over a tunnel we might add encapsulation header). This
>> should work in GSO (although we need to address the limitations around
>> 1 encap level), not sure this is easy if we need to add a header to
>> each packet in a batch.
> This is indeed a problem with what I was proposing; perhaps the answer
> is that as you process these SKB lists you also update something like a
> GRO CB, then if you do decide to transform the packets you can coalesce
> them at that point.  But doing 'the rest' of GRO might cost as much as
> just transforming all the packets, in which case you only win if you want
> to transform them multiple times.
> And if we assume that we're going to be touching all the headers anyway,
> it probably doesn't cost us much to transform all the packets in the list
> since our code and data are both hot in cache.  Well, the code is cold
> for the first packet in the list, but equally it's cold for the
> superpacket in the current approach.
>
> If this is as performant as GRO in the normal (non-forwarding) receive
> case (and that's a *big* if, which can only be resolved by trying it), it
> might make sense to just not have GRO, while TSO only gets used for
> locally-generated sends, and for the case where you're forwarding between
> networks with different MTUs (e.g. from a 64k VM netdevice to the wire).
>
> What do you think?  Am I crazy?  (More than my usual?)
>
It's not clear to me how important optimizing the GRO to GSO
forwarding case really is. We definitely need GRO for locally received
packets, but how much benefit do we really get when applying GRO to
forwarding, and does that really outweigh the latency we're adding in
the forwarding path due to GRO? Also, once we implement high
performance forwarding in XDP there really is no need to consider
GRO; we'll do some sort of I/O batching for sure, but that would be
batching packets to the same TX queue, not the same destination.

Tom

> -Ed

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 15:07         ` Edward Cree
  2016-04-05 15:36           ` Tom Herbert
@ 2016-04-05 23:45           ` David Miller
  2016-04-06 11:21             ` Edward Cree
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2016-04-05 23:45 UTC (permalink / raw)
  To: ecree; +Cc: herbert, alexander.duyck, aduyck, tom, jesse, edumazet, netdev

From: Edward Cree <ecree@solarflare.com>
Date: Tue, 5 Apr 2016 16:07:49 +0100

> On the gripping hand, I feel like GRO+TSO is the wrong model for
> speeding up forwarding/routing workloads.  Instead we should be
> looking into having lists of SKBs traverse the stack together,
> splitting the list whenever e.g. the destination changes.

"Destination" is a very complicated beast.  It's not just a
destination IP address.

It's not even just a full saddr/daddr/TOS triplet.

Packets can be forwarded around based upon any key whatsoever in the
headers.  Netfilter can mangle them based upon arbitrary bits in the
packet, as can the packet scheduler classifier actions.

It's therefore not profitable to try this at all; it's completely
pointless unless all the keys match up exactly.

This is why GRO _is_ the proper model to speed this stuff up and do
bulk processing, because it still presents a full "packet" to all
of these layers to mangle, rewrite, route, and do whatever else
however they like.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 15:36           ` Tom Herbert
  2016-04-05 17:06             ` Edward Cree
@ 2016-04-06  0:04             ` Marcelo Ricardo Leitner
  1 sibling, 0 replies; 25+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-04-06  0:04 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Edward Cree, Herbert Xu, Alexander Duyck, Alexander Duyck,
	Jesse Gross, Eric Dumazet, Netdev, David Miller

On Tue, Apr 05, 2016 at 12:36:40PM -0300, Tom Herbert wrote:
> On Tue, Apr 5, 2016 at 12:07 PM, Edward Cree <ecree@solarflare.com> wrote:
> > On 05/04/16 05:32, Herbert Xu wrote:
> >> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
> >>> The question I would have is what are you really losing with increment
> >>> from 0 versus fixed 0?  From what I see it is essentially just garbage
> >>> in/garbage out.
> >> GRO is meant to be lossless, that is, you should not be able to
> >> detect its presence from the outside.  If you lose information then
> >> you're breaking this rule and people will soon start asking for it
> >> to be disabled in various situations.
> >>
> >> I'm not against doing this per se but it should not be part of the
> >> default configuration.
> > I'm certainly in favour of this being configurable - indeed IMHO it should
> > also be possible to configure GRO with the 'looser' semantics of LRO, so
> > that people who want that can get it without all the horrible "don't confuse
> > Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it
> > exists are (a) improved performance from the 'loose' semantics and (b) old
> > kernels without GRO.  We may not be able to kill (b) but we can certainly
> > address (a)).
> >
> > But I don't agree that the default has to be totally lossless; anyone who is
> > caring about the ID fields in atomic datagrams is breaking the RFCs, and can
> > be assumed to Know What They're Doing sufficiently to configure this.
> >
> > On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up
> > forwarding/routing workloads.  Instead we should be looking into having lists
> > of SKBs traverse the stack together, splitting the list whenever e.g. the
> > destination changes.  That seems like it ought to be much more efficient than
> > rewriting headers twice, once to coalesce a superframe and once to segment it
> > again - and it also means this worry about GRO being lossless can go away.
> > But until someone tries implementing skb batches, we won't know for sure if
> > it works (and I don't have time right now ;)
> >
> Ed,
> 
> I thought about that some. It seems like we would want to do both GRO
> and retain all the individual packets in the skb so that we could use
> those for forwarding instead of GSO as I think you're saying. This

Retaining the individual packets would also help to make GRO feasible
for SCTP.  SCTP needs to know where each packet ended because of AUTH
chunks, and we cannot rely on something like gso_size as each original
packet had its own size.

I could do it on the tx side (see my SCTP/GSO RFC patches) using
skb_gro_receive() with a specially crafted header skb, but I'm not
seeing a way to do it on the rx side as I cannot guarantee incoming skbs
will follow that pattern.

  Marcelo

> would work great in the plain forwarding case, but one problem
> is what to do if the host modifies the super packet (for instance when
> forwarding over a tunnel we might add encapsulation header). This
> should work in GSO (although we need to address the limitations around
> 1 encap level), not sure this is easy if we need to add a header to
> each packet in a batch.
> 
> Tom
> 
> 
> 
> > -Ed
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 23:45           ` David Miller
@ 2016-04-06 11:21             ` Edward Cree
  2016-04-06 13:53               ` Tom Herbert
  0 siblings, 1 reply; 25+ messages in thread
From: Edward Cree @ 2016-04-06 11:21 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, alexander.duyck, aduyck, tom, jesse, edumazet, netdev

On 06/04/16 00:45, David Miller wrote:
> From: Edward Cree <ecree@solarflare.com>
> Date: Tue, 5 Apr 2016 16:07:49 +0100
>
>> On the gripping hand, I feel like GRO+TSO is the wrong model for
>> speeding up forwarding/routing workloads.  Instead we should be
>> looking into having lists of SKBs traverse the stack together,
>> splitting the list whenever e.g. the destination changes.
> "Destination" is a very complicated beast.  It's not just a
> destination IP address.
>
> It's not even just a full saddr/daddr/TOS triplet.
>
> Packets can be forwarded around based upon any key whatsoever in the
> headers.  Netfilter can mangle them based upon arbitrary bits in the
> packet, as can the packet scheduler classifier actions.
>
> It's therefore not profitable to try this at all, it's completely
> pointless unless all the keys match up exactly.
Possibly I wasn't completely clear (or maybe I was and I'm just
wrong...), but I meant that _each layer_ in the stack would split the
list whenever it wants to treat two packets differently.  Whether
that's a protocol receive handler, or a netfilter or tc operation.

Obviously if you want to decide at the _beginning_ whether "all the
keys match", then you do essentially need GRO's flow-matching logic.
But even then, I find myself wondering if having GRO coalesce the
segments into a superpacket is really better than having it just make
lists of segments, and have that list traverse the stack as a single
entity.  That way lossless resegmentation remains easy.  But I suppose
that could make life difficult for things like BPF, if they want to
act upon the superframe (because we haven't built it).  If instead
they act on each of the segments, we might get different actions for
each segment and that might also be awkward; so you'd still need this
concept of 'any layer in the stack can decide to split lists up'.

-Ed

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-06 11:21             ` Edward Cree
@ 2016-04-06 13:53               ` Tom Herbert
  2016-04-06 14:26                 ` Edward Cree
  2016-04-06 15:43                 ` David Miller
  0 siblings, 2 replies; 25+ messages in thread
From: Tom Herbert @ 2016-04-06 13:53 UTC (permalink / raw)
  To: Edward Cree
  Cc: David Miller, Herbert Xu, Alexander Duyck, Alex Duyck,
	Jesse Gross, Eric Dumazet, Linux Kernel Network Developers

On Wed, Apr 6, 2016 at 8:21 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 06/04/16 00:45, David Miller wrote:
>> From: Edward Cree <ecree@solarflare.com>
>> Date: Tue, 5 Apr 2016 16:07:49 +0100
>>
>>> On the gripping hand, I feel like GRO+TSO is the wrong model for
>>> speeding up forwarding/routing workloads.  Instead we should be
>>> looking into having lists of SKBs traverse the stack together,
>>> splitting the list whenever e.g. the destination changes.
>> "Destination" is a very complicated beast.  It's not just a
>> destination IP address.
>>
>> It's not even just a full saddr/daddr/TOS triplet.
>>
>> Packets can be forwarded around based upon any key whatsoever in the
>> headers.  Netfilter can mangle them based upon arbitrary bits in the
>> packet, as can the packet scheduler classifier actions.
>>
>> It's therefore not profitable to try this at all, it's completely
>> pointless unless all the keys match up exactly.
> Possibly I wasn't completely clear (or maybe I was and I'm just
> wrong...), but I meant that _each layer_ in the stack would split the
> list whenever it wants to treat two packets differently.  Whether
> that's a protocol receive handler, or a netfilter or tc operation.
>
> Obviously if you want to decide at the _beginning_ whether "all the
> keys match", then you do essentially need GRO's flow-matching logic.
> But even then, I find myself wondering if having GRO coalesce the
> segments into a superpacket is really better than having it just make
> lists of segments, and have that list traverse the stack as a single
> entity.  That way lossless resegmentation remains easy.  But I suppose
> that could make life difficult for things like BPF, if they want to
> act upon the superframe (because we haven't built it).  If instead
> they act on each of the segments, we might get different actions for
> each segment and that might also be awkward; so you'd still need this
> concept of 'any layer in the stack can decide to split lists up'.
>
But again, this scheme is optimizing for the forwarding case and doesn't
help (and probably hurts) the use case of locally terminated
connections, which I would claim is more important.  Packets that are
forwarded really should not be GRO'ed in the first place because of
the loss of information and added latency.  The difficulty is that we
don't currently make the forwarding decision before GRO; if we can
change this to decide earlier whether packets are to be forwarded, then
we can avoid doing GRO for those.

Tom

> -Ed

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-06 13:53               ` Tom Herbert
@ 2016-04-06 14:26                 ` Edward Cree
  2016-04-06 15:39                   ` Eric Dumazet
  2016-04-06 15:43                 ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Edward Cree @ 2016-04-06 14:26 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Herbert Xu, Alexander Duyck, Alex Duyck,
	Jesse Gross, Eric Dumazet, Linux Kernel Network Developers

On 06/04/16 14:53, Tom Herbert wrote:
> But again, this scheme is optimizing for forwarding case and doesn't
> help (and probably hurts) the use case of locally terminated
> connections
Not really.  I think this has a chance to outperform GRO for locally
terminated connections, for a number of reasons:
* Doesn't look at higher-layer or inner headers until later in packet
  processing, for instance we (maybe) process every L3 header in a NAPI poll
  before looking at a single L4.  This could delay touching the second
  cacheline of some packets.
* Doesn't have to re-write headers to produce a coherent superframe.
  Instead, each segment carries its original headers around with it.  Also
  means we can skip _checking_ some headers to see if we're 'allowed' to
  coalesce (e.g. TCP TS differences, and the current fun with IP IDs).
* Can be used for protocols like UDP where the original packet boundaries
  are important (so you can't coalesce into a superframe).
Really the last of those was the original reason for this idea, helping with
forwarding is just another nice bonus that we (might) get from it.
And it's all speculative and I don't know for sure what the performance
would be like, but I won't know until I try it!
> which I would claim is more important.
No argument there :-)
> Packets that are
> forwarded really should not be GRO'ed in the first place because of
> the loss of information and added latency. The difficulty is that we
> don't currently make the forwarding decision before GRO,  if we can
> change this to decide whether packets are to be forwarded earlier then
> we can avoid doing GRO for those.
That certainly would be nice, XDP is exciting and I look forward to it.

-Ed

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-06 14:26                 ` Edward Cree
@ 2016-04-06 15:39                   ` Eric Dumazet
  2016-04-06 15:55                     ` Edward Cree
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2016-04-06 15:39 UTC (permalink / raw)
  To: Edward Cree
  Cc: Tom Herbert, David Miller, Herbert Xu, Alexander Duyck,
	Alex Duyck, Jesse Gross, Eric Dumazet,
	Linux Kernel Network Developers

On Wed, 2016-04-06 at 15:26 +0100, Edward Cree wrote:
> On 06/04/16 14:53, Tom Herbert wrote:
> > But again, this scheme is optimizing for forwarding case and doesn't
> > help (and probably hurts) the use case of locally terminated
> > connections
> Not really.  I think this has a chance to outperform GRO for locally
> terminated connections, for a number of reasons:
> * Doesn't look at higher-layer or inner headers until later in packet
>   processing, for instance we (maybe) process every L3 header in a NAPI poll
>   before looking at a single L4.  This could delay touching the second
>   cacheline of some packets.
> * Doesn't have to re-write headers to produce a coherent superframe.
>   Instead, each segment carries its original headers around with it.  Also
>   means we can skip _checking_ some headers to see if we're 'allowed' to
>   coalesce (e.g. TCP TS differences, and the current fun with IP IDs).
> * Can be used for protocols like UDP where the original packet boundaries
>   are important (so you can't coalesce into a superframe).
> Really the last of those was the original reason for this idea, helping with
> forwarding is just another nice bonus that we (might) get from it.
> And it's all speculative and I don't know for sure what the performance
> would be like, but I won't know until I try it!

Look at the mess of some helpers in net/core/skbuff.c, and imagine the
super mess it would be if we used a concept of 'super packet with various
headers on each segment'.  netfilter is already complex; it would become
a nightmare.

GRO on the other hand presents one virtual set of headers, then the
payload in one or multiple frags.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-06 13:53               ` Tom Herbert
  2016-04-06 14:26                 ` Edward Cree
@ 2016-04-06 15:43                 ` David Miller
  2016-04-06 17:42                   ` Tom Herbert
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2016-04-06 15:43 UTC (permalink / raw)
  To: tom; +Cc: ecree, herbert, alexander.duyck, aduyck, jesse, edumazet, netdev

From: Tom Herbert <tom@herbertland.com>
Date: Wed, 6 Apr 2016 10:53:52 -0300

> Packets that are forwarded really should not be GRO'ed in the first
> place because of the loss of information and added latency.

First of all GRO is supposed to be lossless, so please stop saying this
would be a reason to turn it off on a router.

Second of all, the biggest piece of overhead is the routing lookup,
therefore GRO batching helps enormously with routing workloads, and
therefore is appropriate to be enabled on routers.

Yes, I agree that for locally terminated stuff it helps more, but don't
turn this into a "GRO on routers, meh..." type argument.  It simply is
not true at all.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-06 15:39                   ` Eric Dumazet
@ 2016-04-06 15:55                     ` Edward Cree
  2016-04-06 16:03                       ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Edward Cree @ 2016-04-06 15:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, David Miller, Herbert Xu, Alexander Duyck,
	Alex Duyck, Jesse Gross, Eric Dumazet,
	Linux Kernel Network Developers

On 06/04/16 16:39, Eric Dumazet wrote:
> Look at the mess of some helpers in net/core/skbuff.c, and imagine the
> super mess it would be if using a concept of 'super packet with various
> headers on each segment'.
Maybe I'm still not explaining this very well, but there is _no_ concept of
'super packet [anything]' in this idea.  There is just 'list of skbs that
were all received in the same NAPI poll, and have not yet been determined to
be different'.

Any layer that doesn't want to deal with this stuff will always have the
option of "while (skb = skb_dequeue(list)) my_normal_receive_function(skb);"
and in fact I'd make that happen by default for anything that hadn't
registered a function to take a list.
> netfilter is already complex, it would become a nightmare.
A netfilter hook could, for instance, run on each packet in the list, then
partition the list into sub-lists of packets that all had the same verdict
(letting go of any that were DROP or STOLEN).  That doesn't seem like it
should be nightmarish.

-Ed

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-06 15:55                     ` Edward Cree
@ 2016-04-06 16:03                       ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2016-04-06 16:03 UTC (permalink / raw)
  To: Edward Cree
  Cc: Tom Herbert, David Miller, Herbert Xu, Alexander Duyck,
	Alex Duyck, Jesse Gross, Eric Dumazet,
	Linux Kernel Network Developers

On Wed, 2016-04-06 at 16:55 +0100, Edward Cree wrote:
> On 06/04/16 16:39, Eric Dumazet wrote:
> > Look at the mess of some helpers in net/core/skbuff.c, and imagine the
> > super mess it would be if using a concept of 'super packet with various
> > headers on each segment'.
> Maybe I'm still not explaining this very well, but there is _no_ concept of
> 'super packet [anything]' in this idea.  There is just 'list of skbs that
> were all received in the same NAPI poll, and have not yet been determined to
> be different'.
> 
> Any layer that doesn't want to deal with this stuff will always have the
> option of "while (skb = skb_dequeue(list)) my_normal_receive_function(skb);"
> and in fact I'd make that happen by default for anything that hadn't
> registered a function to take a list.
> > netfilter is already complex, it would become a nightmare.
> A netfilter hook could, for instance, run on each packet in the list, then
> partition the list into sub-lists of packets that all had the same verdict
> (letting go of any that were DROP or STOLEN).  That doesn't seem like it
> should be nightmarish.

Okay, I will let you try this alone ;)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-06 15:43                 ` David Miller
@ 2016-04-06 17:42                   ` Tom Herbert
  2016-04-06 19:30                     ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Tom Herbert @ 2016-04-06 17:42 UTC (permalink / raw)
  To: David Miller
  Cc: Edward Cree, Herbert Xu, Alexander Duyck, Alex Duyck,
	Jesse Gross, Eric Dumazet, Linux Kernel Network Developers

On Wed, Apr 6, 2016 at 12:43 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Wed, 6 Apr 2016 10:53:52 -0300
>
>> Packets that are forwarded really should not be GRO'ed in the first
>> place because of the loss of information and added latency.
>
> First of all GRO is supposed to be lossless, so please stop saying this
> would be a reason to turn it off on a router.
>
> Second of all, the biggest piece of overhead is the routing lookup,
> therefore GRO batching helps enormously with routing workloads, and
> therefore is appropriate to be enabled on routers.
>
> Yes, I agree that for locally terminated stuff it helps more, but don't
> turn this into a "GRO on routers, meh..." type argument.  It simply is
> not true at all.
>
GRO on routers will help in the limited case where there is little load
and the traffic is nicely groomed high-throughput TCP connections. But for
routers under significant load, handling large quantities of other
protocols like UDP, GRO is not necessarily helpful and provides a
nondeterministic performance improvement. For instance, I cannot
provision a router on the assumption that GRO will be effective for
any % of packets to save any % of CPU; we need to provision based
purely on the ability to forward by pps, assuming no benefit from GRO.
Technically, this provisioning argument applies to end hosts also. We
have seen real cases on video servers where servers were
mis-provisioned on the assumption that GSO is always effective. So
while we're getting good GSO benefits we might be using something like
80% CPU, but if some connectivity event occurs that forces the cwnds
of all the active connections to drop, we find we need 110% of CPU to
recover.

This discussion is relevant because there is a big push now to replace
dedicated HW with commodity HW running Linux; this has already happened
significantly in load balancers, but I expect it to extend to even
some cases of basic switching. Besides, I seriously doubt you'll find
any commercial router in the world that does GRO.

Yes, GRO needs to work correctly in all cases, but for system
performance in routing we are bound by per-packet actions like route
lookup. As I said, optimizing GRO as Edward suggests for the
forwarding case seems to have diminishing benefits.

Tom

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-06 17:42                   ` Tom Herbert
@ 2016-04-06 19:30                     ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2016-04-06 19:30 UTC (permalink / raw)
  To: tom; +Cc: ecree, herbert, alexander.duyck, aduyck, jesse, edumazet, netdev

From: Tom Herbert <tom@herbertland.com>
Date: Wed, 6 Apr 2016 14:42:26 -0300

> On Wed, Apr 6, 2016 at 12:43 PM, David Miller <davem@davemloft.net> wrote:
>> From: Tom Herbert <tom@herbertland.com>
>> Date: Wed, 6 Apr 2016 10:53:52 -0300
>>
>>> Packets that are forwarded really should not be GRO'ed in the first
>>> place because of the loss of information and added latency.
>>
>> First of all GRO is supposed to be lossless, so please stop saying this
>> would be a reason to turn it off on a router.
>>
>> Second of all, the biggest piece of overhead is the routing lookup,
>> therefore GRO batching helps enormously with routing workloads, and
>> therefore is appropriate to be enabled on routers.
>>
>> Yes, I agree that for locally terminated stuff it helps more, but don't
>> turn this into a "GRO on routers, meh..." type argument.  It simply is
>> not true at all.
>>
> GRO on routers will help in the limited case where there is little load
> and the traffic is nicely groomed high-throughput TCP connections. But for
> routers under significant load, handling large quantities of other
> protocols like UDP, GRO is not necessarily helpful and provides a
> nondeterministic performance improvement. For instance, I cannot
> provision a router on the assumption that GRO will be effective for
> any % of packets to save any % of CPU; we need to provision based
> purely on the ability to forward by pps, assuming no benefit from GRO.

Just because you cannot predict how effective a facility will be,
that doesn't mean you shouldn't use it at all.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2016-04-06 19:30 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-04 16:27 [net PATCH v2 0/2] Fixes for GRO and GRE tunnels Alexander Duyck
2016-04-04 16:28 ` [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU Alexander Duyck
2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
2016-04-05  0:38   ` subashab
2016-04-05  3:44   ` Herbert Xu
2016-04-05  4:26     ` Alexander Duyck
2016-04-05  4:32       ` Herbert Xu
2016-04-05 15:07         ` Edward Cree
2016-04-05 15:36           ` Tom Herbert
2016-04-05 17:06             ` Edward Cree
2016-04-05 17:38               ` Tom Herbert
2016-04-06  0:04             ` Marcelo Ricardo Leitner
2016-04-05 23:45           ` David Miller
2016-04-06 11:21             ` Edward Cree
2016-04-06 13:53               ` Tom Herbert
2016-04-06 14:26                 ` Edward Cree
2016-04-06 15:39                   ` Eric Dumazet
2016-04-06 15:55                     ` Edward Cree
2016-04-06 16:03                       ` Eric Dumazet
2016-04-06 15:43                 ` David Miller
2016-04-06 17:42                   ` Tom Herbert
2016-04-06 19:30                     ` David Miller
2016-04-05 15:52         ` Alexander Duyck
2016-04-05 16:30           ` Eric Dumazet
2016-04-05 16:45             ` Alexander Duyck
