* [net PATCH v2 0/2] Fixes for GRO and GRE tunnels
@ 2016-04-04 16:27 Alexander Duyck
  2016-04-04 16:28 ` [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU Alexander Duyck
  2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
  0 siblings, 2 replies; 25+ messages in thread
From: Alexander Duyck @ 2016-04-04 16:27 UTC (permalink / raw)
To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem

This pair of patches addresses a few issues I have discovered over the
last week or so concerning GRO and GRE tunnels.

The first patch addresses an item I had called out as an issue with
FOU/GUE encapsulating GRE.  I finally had a chance to test it and
verified that the code concerned was broken, so I took the opportunity
to fix it so that we cannot generate a FOU/GUE frame that encapsulates
a GRE tunnel with checksum while requesting TSO/GSO for the frame.

The second patch addresses something I realized was an issue if we feed
a tunnel through GRO and back out through GSO.  Specifically, it was
possible for GRO to generate overlapping IPv4 ID ranges because the
outer IP IDs were being ignored for tunnels.  Ignoring the IP IDs like
this is only valid if the DF bit is set.  That is normally the case for
IPIP, SIT, and GRE tunnels, but not for UDP tunnels.  In the case that
the DF bit is not set we store off the fact that there was a delta from
what we were expecting, and when we hit the inner-most header we
validate the value so as to avoid generating a frame which could lead
to an IP ID collision on packets that could eventually be fragmented.
A side effect is that the inner-most IP ID test is relaxed as well, but
the worst-case scenario is that we GRO a frame with a throw-away ID
sequence anyway, so if anything, segmenting such a frame with the wrong
IP IDs should have no negative effects.

v2: Updated the RFC 6864 patch so that we only support one of two
    modes.  Either the IP ID is incrementing, or it is fixed at a
    specific value with the DF bit set.

---

Alexander Duyck (2):
      GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
      ipv4/GRO: Make GRO conform to RFC 6864

 include/linux/netdevice.h |  5 ++++-
 net/core/dev.c            |  2 ++
 net/ipv4/af_inet.c        | 25 ++++++++++++++++++-------
 net/ipv4/fou.c            |  6 ++++++
 net/ipv4/gre_offload.c    |  8 ++++++++
 net/ipv4/ip_gre.c         | 13 ++++++++++---
 net/ipv6/ip6_offload.c    |  3 ---

 7 files changed, 48 insertions(+), 14 deletions(-)

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
  2016-04-04 16:27 [net PATCH v2 0/2] Fixes for GRO and GRE tunnels Alexander Duyck
@ 2016-04-04 16:28 ` Alexander Duyck
  2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
  1 sibling, 0 replies; 25+ messages in thread
From: Alexander Duyck @ 2016-04-04 16:28 UTC (permalink / raw)
To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem

This patch fixes an issue I found in which we were dropping frames if
we had enabled checksums on GRE headers that were encapsulated by
either FOU or GUE.  Without this patch I was barely able to get
1 Gb/s of throughput; with it applied I am now getting at least around
6 Gb/s.

The issue is that with FOU or GUE applied we do not provide a transport
offset pointing to the GRE header, nor do we offload it in software, as
the GRE header is completely skipped by GSO and treated like a VXLAN-
or GENEVE-type header.  As such we need to prevent the stack from
generating such frames, and also prevent GRE from advertising the
offload on any interface we create.
Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 include/linux/netdevice.h |  5 ++++-
 net/core/dev.c            |  1 +
 net/ipv4/fou.c            |  6 ++++++
 net/ipv4/gre_offload.c    |  8 ++++++++
 net/ipv4/ip_gre.c         | 13 ++++++++++---
 5 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb0d5d09c2e4..8395308a2445 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2120,7 +2120,10 @@ struct napi_gro_cb {
 	/* Used in foo-over-udp, set in udp[46]_gro_receive */
 	u8	is_ipv6:1;

-	/* 7 bit hole */
+	/* Used in GRE, set in fou/gue_gro_receive */
+	u8	is_fou:1;
+
+	/* 6 bit hole */

 	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
 	__wsum	csum;
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe77d913..77a71cd68535 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4439,6 +4439,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 		NAPI_GRO_CB(skb)->flush = 0;
 		NAPI_GRO_CB(skb)->free = 0;
 		NAPI_GRO_CB(skb)->encap_mark = 0;
+		NAPI_GRO_CB(skb)->is_fou = 0;
 		NAPI_GRO_CB(skb)->gro_remcsum_start = 0;

 		/* Setup for GRO checksum validation */
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 5a94aea280d3..a39068b4a4d9 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -203,6 +203,9 @@ static struct sk_buff **fou_gro_receive(struct sk_buff **head,
 	 */
 	NAPI_GRO_CB(skb)->encap_mark = 0;

+	/* Flag this frame as already having an outer encap header */
+	NAPI_GRO_CB(skb)->is_fou = 1;
+
 	rcu_read_lock();
 	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[proto]);
@@ -368,6 +371,9 @@ static struct sk_buff **gue_gro_receive(struct sk_buff **head,
 	 */
 	NAPI_GRO_CB(skb)->encap_mark = 0;

+	/* Flag this frame as already having an outer encap header */
+	NAPI_GRO_CB(skb)->is_fou = 1;
+
 	rcu_read_lock();
 	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[guehdr->proto_ctype]);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index c47539d04b88..6a5bd4317866 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -150,6 +150,14 @@ static struct sk_buff **gre_gro_receive(struct sk_buff **head,
 	if ((greh->flags & ~(GRE_KEY|GRE_CSUM)) != 0)
 		goto out;

+	/* We can only support GRE_CSUM if we can track the location of
+	 * the GRE header.  In the case of FOU/GUE we cannot because the
+	 * outer UDP header displaces the GRE header leaving us in a state
+	 * of limbo.
+	 */
+	if ((greh->flags & GRE_CSUM) && NAPI_GRO_CB(skb)->is_fou)
+		goto out;
+
 	type = greh->protocol;

 	rcu_read_lock();
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 31936d387cfd..af5d1f38217f 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -862,9 +862,16 @@ static void __gre_tunnel_init(struct net_device *dev)
 	dev->hw_features	|= GRE_FEATURES;

 	if (!(tunnel->parms.o_flags & TUNNEL_SEQ)) {
-		/* TCP offload with GRE SEQ is not supported. */
-		dev->features    |= NETIF_F_GSO_SOFTWARE;
-		dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+		/* TCP offload with GRE SEQ is not supported, nor
+		 * can we support 2 levels of outer headers requiring
+		 * an update.
+		 */
+		if (!(tunnel->parms.o_flags & TUNNEL_CSUM) ||
+		    (tunnel->encap.type == TUNNEL_ENCAP_NONE)) {
+			dev->features    |= NETIF_F_GSO_SOFTWARE;
+			dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+		}
+
 		/* Can use a lockless transmit, unless we generate
 		 * output sequences
 		 */

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-04 16:27 [net PATCH v2 0/2] Fixes for GRO and GRE tunnels Alexander Duyck
  2016-04-04 16:28 ` [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU Alexander Duyck
@ 2016-04-04 16:31 ` Alexander Duyck
  2016-04-05  0:38   ` subashab
  2016-04-05  3:44   ` Herbert Xu
  1 sibling, 2 replies; 25+ messages in thread
From: Alexander Duyck @ 2016-04-04 16:31 UTC (permalink / raw)
To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem

RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes
other than fragmentation and reassembly.  Currently we are looking at
this field as a way of identifying which frames can be aggregated for
GRO and which cannot.  While this is valid for frames that do not have
DF set, it is invalid to do so if the bit is set.

In addition, we were generating IPv4 ID collisions when 2 or more flows
were interleaved over the same tunnel.  To prevent that, we store the
result of all IP ID checks via a "|=" instead of overwriting previous
values.

With this patch we support two different approaches for the IP ID
field.  The first is a non-incrementing IP ID with the DF bit set.  In
such a case we simply won't write to the flush_id field in the GRO
context block.  The other option is the legacy one, in which the IP ID
must increment by 1 for every packet we aggregate.

In the case of the non-incrementing IP ID we will end up losing the
information that the IP ID was fixed.  However, per RFC 6864 we should
be able to write any value into the IP ID when the DF bit is set, so
this should cause minimal harm.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---

v2: Updated the patch so that we now only support one of two options.
    Either the IP ID is fixed with the DF bit set, or the IP ID is
    incrementing.  That allows us to support the fixed ID case, as
    occurs with IPv6-to-IPv4 header translation and what is likely
    already out there for some devices with tunnel headers.
 net/core/dev.c         |  1 +
 net/ipv4/af_inet.c     | 25 ++++++++++++++++++-------
 net/ipv6/ip6_offload.c |  3 ---
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 77a71cd68535..3429632398a4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
 		unsigned long diffs;

 		NAPI_GRO_CB(p)->flush = 0;
+		NAPI_GRO_CB(p)->flush_id = 0;

 		if (hash != skb_get_hash_raw(p)) {
 			NAPI_GRO_CB(p)->same_flow = 0;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbae..33f6335448a2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1324,6 +1324,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,

 	for (p = *head; p; p = p->next) {
 		struct iphdr *iph2;
+		u16 flush_id;

 		if (!NAPI_GRO_CB(p)->same_flow)
 			continue;
@@ -1347,14 +1348,24 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 			(iph->tos ^ iph2->tos) |
 			((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));

-		/* Save the IP ID check to be included later when we get to
-		 * the transport layer so only the inner most IP ID is checked.
-		 * This is because some GSO/TSO implementations do not
-		 * correctly increment the IP ID for the outer hdrs.
-		 */
-		NAPI_GRO_CB(p)->flush_id =
-			    ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
 		NAPI_GRO_CB(p)->flush |= flush;
+
+		/* We must save the offset as it is possible to have multiple
+		 * flows using the same protocol and address pairs so we
+		 * need to wait until we can validate this is part of the
+		 * same flow with a 5-tuple or better to avoid unnecessary
+		 * collisions between flows.  We can support one of two
+		 * possible scenarios, either a fixed value with DF bit set
+		 * or an incrementing value with DF either set or unset.
+		 * In the case of a fixed value we will end up losing the
+		 * data that the IP ID was a fixed value, however per RFC
+		 * 6864 in such a case the actual value of the IP ID is
+		 * meant to be ignored anyway.
+		 */
+		flush_id = (u16)(id - ntohs(iph2->id));
+		if (flush_id || !(iph2->frag_off & htons(IP_DF)))
+			NAPI_GRO_CB(p)->flush_id |= flush_id ^
+						    NAPI_GRO_CB(p)->count;
 	}

 	NAPI_GRO_CB(skb)->flush |= flush;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 82e9f3076028..9aa53f64dffd 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head,
 		/* flush if Traffic Class fields are different */
 		NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
 		NAPI_GRO_CB(p)->flush |= flush;
-
-		/* Clear flush_id, there's really no concept of ID in IPv6. */
-		NAPI_GRO_CB(p)->flush_id = 0;
 	}

 	NAPI_GRO_CB(skb)->flush |= flush;

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
@ 2016-04-05  0:38 ` subashab
  2016-04-05  3:44 ` Herbert Xu
  1 sibling, 0 replies; 25+ messages in thread
From: subashab @ 2016-04-05  0:38 UTC (permalink / raw)
To: Alexander Duyck
Cc: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem, netdev-owner

On 2016-04-04 10:31, Alexander Duyck wrote:
> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes
> other than fragmentation and reassembly.  Currently we are looking at
> this field as a way of identifying what frames can be aggregated and
> which cannot for GRO.  While this is valid for frames that do not have
> DF set, it is invalid to do so if the bit is set.
>
> [...]

I can see coalescing of packets which have IP ID=0 and DF=1 originating
from an IPv6 to IPv4 translator.

Tested-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
  2016-04-05  0:38   ` subashab
@ 2016-04-05  3:44   ` Herbert Xu
  2016-04-05  4:26     ` Alexander Duyck
  1 sibling, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2016-04-05  3:44 UTC (permalink / raw)
To: Alexander Duyck; +Cc: tom, jesse, alexander.duyck, edumazet, netdev, davem

On Mon, Apr 04, 2016 at 09:31:21AM -0700, Alexander Duyck wrote:
> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
> than fragmentation and reassembly.  Currently we are looking at this field
> as a way of identifying what frames can be aggregated and which cannot for
> GRO.  While this is valid for frames that do not have DF set, it is invalid
> to do so if the bit is set.

This justification is bogus.  GRO is a completely local optimisation
that should have zero visibility to third parties.  So it makes no
sense to talk about RFC compliance of GRO.  The Linux network stack as
a whole is subject to RFC compliance, but not GRO per se.

> In the case of the non-incrementing IP ID we will end up losing the data
> that the IP ID is fixed.  However as per RFC 6864 we should be able to
> write any value into the IP ID when the DF bit is set so this should cause
> minimal harm.

No, we should not do that, at least not by default.  GRO was designed
to be completely lossless; that is its main advantage over the various
forms of LRO which preceded it.

If you lose that, people will start asking for it to be disabled for
routers/bridges and we'll be back in the same old mess that we had
with LRO.

So please do this properly and preserve the information in the packet.
As I said earlier, all it takes is one single bit, like we do with ECN.
If you put it in the feature bits you'll also allow us to distinguish
between TSO drivers that produce fixed IDs vs. incrementing IDs.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05  3:44 ` Herbert Xu
@ 2016-04-05  4:26   ` Alexander Duyck
  2016-04-05  4:32     ` Herbert Xu
  0 siblings, 1 reply; 25+ messages in thread
From: Alexander Duyck @ 2016-04-05  4:26 UTC (permalink / raw)
To: Herbert Xu
Cc: Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev, David Miller

On Mon, Apr 4, 2016 at 8:44 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, Apr 04, 2016 at 09:31:21AM -0700, Alexander Duyck wrote:
>> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
>> than fragmentation and reassembly.  Currently we are looking at this field
>> as a way of identifying what frames can be aggregated and which cannot for
>> GRO.  While this is valid for frames that do not have DF set, it is invalid
>> to do so if the bit is set.
>
> This justification is bogus.  GRO is a completely local optimisation
> that should have zero visibility to third parties.  So it makes no
> sense to talk about RFC compliance of GRO.  The Linux network stack
> as a whole is subject to RFC compliance, but not GRO per se.

The problem is that right now we are mangling the IP ID for outer
headers on tunnels.  We end up totally ignoring the delta between the
values, so if you have two flows that get interleaved over the same
tunnel, GRO will currently mash the IP IDs for the two flows so that
they end up overlapping.

The reason why I keep referencing RFC 6864 is because it specifies that
the IP ID field must not be read if the DF bit is set, and that if we
are manipulating headers we can handle the IP ID as though we are the
transmitting station.  What this means is that if DF is not set we have
to have unique values per packet, whereas we can ignore the values if
DF is set.

>> In the case of the non-incrementing IP ID we will end up losing the data
>> that the IP ID is fixed.  However as per RFC 6864 we should be able to
>> write any value into the IP ID when the DF bit is set so this should cause
>> minimal harm.
>
> No, we should not do that, at least not by default.  GRO was designed
> to be completely lossless; that is its main advantage over the various
> forms of LRO which preceded it.

Well, the problem is it isn't lossless right now.  Instead, in the case
of tunnels it allows you to generate overlapping sequences of IP IDs.

> If you lose that, people will start asking for it to be disabled for
> routers/bridges and we'll be back in the same old mess that we had
> with LRO.

The question I would have is what are you really losing with an
increment from 0 versus a fixed 0?  From what I see it is essentially
just garbage in/garbage out.

> So please do this properly and preserve the information in the packet.
> As I said earlier, all it takes is one single bit, like we do with ECN.
> If you put it in the feature bits you'll also allow us to distinguish
> between TSO drivers that produce fixed IDs vs. incrementing IDs.

Actually it will take at least 2: one for the outer headers and one for
the inner headers.  I'll also have to add tracking for each to the GRO
code so that we don't merge frames that go from a fixed ID to an
incrementing one or vice versa, since I suspect that is probably
floating around out there too, as my GSO partial code was going to end
up doing some of that.

- Alex

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05  4:26 ` Alexander Duyck
@ 2016-04-05  4:32   ` Herbert Xu
  2016-04-05 15:07     ` Edward Cree
  2016-04-05 15:52     ` Alexander Duyck
  0 siblings, 2 replies; 25+ messages in thread
From: Herbert Xu @ 2016-04-05  4:32 UTC (permalink / raw)
To: Alexander Duyck
Cc: Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev, David Miller

On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>
> The problem is right now we are mangling the IP ID for outer headers
> on tunnels.  We end up totally ignoring the delta between the values
> so if you have two flows that get interleaved over the same tunnel GRO
> will currently mash the IP IDs for the two flows so that they end up
> overlapping.

Then it should be fixed.  I never reviewed those patches or I would
have objected at the time.

> The reason why I keep referencing RFC 6864 is because it specifies
> that the IP ID field must not be read if the DF bit is set, and that
> if we are manipulating headers we can handle the IP ID as though we
> are the transmitting station.  What this means is that if DF is not
> set we have to have unique values per packet, otherwise we can ignore
> the values if DF is set.

As I said, GRO itself should not be visible.  The fact that it is for
tunnels is a bug.

> The question I would have is what are you really losing with increment
> from 0 versus fixed 0?  From what I see it is essentially just garbage
> in/garbage out.

GRO is meant to be lossless; that is, you should not be able to detect
its presence from the outside.  If you lose information then you're
breaking this rule, and people will soon start asking for it to be
disabled in various situations.

I'm not against doing this per se, but it should not be part of the
default configuration.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05  4:32 ` Herbert Xu
@ 2016-04-05 15:07   ` Edward Cree
  2016-04-05 15:36     ` Tom Herbert
  2016-04-05 23:45     ` David Miller
  1 sibling, 2 replies; 25+ messages in thread
From: Edward Cree @ 2016-04-05 15:07 UTC (permalink / raw)
To: Herbert Xu, Alexander Duyck
Cc: Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev, David Miller

On 05/04/16 05:32, Herbert Xu wrote:
> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>> The question I would have is what are you really losing with increment
>> from 0 versus fixed 0?  From what I see it is essentially just garbage
>> in/garbage out.
> GRO is meant to be lossless, that is, you should not be able to
> detect its presence from the outside.  If you lose information then
> you're breaking this rule and people will soon start asking for it
> to be disabled in various situations.
>
> I'm not against doing this per se but it should not be part of the
> default configuration.

I'm certainly in favour of this being configurable - indeed IMHO it
should also be possible to configure GRO with the 'looser' semantics of
LRO, so that people who want that can get it without all the horrible
"don't confuse Slow Start" hacks, and so that LRO can go away.  (AIUI
the only reasons LRO still exists are (a) improved performance from the
'loose' semantics and (b) old kernels without GRO.  We may not be able
to kill (b), but we can certainly address (a).)

But I don't agree that the default has to be totally lossless; anyone
who cares about the ID fields in atomic datagrams is breaking the RFCs,
and can be assumed to Know What They're Doing sufficiently to configure
this.

On the gripping hand, I feel like GRO+TSO is the wrong model for
speeding up forwarding/routing workloads.  Instead we should be looking
into having lists of SKBs traverse the stack together, splitting the
list whenever e.g. the destination changes.  That seems like it ought
to be much more efficient than rewriting headers twice (once to
coalesce a superframe and once to segment it again), and it also means
this worry about GRO being lossless can go away.  But until someone
tries implementing skb batches, we won't know for sure if it works (and
I don't have time right now ;)

-Ed

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 15:07 ` Edward Cree
@ 2016-04-05 15:36   ` Tom Herbert
  2016-04-05 17:06     ` Edward Cree
  2016-04-06  0:04     ` Marcelo Ricardo Leitner
  1 sibling, 2 replies; 25+ messages in thread
From: Tom Herbert @ 2016-04-05 15:36 UTC (permalink / raw)
To: Edward Cree
Cc: Herbert Xu, Alexander Duyck, Alexander Duyck, Jesse Gross, Eric Dumazet, Netdev, David Miller

On Tue, Apr 5, 2016 at 12:07 PM, Edward Cree <ecree@solarflare.com> wrote:
> [...]
>
> On the gripping hand, I feel like GRO+TSO is the wrong model for
> speeding up forwarding/routing workloads.  Instead we should be looking
> into having lists of SKBs traverse the stack together, splitting the
> list whenever e.g. the destination changes.  That seems like it ought
> to be much more efficient than rewriting headers twice, once to
> coalesce a superframe and once to segment it again - and it also means
> this worry about GRO being lossless can go away.  But until someone
> tries implementing skb batches, we won't know for sure if it works
> (and I don't have time right now ;)

Ed, I thought about that some.  It seems like we would want to do both
GRO and retain all the individual packets in the skb, so that we could
use those for forwarding instead of GSO, as I think you're saying.
This would work great in the plain forwarding case, but one problem is
what to do if the host modifies the super packet (for instance, when
forwarding over a tunnel we might add an encapsulation header).  This
should work in GSO (although we need to address the limitations around
one encap level); I'm not sure this is as easy if we need to add a
header to each packet in a batch.

Tom

> -Ed

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 15:36 ` Tom Herbert
@ 2016-04-05 17:06   ` Edward Cree
  2016-04-05 17:38     ` Tom Herbert
  0 siblings, 1 reply; 25+ messages in thread
From: Edward Cree @ 2016-04-05 17:06 UTC (permalink / raw)
To: Tom Herbert
Cc: Herbert Xu, Alexander Duyck, Alexander Duyck, Jesse Gross, Eric Dumazet, Netdev, David Miller

On 05/04/16 16:36, Tom Herbert wrote:
> I thought about that some.  It seems like we would want to do both GRO
> and retain all the individual packets in the skb so that we could use
> those for forwarding instead of GSO as I think you're saying.

I didn't quite mean that; I meant just pass around the skb list and
don't do GRO at all.  The receiving TCP stack ends up just getting
called several times, in quick succession, without I$ loss from network
stack traversal in between.

> This would work great in the plain forwarding case, but one problem
> is what to do if the host modifies the super packet (for instance when
> forwarding over a tunnel we might add an encapsulation header).  This
> should work in GSO (although we need to address the limitations around
> 1 encap level), not sure this is easy if we need to add a header to
> each packet in a batch.

This is indeed a problem with what I was proposing; perhaps the answer
is that as you process these SKB lists you also update something like a
GRO CB, and then if you do decide to transform the packets you can
coalesce them at that point.  But doing 'the rest' of GRO might cost as
much as just transforming all the packets, in which case you only win
if you want to transform them multiple times.

And if we assume that we're going to be touching all the headers
anyway, it probably doesn't cost us much to transform all the packets
in the list, since our code and data are both hot in cache.  Well, the
code is cold for the first packet in the list, but equally it's cold
for the superpacket in the current approach.

If this is as performant as GRO in the normal (non-forwarding) receive
case (and that's a *big* if, which can only be resolved by trying it),
it might make sense to just not have GRO, while TSO only gets used for
locally-generated sends and for the case where you're forwarding
between networks with different MTUs (e.g. from a 64k VM netdevice to
the wire).

What do you think?  Am I crazy?  (More than my usual?)

-Ed

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-05 17:06 ` Edward Cree @ 2016-04-05 17:38 ` Tom Herbert 0 siblings, 0 replies; 25+ messages in thread From: Tom Herbert @ 2016-04-05 17:38 UTC (permalink / raw) To: Edward Cree Cc: Herbert Xu, Alexander Duyck, Alexander Duyck, Jesse Gross, Eric Dumazet, Netdev, David Miller On Tue, Apr 5, 2016 at 2:06 PM, Edward Cree <ecree@solarflare.com> wrote: > On 05/04/16 16:36, Tom Herbert wrote: >> I thought about that some. It seems like we would want to do both GRO >> and retain all the individual packets in the skb so that we could use >> those for forwarding instead of GSO as I think you're saying. > I didn't quite mean that, I meant just pass around the skb list, don't > do GRO at all. The receiving TCP stack ends up just getting called > several times, in quick succession, without I$ loss from network stack > traversal in between. > >> This >> would would work great in the plain forwarding case, but one problem >> is what to do if the host modifies the super packet (for instance when >> forwarding over a tunnel we might add encapsulation header). This >> should work in GSO (although we need to address the limitations around >> 1 encap level), not sure this is easy if we need to add a header to >> each packet in a batch. > This is indeed a problem with what I was proposing; perhaps the answer > is that as you process these SKB lists you also update something like a > GRO CB, then if you do decide to transform the packets you can coalesce > them at that point. But doing 'the rest' of GRO might cost as much as > just transforming all the packets, in which case you only win if you want > to transform them multiple times. > And if we assume that we're going to be touching all the headers anyway, > it probably doesn't cost us much to transform all the packets in the list > since our code and data are both hot in cache. 
Well, the code is cold > for the first packet in the list, but equally it's cold for the > superpacket in the current approach. > > If this is as performant as GRO in the normal (non-forwarding) receive > case (and that's a *big* if, which can only be resolved by trying it), it > might make sense to just not have GRO, while TSO only gets used for > locally-generated sends, and for the case where you're forwarding between > networks with different MTUs (e.g. from a 64k VM netdevice to the wire). > > What do you think? Am I crazy? (More than my usual?) > It's not clear to me how important optimizing the GRO to GSO forwarding case really is. We definitely need GRO for locally received packets, but how much benefit do we really get when applying GRO to forwarding, and does that really outweigh the latency we're adding in the forwarding path due to GRO? Also, once we implement high performance forwarding in XDP there really is no need to consider GRO; we'll do some sort of I/O batching for sure, but that would be batching packets to the same TX queue, not the same destination. Tom > -Ed ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-05 15:36 ` Tom Herbert 2016-04-05 17:06 ` Edward Cree @ 2016-04-06 0:04 ` Marcelo Ricardo Leitner 1 sibling, 0 replies; 25+ messages in thread From: Marcelo Ricardo Leitner @ 2016-04-06 0:04 UTC (permalink / raw) To: Tom Herbert Cc: Edward Cree, Herbert Xu, Alexander Duyck, Alexander Duyck, Jesse Gross, Eric Dumazet, Netdev, David Miller On Tue, Apr 05, 2016 at 12:36:40PM -0300, Tom Herbert wrote: > On Tue, Apr 5, 2016 at 12:07 PM, Edward Cree <ecree@solarflare.com> wrote: > > On 05/04/16 05:32, Herbert Xu wrote: > >> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote: > >>> The question I would have is what are you really losing with increment > >>> from 0 versus fixed 0? From what I see it is essentially just garbage > >>> in/garbage out. > >> GRO is meant to be lossless, that is, you should not be able to > >> detect its presence from the outside. If you lose information then > >> you're breaking this rule and people will soon start asking for it > >> to be disabled in various situations. > >> > >> I'm not against doing this per se but it should not be part of the > >> default configuration. > > I'm certainly in favour of this being configurable - indeed IMHO it should > > also be possible to configure GRO with the 'looser' semantics of LRO, so > > that people who want that can get it without all the horrible "don't confuse > > Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it > > exists are (a) improved performance from the 'loose' semantics and (b) old > > kernels without GRO. We may not be able to kill (b) but we can certainly > > address (a)). > > > > But I don't agree that the default has to be totally lossless; anyone who is > > caring about the ID fields in atomic datagrams is breaking the RFCs, and can > > be assumed to Know What They're Doing sufficiently to configure this. 
> > > > On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up > > forwarding/routing workloads. Instead we should be looking into having lists > > of SKBs traverse the stack together, splitting the list whenever e.g. the > > destination changes. That seems like it ought to be much more efficient than > > rewriting headers twice, once to coalesce a superframe and once to segment it > > again - and it also means this worry about GRO being lossless can go away. > > But until someone tries implementing skb batches, we won't know for sure if > > it works (and I don't have time right now ;) > > > Ed, > > I thought about that some. It seems like we would want to do both GRO > and retain all the individual packets in the skb so that we could use > those for forwarding instead of GSO as I think you're saying. This Retaining the individual packets would also help to make GRO feasible for SCTP. SCTP needs to know where each packet ended because of AUTH chunks and we cannot rely on something like gso_size as each original packet had its own size. I could do it for the tx side (see my SCTP/GSO RFC patches) using skb_gro_receive() with a specially crafted header skb, but I'm not seeing a way to do it on the rx side as I cannot guarantee incoming skbs will follow that pattern. Marcelo > would would work great in the plain forwarding case, but one problem > is what to do if the host modifies the super packet (for instance when > forwarding over a tunnel we might add encapsulation header). This > should work in GSO (although we need to address the limitations around > 1 encap level), not sure this is easy if we need to add a header to > each packet in a batch. > > Tom > > > > > -Ed > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-05 15:07 ` Edward Cree 2016-04-05 15:36 ` Tom Herbert @ 2016-04-05 23:45 ` David Miller 2016-04-06 11:21 ` Edward Cree 1 sibling, 1 reply; 25+ messages in thread From: David Miller @ 2016-04-05 23:45 UTC (permalink / raw) To: ecree; +Cc: herbert, alexander.duyck, aduyck, tom, jesse, edumazet, netdev From: Edward Cree <ecree@solarflare.com> Date: Tue, 5 Apr 2016 16:07:49 +0100 > On the gripping hand, I feel like GRO+TSO is the wrong model for > speeding up forwarding/routing workloads. Instead we should be > looking into having lists of SKBs traverse the stack together, > splitting the list whenever e.g. the destination changes. "Destination" is a very complicated beast. It's not just a destination IP address. It's not even just a full saddr/daddr/TOS triplet. Packets can be forwarded around based upon any key whatsoever in the headers. Netfilter can mangle them based upon arbitrary bits in the packet, as can the packet scheduler classifier actions. It's therefore not profitable to try this at all; it's completely pointless unless all the keys match up exactly. This is why GRO _is_ the proper model for speeding this stuff up and doing bulk processing, because it still presents a full "packet" to all of these layers to mangle, rewrite, route, and do whatever else however they like. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-05 23:45 ` David Miller @ 2016-04-06 11:21 ` Edward Cree 2016-04-06 13:53 ` Tom Herbert 0 siblings, 1 reply; 25+ messages in thread From: Edward Cree @ 2016-04-06 11:21 UTC (permalink / raw) To: David Miller Cc: herbert, alexander.duyck, aduyck, tom, jesse, edumazet, netdev On 06/04/16 00:45, David Miller wrote: > From: Edward Cree <ecree@solarflare.com> > Date: Tue, 5 Apr 2016 16:07:49 +0100 > >> On the gripping hand, I feel like GRO+TSO is the wrong model for >> speeding up forwarding/routing workloads. Instead we should be >> looking into having lists of SKBs traverse the stack together, >> splitting the list whenever e.g. the destination changes. > "Destination" is a very complicated beast. It's not just a > destination IP address. > > It's not even just a full saddr/daddr/TOS triplet. > > Packets can be forwarded around based upon any key whatsoever in the > headers. Netfilter can mangle them based upon arbitrary bits in the > packet, as can the packet scheduler classifier actions. > > It's therefore not profitable to try this at all, it's completely > pointless unless all the keys match up exactly. Possibly I wasn't completely clear (or maybe I was and I'm just wrong...), but I meant that _each layer_ in the stack would split the list whenever it wants to treat two packets differently. Whether that's a protocol receive handler, or a netfilter or tc operation. Obviously if you want to decide at the _beginning_ whether "all the keys match", then you do essentially need GRO's flow-matching logic. But even then, I find myself wondering if having GRO coalesce the segments into a superpacket is really better than having it just make lists of segments, and have that list traverse the stack as a single entity. That way lossless resegmentation remains easy. But I suppose that could make life difficult for things like BPF, if they want to act upon the superframe (because we haven't built it). 
If instead they act on each of the segments, we might get different actions for each segment and that might also be awkward; so you'd still need this concept of 'any layer in the stack can decide to split lists up'. -Ed ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-06 11:21 ` Edward Cree @ 2016-04-06 13:53 ` Tom Herbert 2016-04-06 14:26 ` Edward Cree 2016-04-06 15:43 ` David Miller 0 siblings, 2 replies; 25+ messages in thread From: Tom Herbert @ 2016-04-06 13:53 UTC (permalink / raw) To: Edward Cree Cc: David Miller, Herbert Xu, Alexander Duyck, Alex Duyck, Jesse Gross, Eric Dumazet, Linux Kernel Network Developers On Wed, Apr 6, 2016 at 8:21 AM, Edward Cree <ecree@solarflare.com> wrote: > On 06/04/16 00:45, David Miller wrote: >> From: Edward Cree <ecree@solarflare.com> >> Date: Tue, 5 Apr 2016 16:07:49 +0100 >> >>> On the gripping hand, I feel like GRO+TSO is the wrong model for >>> speeding up forwarding/routing workloads. Instead we should be >>> looking into having lists of SKBs traverse the stack together, >>> splitting the list whenever e.g. the destination changes. >> "Destination" is a very complicated beast. It's not just a >> destination IP address. >> >> It's not even just a full saddr/daddr/TOS triplet. >> >> Packets can be forwarded around based upon any key whatsoever in the >> headers. Netfilter can mangle them based upon arbitrary bits in the >> packet, as can the packet scheduler classifier actions. >> >> It's therefore not profitable to try this at all, it's completely >> pointless unless all the keys match up exactly. > Possibly I wasn't completely clear (or maybe I was and I'm just > wrong...), but I meant that _each layer_ in the stack would split the > list whenever it wants to treat two packets differently. Whether > that's a protocol receive handler, or a netfilter or tc operation. > > Obviously if you want to decide at the _beginning_ whether "all the > keys match", then you do essentially need GRO's flow-matching logic. 
> But even then, I find myself wondering if having GRO coalesce the > segments into a superpacket is really better than having it just make > lists of segments, and have that list traverse the stack as a single > entity. That way lossless resegmentation remains easy. But I suppose > that could make life difficult for things like BPF, if they want to > act upon the superframe (because we haven't built it). If instead > they act on each of the segments, we might get different actions for > each segment and that might also be awkward; so you'd still need this > concept of 'any layer in the stack can decide to split lists up'. > But again, this scheme is optimizing for the forwarding case and doesn't help (and probably hurts) the use case of locally terminated connections, which I would claim is more important. Packets that are forwarded really should not be GRO'ed in the first place because of the loss of information and added latency. The difficulty is that we don't currently make the forwarding decision before GRO; if we can change this to decide whether packets are to be forwarded earlier then we can avoid doing GRO for those. Tom > -Ed ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-06 13:53 ` Tom Herbert @ 2016-04-06 14:26 ` Edward Cree 2016-04-06 15:39 ` Eric Dumazet 2016-04-06 15:43 ` David Miller 1 sibling, 1 reply; 25+ messages in thread From: Edward Cree @ 2016-04-06 14:26 UTC (permalink / raw) To: Tom Herbert Cc: David Miller, Herbert Xu, Alexander Duyck, Alex Duyck, Jesse Gross, Eric Dumazet, Linux Kernel Network Developers On 06/04/16 14:53, Tom Herbert wrote: > But again, this scheme is optimizing for forwarding case and doesn't > help (and probably hurts) the use case of locally terminated > connections Not really. I think this has a chance to outperform GRO for locally terminated connections, for a number of reasons: * Doesn't look at higher-layer or inner headers until later in packet processing, for instance we (maybe) process every L3 header in a NAPI poll before looking at a single L4. This could delay touching the second cacheline of some packets. * Doesn't have to re-write headers to produce a coherent superframe. Instead, each segment carries its original headers around with it. Also means we can skip _checking_ some headers to see if we're 'allowed' to coalesce (e.g. TCP TS differences, and the current fun with IP IDs). * Can be used for protocols like UDP where the original packet boundaries are important (so you can't coalesce into a superframe). Really the last of those was the original reason for this idea, helping with forwarding is just another nice bonus that we (might) get from it. And it's all speculative and I don't know for sure what the performance would be like, but I won't know until I try it! > which I would claim is more important. No argument there :-) > Packets that are > forwarded really should not be GRO'ed in the first place because of > the loss of information and added latency. 
The difficultly is that we > don't currently make the forwarding decision before GRO, if we can > change this to decide whether packets are to be forwarded earlier then > we can avoid doing GRO for those. That certainly would be nice, XDP is exciting and I look forward to it. -Ed ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-06 14:26 ` Edward Cree @ 2016-04-06 15:39 ` Eric Dumazet 2016-04-06 15:55 ` Edward Cree 0 siblings, 1 reply; 25+ messages in thread From: Eric Dumazet @ 2016-04-06 15:39 UTC (permalink / raw) To: Edward Cree Cc: Tom Herbert, David Miller, Herbert Xu, Alexander Duyck, Alex Duyck, Jesse Gross, Eric Dumazet, Linux Kernel Network Developers On Wed, 2016-04-06 at 15:26 +0100, Edward Cree wrote: > On 06/04/16 14:53, Tom Herbert wrote: > > But again, this scheme is optimizing for forwarding case and doesn't > > help (and probably hurts) the use case of locally terminated > > connections > Not really. I think this has a chance to outperform GRO for locally > terminated connections, for a number of reasons: > * Doesn't look at higher-layer or inner headers until later in packet > processing, for instance we (maybe) process every L3 header in a NAPI poll > before looking at a single L4. This could delay touching the second > cacheline of some packets. > * Doesn't have to re-write headers to produce a coherent superframe. > Instead, each segment carries its original headers around with it. Also > means we can skip _checking_ some headers to see if we're 'allowed' to > coalesce (e.g. TCP TS differences, and the current fun with IP IDs). > * Can be used for protocols like UDP where the original packet boundaries > are important (so you can't coalesce into a superframe). > Really the last of those was the original reason for this idea, helping with > forwarding is just another nice bonus that we (might) get from it. > And it's all speculative and I don't know for sure what the performance > would be like, but I won't know until I try it! Look at the mess of some helpers in net/core/skbuff.c, and imagine the super mess it would be if using a concept of 'super packet with various headers on each segment'. netfilter is already complex, it would become a nightmare. 
GRO on the other hand presents one virtual set of headers, then the payload in one or multiple frags. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-06 15:39 ` Eric Dumazet @ 2016-04-06 15:55 ` Edward Cree 2016-04-06 16:03 ` Eric Dumazet 0 siblings, 1 reply; 25+ messages in thread From: Edward Cree @ 2016-04-06 15:55 UTC (permalink / raw) To: Eric Dumazet Cc: Tom Herbert, David Miller, Herbert Xu, Alexander Duyck, Alex Duyck, Jesse Gross, Eric Dumazet, Linux Kernel Network Developers On 06/04/16 16:39, Eric Dumazet wrote: > Look at the mess of some helpers in net/core/skbuff.c, and imagine the > super mess it would be if using a concept of 'super packet with various > headers on each segment'. Maybe I'm still not explaining this very well, but there is _no_ concept of 'super packet [anything]' in this idea. There is just 'list of skbs that were all received in the same NAPI poll, and have not yet been determined to be different'. Any layer that doesn't want to deal with this stuff will always have the option of "while (skb = skb_dequeue(list)) my_normal_receive_function(skb);" and in fact I'd make that happen by default for anything that hadn't registered a function to take a list. > netfilter is already complex, it would become a nightmare. A netfilter hook could, for instance, run on each packet in the list, then partition the list into sub-lists of packets that all had the same verdict (letting go of any that were DROP or STOLEN). That doesn't seem like it should be nightmarish. -Ed ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-06 15:55 ` Edward Cree @ 2016-04-06 16:03 ` Eric Dumazet 0 siblings, 0 replies; 25+ messages in thread From: Eric Dumazet @ 2016-04-06 16:03 UTC (permalink / raw) To: Edward Cree Cc: Tom Herbert, David Miller, Herbert Xu, Alexander Duyck, Alex Duyck, Jesse Gross, Eric Dumazet, Linux Kernel Network Developers On Wed, 2016-04-06 at 16:55 +0100, Edward Cree wrote: > On 06/04/16 16:39, Eric Dumazet wrote: > > Look at the mess of some helpers in net/core/skbuff.c, and imagine the > > super mess it would be if using a concept of 'super packet with various > > headers on each segment'. > Maybe I'm still not explaining this very well, but there is _no_ concept of > 'super packet [anything]' in this idea. There is just 'list of skbs that > were all received in the same NAPI poll, and have not yet been determined to > be different'. > > Any layer that doesn't want to deal with this stuff will always have the > option of "while (skb = skb_dequeue(list)) my_normal_receive_function(skb);" > and in fact I'd make that happen by default for anything that hadn't > registered a function to take a list. > > netfilter is already complex, it would become a nightmare. > A netfilter hook could, for instance, run on each packet in the list, then > partition the list into sub-lists of packets that all had the same verdict > (letting go of any that were DROP or STOLEN). That doesn't seem like it > should be nightmarish. Okay, I will let you try this alone ;) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-06 13:53 ` Tom Herbert 2016-04-06 14:26 ` Edward Cree @ 2016-04-06 15:43 ` David Miller 2016-04-06 17:42 ` Tom Herbert 1 sibling, 1 reply; 25+ messages in thread From: David Miller @ 2016-04-06 15:43 UTC (permalink / raw) To: tom; +Cc: ecree, herbert, alexander.duyck, aduyck, jesse, edumazet, netdev From: Tom Herbert <tom@herbertland.com> Date: Wed, 6 Apr 2016 10:53:52 -0300 > Packets that are forwarded really should not be GRO'ed in the first > place because of the loss of information and added latency. First of all GRO is supposed to be lossless, so please stop saying this would be a reason to turn it off on a router. Second of all, the biggest piece of overhead is the routing lookup, therefore GRO batching helps enormously with routing workloads, and therefore is appropriate to be enabled on routers. Yes, I agree that for locally terminated stuff it helps more, but don't turn this into a "GRO on routers, meh..." type argument. It simply is not true at all. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-06 15:43 ` David Miller @ 2016-04-06 17:42 ` Tom Herbert 2016-04-06 19:30 ` David Miller 0 siblings, 1 reply; 25+ messages in thread From: Tom Herbert @ 2016-04-06 17:42 UTC (permalink / raw) To: David Miller Cc: Edward Cree, Herbert Xu, Alexander Duyck, Alex Duyck, Jesse Gross, Eric Dumazet, Linux Kernel Network Developers On Wed, Apr 6, 2016 at 12:43 PM, David Miller <davem@davemloft.net> wrote: > From: Tom Herbert <tom@herbertland.com> > Date: Wed, 6 Apr 2016 10:53:52 -0300 > >> Packets that are forwarded really should not be GRO'ed in the first >> place because of the loss of information and added latency. > > First of all GRO is supposed to be lossless, so please stop saying this > would be a reason to turn it off on a router. > > Second of all, the biggest piece of overhead is the routing lookup, > therefore GRO batching helps enormously with routing workloads, and > therefore is appropriate to be enabled on routers. > > Yes, I agree that for locally terminated stuff it helps more, but don't > turn this into a "GRO on routers, meh..." type argument. It simply is > not true at all. > GRO on routers will help in the limited case where there is little load and the traffic consists of nicely groomed, high-tput TCP connections. But for routers with significant load, handling large quantities of other protocols like UDP, GRO is not necessarily helpful and presents a nondeterministic performance improvement. For instance, I cannot provision a router with any assumptions that GRO will be effective for any % of packets to save any % of CPU; we need to provision based purely on the ability to forward by pps, assuming no benefit from GRO. Technically, this provisioning argument applies to end hosts also. We have seen real cases on video servers where servers were mis-provisioned with the assumption that GSO is always effective.
So when we're getting good GSO benefits we might be using something like 80% CPU, but if some connectivity event occurs forcing all the cwnds of the active connections to drop, we find we need 110% of CPU to recover. This discussion is relevant because there is a big push now to replace dedicated HW with commodity HW running Linux; this has already happened significantly in load balancers, but I expect this to extend to even some cases of basic switching. Besides, I seriously doubt you'll find any commercial router in the world that does GRO. Yes, GRO needs to work correctly in all cases, but for system performance in routing we are bound to per-packet actions like route lookup. As I said, optimizing GRO for the forwarding case, as Edward is suggesting, seems to have diminishing benefits. Tom ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-06 17:42 ` Tom Herbert @ 2016-04-06 19:30 ` David Miller 0 siblings, 0 replies; 25+ messages in thread From: David Miller @ 2016-04-06 19:30 UTC (permalink / raw) To: tom; +Cc: ecree, herbert, alexander.duyck, aduyck, jesse, edumazet, netdev From: Tom Herbert <tom@herbertland.com> Date: Wed, 6 Apr 2016 14:42:26 -0300 > On Wed, Apr 6, 2016 at 12:43 PM, David Miller <davem@davemloft.net> wrote: >> From: Tom Herbert <tom@herbertland.com> >> Date: Wed, 6 Apr 2016 10:53:52 -0300 >> >>> Packets that are forwarded really should not be GRO'ed in the first >>> place because of the loss of information and added latency. >> >> First of all GRO is supposed to be lossless, so please stop saying this >> would be a reason to turn it off on a router. >> >> Second of all, the biggest piece of overhead is the routing lookup, >> therefore GRO batching helps enormously with routing workloads, and >> therefore is appropriate to be enabled on routers. >> >> Yes, I agree that for locally terminated stuff it helps more, but don't >> turn this into a "GRO on routers, meh..." type argument. It simply is >> not true at all. >> > GRO on routers will help in a limited case where there is little load > and the traffic is nicely groomed high tput TCP connections. But for > routers with significant load, handling large quantities other > protocols like UDP, GRO is not necessarily helpful and presents a > nondeterministic performance improvement. For instance, I cannot > provision a router with any assumptions that GRO will be effective for > any % of packets to save any % of CPU, we need to provision based > purely on ability to forward by pps assuming no benefit from GRO. Just because you cannot predict how effective a facility will be, that doesn't mean you shouldn't use it at all. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-05 4:32 ` Herbert Xu 2016-04-05 15:07 ` Edward Cree @ 2016-04-05 15:52 ` Alexander Duyck 2016-04-05 16:30 ` Eric Dumazet 1 sibling, 1 reply; 25+ messages in thread From: Alexander Duyck @ 2016-04-05 15:52 UTC (permalink / raw) To: Herbert Xu Cc: Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev, David Miller On Mon, Apr 4, 2016 at 9:32 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote: >> >> The problem is right now we are mangling the IP ID for outer headers >> on tunnels. We end up totally ignoring the delta between the values >> so if you have two flows that get interleaved over the same tunnel GRO >> will currently mash the IP IDs for the two tunnels so that they end up >> overlapping. > > Then it should be fixed. I never reviewed those patches or I would > have objected at the time. The problem is that what you are proposing is much larger than I am comfortable proposing as a net patch, so I will probably go back to targeting net-next. >> The reason why I keep referencing RFC 6864 is because it specifies >> that the IP ID field must not be read if the DF bit is set, and that >> if we are manipulating headers we can handle the IP ID as though we >> are the transmitting station. What this means is that if DF is not >> set we have to have unique values per packet, otherwise we can ignore >> the values if DF is set. > > As I said GRO itself should not be visible. The fact that it is > for tunnels is a bug. Right. But the fact that we are even trying to get reliable data out of the IP ID is also a bug. The fact is the IP ID isn't going to be linear in the case of tunnels. There is a good likelihood that the IP ID will jump around if we are doing encapsulation using a third-party device such as a switch.
That was one of the reasons why my first implementation just completely ignored the IP ID in the case that the DF bit is set. Honestly, I am leaning more toward taking the approach of going back to that implementation and adding a sysctl that would let you disable it. Maybe something like net.ipv4.gro.rfc6864 since that is the RFC that spells out that we should treat IP ID as a floating input/output if DF is set. Basically we need to be able to do that if the GSO partial code is going to work. >> The question I would have is what are you really losing with increment >> from 0 versus fixed 0? From what I see it is essentially just garbage >> in/garbage out. > > GRO is meant to be lossless, that is, you should not be able to > detect its presence from the outside. If you lose information then > you're breaking this rule and people will soon start asking for it > to be disabled in various situations. Right. But in this case we technically aren't losing information. That is the thing I have been trying to point out with RFC 6864. It calls out that you MUST not read the IP ID if the DF bit is set. > I'm not against doing this per se but it should not be part of the > default configuration. I disagree. I think it will have to be part of the default configuration. The problem is the IP ID is quickly becoming meaningless. When you consider that a 40Gb/s link can wrap the IP ID value nearly 50 times a second using a 1500 MTU, the IP ID field should just be ignored anyway because you cannot guarantee that it will be unique without limiting the Tx window size. That was the whole point of RFC 6864. Basically the IP ID field is so small that as we push into the higher speeds you cannot guarantee that the field will have any meaning, so for any case where you don't need to use it you shouldn't, because it will likely not provide enough useful data. - Alex ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 2016-04-05 15:52 ` Alexander Duyck @ 2016-04-05 16:30 ` Eric Dumazet 2016-04-05 16:45 ` Alexander Duyck 0 siblings, 1 reply; 25+ messages in thread From: Eric Dumazet @ 2016-04-05 16:30 UTC (permalink / raw) To: Alexander Duyck Cc: Herbert Xu, Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev, David Miller On Tue, 2016-04-05 at 08:52 -0700, Alexander Duyck wrote: > > I disagree I think it will have to be part of the default > configuration. The problem is the IP ID is quickly becoming > meaningless. When you consider that a 40Gb/s link can wrap the IP ID > value nearly 50 times a second using a 1500 MTU the IP ID field should > just be ignored anyway because you cannot guarantee that it will be > unique without limiting the Tx window size. That was the whole point > of RFC6864. Basically the IP ID field is so small that as we push > into the higher speeds you cannot guarantee that the field will have > any meaning so for any case where you don't need to use it you > shouldn't because it will likely not provide enough useful data. Just because a few flows reach 40Gbit, we should remember that the vast majority of the Internet runs with < 50Mbit flows. I prefer the argument of IPv6 not having ID ;) We should do our best to keep interoperability; this is the selling point. And quite frankly your last patch makes perfect sense to me: The aggregation is done only if the TCP headers of consecutive packets match. So who cares about the IPv4 ID, really? This is a very minor detail. The possible gains outweigh the theoretical 'problem'. GRO already reorders flows; it never had a guarantee of being 'invisible' as Herbert claims. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
  2016-04-05 16:30 ` Eric Dumazet
@ 2016-04-05 16:45 ` Alexander Duyck
  0 siblings, 0 replies; 25+ messages in thread

From: Alexander Duyck @ 2016-04-05 16:45 UTC (permalink / raw)
To: Eric Dumazet
Cc: Herbert Xu, Alexander Duyck, Tom Herbert, Jesse Gross,
    Eric Dumazet, Netdev, David Miller

On Tue, Apr 5, 2016 at 9:30 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-04-05 at 08:52 -0700, Alexander Duyck wrote:
>
>>
>> I disagree I think it will have to be part of the default
>> configuration.  The problem is the IP ID is quickly becoming
>> meaningless.  When you consider that a 40Gb/s link can wrap the IP ID
>> value nearly 50 times a second using a 1500 MTU the IP ID field should
>> just be ignored anyway because you cannot guarantee that it will be
>> unique without limiting the Tx window size.  That was the whole point
>> of RFC6864.  Basically the IP ID field is so small that as we push
>> into the higher speeds you cannot guarantee that the field will have
>> any meaning so for any case where you don't need to use it you
>> shouldn't because it will likely not provide enough useful data.
>
> Just because a few flows reach 40Gbit, we should remember that the vast
> majority of the Internet runs on < 50Mbit flows.
>
> I prefer the argument of IPv6 not having an ID ;)

Okay, maybe I'll try to use that argument more often then... :-)

> We should do our best to keep interoperability, this is the selling
> point.
>
> And quite frankly your last patch makes perfect sense to me:

Yes.  It was a compromise, though I might still have to go through and
refine it more.  It might make sense for the IP header associated with
the TCP flow, but for outer headers it is actually worse, because we
end up blocking several different possibilities.
What I might need to do is capture the state of the DF bit as we work
a flow up through the stack, and once it is in the list of GRO SKBs
use that DF bit as a flag to indicate whether we support an
incrementing or a fixed pattern for the values.  That way tunnels can
optionally ignore the IP ID if the DF bit is set, since their values
may not be as clean as those of TCP.

> The aggregation is done only if the TCP headers of consecutive packets
> match.  So who cares about the IPv4 ID, really?
> This is a very minor detail.  The possible gains outweigh the
> theoretical 'problem'.
>
> GRO already reorders flows; it never had a guarantee of being
> 'invisible' as Herbert claims.

I can see what he is trying to get at.  I just think it is a bit too
strict an interpretation of what values have to be maintained.

My plan going forward is to add a sysctl that will probably allow us
some wiggle room in regards to the IP ID for GRO and GSO, so that when
it is disabled we will not perform GSO partial nor allow for repeating
IP IDs in GRO on devices that cannot get the IP ID right.

- Alex
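[Editorial note: the v2 cover letter above says GRO would accept one of
two ID modes, either an incrementing sequence or a fixed value with DF
set.  The predicate below sketches that rule only; it is not the kernel
implementation, and the helper name gro_id_ok is invented here.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A candidate packet may extend the tail of a GRO flow if either
 *   - the IPv4 ID increments by exactly one per segment (the modular
 *     arithmetic handles the 65535 -> 0 wrap), or
 *   - the ID is fixed and DF is set, i.e. an RFC 6864 "atomic"
 *     datagram whose ID field carries no meaning. */
static bool gro_id_ok(uint16_t prev_id, uint16_t next_id, bool df_set)
{
	if (next_id == (uint16_t)(prev_id + 1))
		return true;			/* incrementing sequence */

	return df_set && next_id == prev_id;	/* fixed ID, DF required */
}
```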
end of thread, other threads:[~2016-04-06 19:30 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-04 16:27 [net PATCH v2 0/2] Fixes for GRO and GRE tunnels Alexander Duyck
2016-04-04 16:28 ` [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU Alexander Duyck
2016-04-04 16:31 ` [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864 Alexander Duyck
2016-04-05  0:38 ` subashab
2016-04-05  3:44 ` Herbert Xu
2016-04-05  4:26 ` Alexander Duyck
2016-04-05  4:32 ` Herbert Xu
2016-04-05 15:07 ` Edward Cree
2016-04-05 15:36 ` Tom Herbert
2016-04-05 17:06 ` Edward Cree
2016-04-05 17:38 ` Tom Herbert
2016-04-06  0:04 ` Marcelo Ricardo Leitner
2016-04-05 23:45 ` David Miller
2016-04-06 11:21 ` Edward Cree
2016-04-06 13:53 ` Tom Herbert
2016-04-06 14:26 ` Edward Cree
2016-04-06 15:39 ` Eric Dumazet
2016-04-06 15:55 ` Edward Cree
2016-04-06 16:03 ` Eric Dumazet
2016-04-06 15:43 ` David Miller
2016-04-06 17:42 ` Tom Herbert
2016-04-06 19:30 ` David Miller
2016-04-05 15:52 ` Alexander Duyck
2016-04-05 16:30 ` Eric Dumazet
2016-04-05 16:45 ` Alexander Duyck