[net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

All of lore.kernel.org
 help / color / mirror / Atom feed

* [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
@ 2016-02-19 19:26 Alexander Duyck
  2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck
                   ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Alexander Duyck @ 2016-02-19 19:26 UTC (permalink / raw)
  To: netdev, davem, alexander.duyck

This patch series makes it so that we enable the outer Tx checksum for IPv4
tunnels by default.  This makes the behavior consistent with how we were
handling this for IPv6.  In addition I have updated the internal flags for
these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
match up will with the ZERO_CSUM6_TX flag which was already in use for
IPv6.

For most network devices this should be a net gain in terms of performance
as having the outer header checksum present allows for devices to report
CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
to determine if the inner header checksum is valid.

Below is some data I collected with ixgbe with an X540 that demonstrates
this.  I located two PFs connected back to back in two different name
spaces and then setup a pair of tunnels on each, one with checksum enabled
and one without.

Recv   Send    Send                          Utilization
Socket Socket  Message  Elapsed              Send
Size   Size    Size     Time     Throughput  local
bytes  bytes   bytes    secs.    10^6bits/s  % S

noudpcsum:
 87380  16384  16384    30.00      8898.67   12.80
udpcsum:
 87380  16384  16384    30.00      9088.47   5.69

The one spot where this may cause a performance regression is if the
environment contains devices that can parse the inner headers and a device
supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
the case of such a device we have to fall back to using GSO to segment the
tunnel instead of TSO and as a result we may take a performance hit as seen
below with i40e.

Recv   Send    Send                          Utilization
Socket Socket  Message  Elapsed              Send
Size   Size    Size     Time     Throughput  local
bytes  bytes   bytes    secs.    10^6bits/s  % S

noudpcsum:
 87380  16384  16384    30.00      9085.21   3.32
udpcsum:
 87380  16384  16384    30.00      9089.23   5.54

In addition it will be necessary to update iproute2 so that we don't
provide the checksum attribute unless specified.  This way on older kernels
which don't have local checksum offload we will default to disabling the
outer checksum, and on newer kernels that have LCO we can default to
enabling it.

I also haven't investigated the effect this will have on OVS.  However I
suspect the impact should be minimal as the worst case scenario should be
that Tx checksumming will become enabled by default which should be
consistent with the existing behavior for IPv6.

---

Alexander Duyck (2):
      GENEVE: Support outer IPv4 Tx checksums by default
      VXLAN: Support outer IPv4 Tx checksums by default

 drivers/net/geneve.c |   16 ++++++++--------
 drivers/net/vxlan.c  |   19 +++++++++----------
 include/net/vxlan.h  |    2 +-
 3 files changed, 18 insertions(+), 19 deletions(-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums by default
  2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck
@ 2016-02-19 19:26 ` Alexander Duyck
  2016-02-19 20:28   ` Tom Herbert
  2016-02-19 19:26 ` [net-next PATCH 2/2] VXLAN: " Alexander Duyck
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2016-02-19 19:26 UTC (permalink / raw)
  To: netdev, davem, alexander.duyck

This change makes it so that if UDP CSUM is not specified we will default
to enabling it.  The main motivation behind this is the fact that with the
use of outer checksum we can greatly improve the performance for GENEVE
tunnels on hardware that doesn't know how to parse them.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 drivers/net/geneve.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 1b5deaf7e3c8..2acaf2b209cd 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -76,7 +76,7 @@ struct geneve_dev {
 };
 
 /* Geneve device flags */
-#define GENEVE_F_UDP_CSUM		BIT(0)
+#define GENEVE_F_UDP_ZERO_CSUM_TX	BIT(0)
 #define GENEVE_F_UDP_ZERO_CSUM6_TX	BIT(1)
 #define GENEVE_F_UDP_ZERO_CSUM6_RX	BIT(2)
 
@@ -703,7 +703,7 @@ static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb,
 	struct genevehdr *gnvh;
 	int min_headroom;
 	int err;
-	bool udp_sum = !!(flags & GENEVE_F_UDP_CSUM);
+	bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM_TX);
 
 	skb_scrub_packet(skb, xnet);
 
@@ -944,9 +944,9 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 			opts = ip_tunnel_info_opts(info);
 
 		if (key->tun_flags & TUNNEL_CSUM)
-			flags |= GENEVE_F_UDP_CSUM;
+			flags &= ~GENEVE_F_UDP_ZERO_CSUM_TX;
 		else
-			flags &= ~GENEVE_F_UDP_CSUM;
+			flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
 
 		err = geneve_build_skb(rt, skb, key->tun_flags, vni,
 				       info->options_len, opts, flags, xnet);
@@ -972,7 +972,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, fl4.saddr, fl4.daddr,
 			    tos, ttl, df, sport, geneve->dst_port,
 			    !net_eq(geneve->net, dev_net(geneve->dev)),
-			    !(flags & GENEVE_F_UDP_CSUM));
+			    !!(flags & GENEVE_F_UDP_ZERO_CSUM_TX));
 
 	return NETDEV_TX_OK;
 
@@ -1412,8 +1412,8 @@ static int geneve_newlink(struct net *net, struct net_device *dev,
 		metadata = true;
 
 	if (data[IFLA_GENEVE_UDP_CSUM] &&
-	    nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
-		flags |= GENEVE_F_UDP_CSUM;
+	    !nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
+		flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
 
 	if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] &&
 	    nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]))
@@ -1483,7 +1483,7 @@ static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	}
 
 	if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
-		       !!(geneve->flags & GENEVE_F_UDP_CSUM)) ||
+		       !(geneve->flags & GENEVE_F_UDP_ZERO_CSUM_TX)) ||
 	    nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
 		       !!(geneve->flags & GENEVE_F_UDP_ZERO_CSUM6_TX)) ||
 	    nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [net-next PATCH 2/2] VXLAN: Support outer IPv4 Tx checksums by default
  2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck
  2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck
@ 2016-02-19 19:26 ` Alexander Duyck
  2016-02-19 20:27   ` Tom Herbert
  2016-02-19 21:53 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum " Jesse Gross
  2016-02-22  3:06 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default David Miller
  3 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2016-02-19 19:26 UTC (permalink / raw)
  To: netdev, davem, alexander.duyck

This change makes it so that if UDP CSUM is not specified we will default
to enabling it.  The main motivation behind this is the fact that with the
use of outer checksum we can greatly improve the performance for VXLAN
tunnels on devices that don't know how to parse tunnel headers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 drivers/net/vxlan.c |   19 +++++++++----------
 include/net/vxlan.h |    2 +-
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 766e6114a37f..909f7931c297 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1957,13 +1957,6 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 			goto drop;
 		sk = vxlan->vn4_sock->sock->sk;
 
-		if (info) {
-			if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
-				df = htons(IP_DF);
-		} else {
-			udp_sum = !!(flags & VXLAN_F_UDP_CSUM);
-		}
-
 		rt = vxlan_get_route(vxlan, skb,
 				     rdst ? rdst->remote_ifindex : 0, tos,
 				     dst->sin.sin_addr.s_addr, &saddr,
@@ -1997,6 +1990,11 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 			return;
 		}
 
+		if (!info)
+			udp_sum = !(flags & VXLAN_F_UDP_ZERO_CSUM_TX);
+		else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
+			df = htons(IP_DF);
+
 		tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
 		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
 		err = vxlan_build_skb(skb, &rt->dst, sizeof(struct iphdr),
@@ -2920,8 +2918,9 @@ static int vxlan_newlink(struct net *src_net, struct net_device *dev,
 	if (data[IFLA_VXLAN_PORT])
 		conf.dst_port = nla_get_be16(data[IFLA_VXLAN_PORT]);
 
-	if (data[IFLA_VXLAN_UDP_CSUM] && nla_get_u8(data[IFLA_VXLAN_UDP_CSUM]))
-		conf.flags |= VXLAN_F_UDP_CSUM;
+	if (data[IFLA_VXLAN_UDP_CSUM] &&
+	    !nla_get_u8(data[IFLA_VXLAN_UDP_CSUM]))
+		conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
 
 	if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] &&
 	    nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
@@ -3065,7 +3064,7 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	    nla_put_u32(skb, IFLA_VXLAN_LIMIT, vxlan->cfg.addrmax) ||
 	    nla_put_be16(skb, IFLA_VXLAN_PORT, vxlan->cfg.dst_port) ||
 	    nla_put_u8(skb, IFLA_VXLAN_UDP_CSUM,
-			!!(vxlan->flags & VXLAN_F_UDP_CSUM)) ||
+			!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM_TX)) ||
 	    nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
 			!!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM6_TX)) ||
 	    nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 748083de367a..6eda4ed4d78b 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -197,7 +197,7 @@ struct vxlan_dev {
 #define VXLAN_F_L2MISS			0x08
 #define VXLAN_F_L3MISS			0x10
 #define VXLAN_F_IPV6			0x20
-#define VXLAN_F_UDP_CSUM		0x40
+#define VXLAN_F_UDP_ZERO_CSUM_TX	0x40
 #define VXLAN_F_UDP_ZERO_CSUM6_TX	0x80
 #define VXLAN_F_UDP_ZERO_CSUM6_RX	0x100
 #define VXLAN_F_REMCSUM_TX		0x200

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 2/2] VXLAN: Support outer IPv4 Tx checksums by default
  2016-02-19 19:26 ` [net-next PATCH 2/2] VXLAN: " Alexander Duyck
@ 2016-02-19 20:27   ` Tom Herbert
  2016-02-19 21:36     ` Jesse Gross
  0 siblings, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-02-19 20:27 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck

On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote:
> This change makes it so that if UDP CSUM is not specified we will default
> to enabling it.  The main motivation behind this is the fact that with the
> use of outer checksum we can greatly improve the performance for VXLAN
> tunnels on devices that don't know how to parse tunnel headers.
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
>  drivers/net/vxlan.c |   19 +++++++++----------
>  include/net/vxlan.h |    2 +-
>  2 files changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 766e6114a37f..909f7931c297 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -1957,13 +1957,6 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>                         goto drop;
>                 sk = vxlan->vn4_sock->sock->sk;
>
> -               if (info) {
> -                       if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
> -                               df = htons(IP_DF);
> -               } else {
> -                       udp_sum = !!(flags & VXLAN_F_UDP_CSUM);
> -               }
> -
>                 rt = vxlan_get_route(vxlan, skb,
>                                      rdst ? rdst->remote_ifindex : 0, tos,
>                                      dst->sin.sin_addr.s_addr, &saddr,
> @@ -1997,6 +1990,11 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>                         return;
>                 }
>
> +               if (!info)
> +                       udp_sum = !(flags & VXLAN_F_UDP_ZERO_CSUM_TX);
> +               else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
> +                       df = htons(IP_DF);
> +
>                 tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
>                 ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
>                 err = vxlan_build_skb(skb, &rt->dst, sizeof(struct iphdr),
> @@ -2920,8 +2918,9 @@ static int vxlan_newlink(struct net *src_net, struct net_device *dev,
>         if (data[IFLA_VXLAN_PORT])
>                 conf.dst_port = nla_get_be16(data[IFLA_VXLAN_PORT]);
>
> -       if (data[IFLA_VXLAN_UDP_CSUM] && nla_get_u8(data[IFLA_VXLAN_UDP_CSUM]))
> -               conf.flags |= VXLAN_F_UDP_CSUM;
> +       if (data[IFLA_VXLAN_UDP_CSUM] &&
> +           !nla_get_u8(data[IFLA_VXLAN_UDP_CSUM]))
> +               conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
>
>         if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] &&
>             nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
> @@ -3065,7 +3064,7 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
>             nla_put_u32(skb, IFLA_VXLAN_LIMIT, vxlan->cfg.addrmax) ||
>             nla_put_be16(skb, IFLA_VXLAN_PORT, vxlan->cfg.dst_port) ||
>             nla_put_u8(skb, IFLA_VXLAN_UDP_CSUM,
> -                       !!(vxlan->flags & VXLAN_F_UDP_CSUM)) ||
> +                       !(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM_TX)) ||
>             nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
>                         !!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM6_TX)) ||
>             nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
> diff --git a/include/net/vxlan.h b/include/net/vxlan.h
> index 748083de367a..6eda4ed4d78b 100644
> --- a/include/net/vxlan.h
> +++ b/include/net/vxlan.h
> @@ -197,7 +197,7 @@ struct vxlan_dev {
>  #define VXLAN_F_L2MISS                 0x08
>  #define VXLAN_F_L3MISS                 0x10
>  #define VXLAN_F_IPV6                   0x20
> -#define VXLAN_F_UDP_CSUM               0x40
> +#define VXLAN_F_UDP_ZERO_CSUM_TX       0x40
>  #define VXLAN_F_UDP_ZERO_CSUM6_TX      0x80
>  #define VXLAN_F_UDP_ZERO_CSUM6_RX      0x100
>  #define VXLAN_F_REMCSUM_TX             0x200
>

Acked-by: Tom Herbert <tom@herbertland.com>

I would also note RFC7348 specifies:

UDP Checksum: It SHOULD be transmitted as zero. ...

The RFC doesn't provide any rationale as to why this is a SHOULD
(neither is there any discussion as to whether this pertains to IPv6
which has stronger requirements for non-zero UDP checksum). I think
there are two possibilities in the intent: 1) The authors assume that
computing UDP checksums is a significant performance hit which is
dis-proven by this patch 2) They are worried about devices that are
unable to compute receive checksums, however this would be addressed
by an allowance that devices can ignore non-zero UDP checksums for
VXLAN ("When a decapsulating end point receives a packet with a
non-zero checksum, it MAY choose to verify the checksum value.")


.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums by default
  2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck
@ 2016-02-19 20:28   ` Tom Herbert
  0 siblings, 0 replies; 47+ messages in thread
From: Tom Herbert @ 2016-02-19 20:28 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck

On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote:
> This change makes it so that if UDP CSUM is not specified we will default
> to enabling it.  The main motivation behind this is the fact that with the
> use of outer checksum we can greatly improve the performance for GENEVE
> tunnels on hardware that doesn't know how to parse them.
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>

Acked-by: Tom Herbert <tom@herbertland.com>

> ---
>  drivers/net/geneve.c |   16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 1b5deaf7e3c8..2acaf2b209cd 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -76,7 +76,7 @@ struct geneve_dev {
>  };
>
>  /* Geneve device flags */
> -#define GENEVE_F_UDP_CSUM              BIT(0)
> +#define GENEVE_F_UDP_ZERO_CSUM_TX      BIT(0)
>  #define GENEVE_F_UDP_ZERO_CSUM6_TX     BIT(1)
>  #define GENEVE_F_UDP_ZERO_CSUM6_RX     BIT(2)
>
> @@ -703,7 +703,7 @@ static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb,
>         struct genevehdr *gnvh;
>         int min_headroom;
>         int err;
> -       bool udp_sum = !!(flags & GENEVE_F_UDP_CSUM);
> +       bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM_TX);
>
>         skb_scrub_packet(skb, xnet);
>
> @@ -944,9 +944,9 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
>                         opts = ip_tunnel_info_opts(info);
>
>                 if (key->tun_flags & TUNNEL_CSUM)
> -                       flags |= GENEVE_F_UDP_CSUM;
> +                       flags &= ~GENEVE_F_UDP_ZERO_CSUM_TX;
>                 else
> -                       flags &= ~GENEVE_F_UDP_CSUM;
> +                       flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
>
>                 err = geneve_build_skb(rt, skb, key->tun_flags, vni,
>                                        info->options_len, opts, flags, xnet);
> @@ -972,7 +972,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
>         udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, fl4.saddr, fl4.daddr,
>                             tos, ttl, df, sport, geneve->dst_port,
>                             !net_eq(geneve->net, dev_net(geneve->dev)),
> -                           !(flags & GENEVE_F_UDP_CSUM));
> +                           !!(flags & GENEVE_F_UDP_ZERO_CSUM_TX));
>
>         return NETDEV_TX_OK;
>
> @@ -1412,8 +1412,8 @@ static int geneve_newlink(struct net *net, struct net_device *dev,
>                 metadata = true;
>
>         if (data[IFLA_GENEVE_UDP_CSUM] &&
> -           nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
> -               flags |= GENEVE_F_UDP_CSUM;
> +           !nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
> +               flags |= GENEVE_F_UDP_ZERO_CSUM_TX;
>
>         if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] &&
>             nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]))
> @@ -1483,7 +1483,7 @@ static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev)
>         }
>
>         if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
> -                      !!(geneve->flags & GENEVE_F_UDP_CSUM)) ||
> +                      !(geneve->flags & GENEVE_F_UDP_ZERO_CSUM_TX)) ||
>             nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
>                        !!(geneve->flags & GENEVE_F_UDP_ZERO_CSUM6_TX)) ||
>             nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 2/2] VXLAN: Support outer IPv4 Tx checksums by default
  2016-02-19 20:27   ` Tom Herbert
@ 2016-02-19 21:36     ` Jesse Gross
  0 siblings, 0 replies; 47+ messages in thread
From: Jesse Gross @ 2016-02-19 21:36 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexander Duyck, Linux Kernel Network Developers,
	David S. Miller, Alexander Duyck

On Fri, Feb 19, 2016 at 12:27 PM, Tom Herbert <tom@herbertland.com> wrote:
> I would also note RFC7348 specifies:
>
> UDP Checksum: It SHOULD be transmitted as zero. ...
>
> The RFC doesn't provide any rationale as to why this is a SHOULD
> (neither is there any discussion as to whether this pertains to IPv6
> which has stronger requirements for non-zero UDP checksum). I think
> there are two possibilities in the intent: 1) The authors assume that
> computing UDP checksums is a significant performance hit which is
> dis-proven by this patch 2) They are worried about devices that are
> unable to compute receive checksums, however this would be addressed
> by an allowance that devices can ignore non-zero UDP checksums for
> VXLAN ("When a decapsulating end point receives a packet with a
> non-zero checksum, it MAY choose to verify the checksum value.")

It's #2.

All of the performance concerns around checksums and tunneling stem
from devices implemented using switching ASICs. In those devices,
computing/verifying checksums is so slow (software path) that they are
effectively unable to do it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck
  2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck
  2016-02-19 19:26 ` [net-next PATCH 2/2] VXLAN: " Alexander Duyck
@ 2016-02-19 21:53 ` Jesse Gross
  2016-02-19 23:10   ` Alex Duyck
  2016-02-22  3:06 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default David Miller
  3 siblings, 1 reply; 47+ messages in thread
From: Jesse Gross @ 2016-02-19 21:53 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Linux Kernel Network Developers, David Miller, Alexander Duyck

On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote:
> This patch series makes it so that we enable the outer Tx checksum for IPv4
> tunnels by default.  This makes the behavior consistent with how we were
> handling this for IPv6.  In addition I have updated the internal flags for
> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
> match up will with the ZERO_CSUM6_TX flag which was already in use for
> IPv6.
>
> For most network devices this should be a net gain in terms of performance
> as having the outer header checksum present allows for devices to report
> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
> to determine if the inner header checksum is valid.
>
> Below is some data I collected with ixgbe with an X540 that demonstrates
> this.  I located two PFs connected back to back in two different name
> spaces and then setup a pair of tunnels on each, one with checksum enabled
> and one without.
>
> Recv   Send    Send                          Utilization
> Socket Socket  Message  Elapsed              Send
> Size   Size    Size     Time     Throughput  local
> bytes  bytes   bytes    secs.    10^6bits/s  % S
>
> noudpcsum:
>  87380  16384  16384    30.00      8898.67   12.80
> udpcsum:
>  87380  16384  16384    30.00      9088.47   5.69
>
> The one spot where this may cause a performance regression is if the
> environment contains devices that can parse the inner headers and a device
> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
> the case of such a device we have to fall back to using GSO to segment the
> tunnel instead of TSO and as a result we may take a performance hit as seen
> below with i40e.

Do you have any numbers from 40G links? Obviously, at 10G the links
are basically saturated and while I can see a difference in the
utilization rate, I suspect that the change will be much more apparent
at higher speeds.

I'm concerned about the drop in performance for devices that currently
support offloads (almost none of which expose
NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
care most about tunnel performance are the ones that already have
these NICs and will be the most impacted by the drop.

My hope is that we can continue to use TSO on devices that only
support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
length field may vary across segments. However, in practice this is
the only on the final segment and only in cases where the total length
is not a multiple of the MSS. If we could detect cases where those
conditions are met, we could continue to use TSO with the UDP checksum
field pre-populated. A possible step even further would be to break
off the final segment into a separate packet to make things conform if
necessary. This would avoid a performance regression and I think make
this more palatable to a lot of people.

> I also haven't investigated the effect this will have on OVS.  However I
> suspect the impact should be minimal as the worst case scenario should be
> that Tx checksumming will become enabled by default which should be
> consistent with the existing behavior for IPv6.

I don't think that it should cause any problems.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-19 21:53 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum " Jesse Gross
@ 2016-02-19 23:10   ` Alex Duyck
  2016-02-20  0:08     ` Jesse Gross
  0 siblings, 1 reply; 47+ messages in thread
From: Alex Duyck @ 2016-02-19 23:10 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Linux Kernel Network Developers, David Miller, Alexander Duyck

On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>> tunnels by default.  This makes the behavior consistent with how we were
>> handling this for IPv6.  In addition I have updated the internal flags for
>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>> IPv6.
>>
>> For most network devices this should be a net gain in terms of performance
>> as having the outer header checksum present allows for devices to report
>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>> to determine if the inner header checksum is valid.
>>
>> Below is some data I collected with ixgbe with an X540 that demonstrates
>> this.  I located two PFs connected back to back in two different name
>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>> and one without.
>>
>> Recv   Send    Send                          Utilization
>> Socket Socket  Message  Elapsed              Send
>> Size   Size    Size     Time     Throughput  local
>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>
>> noudpcsum:
>>  87380  16384  16384    30.00      8898.67   12.80
>> udpcsum:
>>  87380  16384  16384    30.00      9088.47   5.69
>>
>> The one spot where this may cause a performance regression is if the
>> environment contains devices that can parse the inner headers and a device
>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>> the case of such a device we have to fall back to using GSO to segment the
>> tunnel instead of TSO and as a result we may take a performance hit as seen
>> below with i40e.
>
> Do you have any numbers from 40G links? Obviously, at 10G the links
> are basically saturated and while I can see a difference in the
> utilization rate, I suspect that the change will be much more apparent
> at higher speeds.

Unfortunately I don't have any true 40G links to test with.  The
closest I can get is to run PF to VF on an i40e.  Running that I have
seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
difference being related to the fact that we are having to
allocate/free more skbs and make more trips through the
i40e_lan_xmit_frame function resulting in more descriptors.

> I'm concerned about the drop in performance for devices that currently
> support offloads (almost none of which expose
> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
> care most about tunnel performance are the ones that already have
> these NICs and will be the most impacted by the drop.

The problem is being able to transmit fast is kind of pointless if the
receiving end cannot handle it.  We hadn't gotten around to really
getting the Rx checksum bits working until the 3.18 kernel which I
don't suspect many people are running so at this point messing with
the TSO bits isn't really making much of a difference.  Then on top of
that most devices have certain limitations on how many ports they can
handle and such.  I know the i40e is supposed to support something
like 10 port numbers, but the fm10k and ixgbe are limited to one port
as I recall.  So this whole thing is already really brittle as it is.
My goal with this change is to make the behavior more consistent
across the board.

> My hope is that we can continue to use TSO on devices that only
> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
> length field may vary across segments. However, in practice this is
> the only on the final segment and only in cases where the total length
> is not a multiple of the MSS. If we could detect cases where those
> conditions are met, we could continue to use TSO with the UDP checksum
> field pre-populated. A possible step even further would be to break
> off the final segment into a separate packet to make things conform if
> necessary. This would avoid a performance regression and I think make
> this more palatable to a lot of people.

I think Tom and I had discussed this possibility a bit at netconf.
The GSO logic is something I planned on looking at over the next
several weeks as I suspect there is probably room for improvement
there.

>> I also haven't investigated the effect this will have on OVS.  However I
>> suspect the impact should be minimal as the worst case scenario should be
>> that Tx checksumming will become enabled by default which should be
>> consistent with the existing behavior for IPv6.
>
> I don't think that it should cause any problems.

Good to hear.

Do you know if OVS has some way to control the VXLAN configuration so
that it could disable Tx checksums?  If so that would probably be a
good way to address the 40G issues assuming someone is running an
environment hat had nothing but NICs that can support the TSO and Rx
checksum on inner headers.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-19 23:10   ` Alex Duyck
@ 2016-02-20  0:08     ` Jesse Gross
  2016-02-20  0:14       ` Tom Herbert
  0 siblings, 1 reply; 47+ messages in thread
From: Jesse Gross @ 2016-02-20  0:08 UTC (permalink / raw)
  To: Alex Duyck; +Cc: Linux Kernel Network Developers, David Miller, Alexander Duyck

On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote:
> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote:
>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote:
>>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>>> tunnels by default.  This makes the behavior consistent with how we were
>>> handling this for IPv6.  In addition I have updated the internal flags for
>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>>> IPv6.
>>>
>>> For most network devices this should be a net gain in terms of performance
>>> as having the outer header checksum present allows for devices to report
>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>>> to determine if the inner header checksum is valid.
>>>
>>> Below is some data I collected with ixgbe with an X540 that demonstrates
>>> this.  I located two PFs connected back to back in two different name
>>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>>> and one without.
>>>
>>> Recv   Send    Send                          Utilization
>>> Socket Socket  Message  Elapsed              Send
>>> Size   Size    Size     Time     Throughput  local
>>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>>
>>> noudpcsum:
>>>  87380  16384  16384    30.00      8898.67   12.80
>>> udpcsum:
>>>  87380  16384  16384    30.00      9088.47   5.69
>>>
>>> The one spot where this may cause a performance regression is if the
>>> environment contains devices that can parse the inner headers and a device
>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>>> the case of such a device we have to fall back to using GSO to segment the
>>> tunnel instead of TSO and as a result we may take a performance hit as seen
>>> below with i40e.
>>
>> Do you have any numbers from 40G links? Obviously, at 10G the links
>> are basically saturated and while I can see a difference in the
>> utilization rate, I suspect that the change will be much more apparent
>> at higher speeds.
>
> Unfortunately I don't have any true 40G links to test with.  The
> closest I can get is to run PF to VF on an i40e.  Running that I have
> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
> difference being related to the fact that we are having to
> allocate/free more skbs and make more trips through the
> i40e_lan_xmit_frame function resulting in more descriptors.

OK, I guess that is more or less in line with what I would expect off
the top my head. There is a reasonably significant drop in the worst
case.

>> I'm concerned about the drop in performance for devices that currently
>> support offloads (almost none of which expose
>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>> care most about tunnel performance are the ones that already have
>> these NICs and will be the most impacted by the drop.
>
> The problem is being able to transmit fast is kind of pointless if the
> receiving end cannot handle it.  We hadn't gotten around to really
> getting the Rx checksum bits working until the 3.18 kernel which I
> don't suspect many people are running so at this point messing with
> the TSO bits isn't really making much of a difference.  Then on top of
> that most devices have certain limitations on how many ports they can
> handle and such.  I know the i40e is supposed to support something
> like 10 port numbers, but the fm10k and ixgbe are limited to one port
> as I recall.  So this whole thing is already really brittle as it is.
> My goal with this change is to make the behavior more consistent
> across the board.

That's true to some degree but there are certainly plenty of cases
where TSO makes a difference - lower CPU usage, transmitting to
multiple receivers, people will upgrade their kernels, etc. It's
clearly good to make things more consistent but hopefully not by
reducing existing performance. :)

>> My hope is that we can continue to use TSO on devices that only
>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>> length field may vary across segments. However, in practice this is
>> the only on the final segment and only in cases where the total length
>> is not a multiple of the MSS. If we could detect cases where those
>> conditions are met, we could continue to use TSO with the UDP checksum
>> field pre-populated. A possible step even further would be to break
>> off the final segment into a separate packet to make things conform if
>> necessary. This would avoid a performance regression and I think make
>> this more palatable to a lot of people.
>
> I think Tom and I had discussed this possibility a bit at netconf.
> The GSO logic is something I planned on looking at over the next
> several weeks as I suspect there is probably room for improvement
> there.

That sounds great.

>>> I also haven't investigated the effect this will have on OVS.  However I
>>> suspect the impact should be minimal as the worst case scenario should be
>>> that Tx checksumming will become enabled by default which should be
>>> consistent with the existing behavior for IPv6.
>>
>> I don't think that it should cause any problems.
>
> Good to hear.
>
> Do you know if OVS has some way to control the VXLAN configuration so
> that it could disable Tx checksums?  If so that would probably be a
> good way to address the 40G issues assuming someone is running an
> environment hat had nothing but NICs that can support the TSO and Rx
> checksum on inner headers.

Yes - OVS can control tx checksums on a per-endpoint basis (actually,
rx checksum present requirements as well though it's not exposed to
the user at the moment). If you had the information then you could
optimize what to use in an environment of, say, hypervisors and
hardware switches.

However, it's certainly possible that you have a mixed set of NICs
such as an encap aware NIC on the transmit side and non-aware on the
receive side. In that case, both possible checksum settings penalize
somebody: off (lose GRO on receiver), on (lose TSO on sender assuming
no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's
important to be able to use encap TSO with local checksum to avoid
these bad tradeoffs, not to mention being cleaner.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-20  0:08     ` Jesse Gross
@ 2016-02-20  0:14       ` Tom Herbert
  2016-02-20  2:18         ` Jesse Gross
  0 siblings, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-02-20  0:14 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Alex Duyck, Linux Kernel Network Developers, David Miller,
	Alexander Duyck

On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote:
>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote:
>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote:
>>>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>>>> tunnels by default.  This makes the behavior consistent with how we were
>>>> handling this for IPv6.  In addition I have updated the internal flags for
>>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>>>> IPv6.
>>>>
>>>> For most network devices this should be a net gain in terms of performance
>>>> as having the outer header checksum present allows for devices to report
>>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>>>> to determine if the inner header checksum is valid.
>>>>
>>>> Below is some data I collected with ixgbe with an X540 that demonstrates
>>>> this.  I located two PFs connected back to back in two different name
>>>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>>>> and one without.
>>>>
>>>> Recv   Send    Send                          Utilization
>>>> Socket Socket  Message  Elapsed              Send
>>>> Size   Size    Size     Time     Throughput  local
>>>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>>>
>>>> noudpcsum:
>>>>  87380  16384  16384    30.00      8898.67   12.80
>>>> udpcsum:
>>>>  87380  16384  16384    30.00      9088.47   5.69
>>>>
>>>> The one spot where this may cause a performance regression is if the
>>>> environment contains devices that can parse the inner headers and a device
>>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>>>> the case of such a device we have to fall back to using GSO to segment the
>>>> tunnel instead of TSO and as a result we may take a performance hit as seen
>>>> below with i40e.
>>>
>>> Do you have any numbers from 40G links? Obviously, at 10G the links
>>> are basically saturated and while I can see a difference in the
>>> utilization rate, I suspect that the change will be much more apparent
>>> at higher speeds.
>>
>> Unfortunately I don't have any true 40G links to test with.  The
>> closest I can get is to run PF to VF on an i40e.  Running that I have
>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
>> difference being related to the fact that we are having to
>> allocate/free more skbs and make more trips through the
>> i40e_lan_xmit_frame function resulting in more descriptors.
>
> OK, I guess that is more or less in line with what I would expect off
> the top my head. There is a reasonably significant drop in the worst
> case.
>
>>> I'm concerned about the drop in performance for devices that currently
>>> support offloads (almost none of which expose
>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>>> care most about tunnel performance are the ones that already have
>>> these NICs and will be the most impacted by the drop.
>>
>> The problem is being able to transmit fast is kind of pointless if the
>> receiving end cannot handle it.  We hadn't gotten around to really
>> getting the Rx checksum bits working until the 3.18 kernel which I
>> don't suspect many people are running so at this point messing with
>> the TSO bits isn't really making much of a difference.  Then on top of
>> that most devices have certain limitations on how many ports they can
>> handle and such.  I know the i40e is supposed to support something
>> like 10 port numbers, but the fm10k and ixgbe are limited to one port
>> as I recall.  So this whole thing is already really brittle as it is.
>> My goal with this change is to make the behavior more consistent
>> across the board.
>
> That's true to some degree but there are certainly plenty of cases
> where TSO makes a difference - lower CPU usage, transmitting to
> multiple receivers, people will upgrade their kernels, etc. It's
> clearly good to make things more consistent but hopefully not by
> reducing existing performance. :)
>
>>> My hope is that we can continue to use TSO on devices that only
>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>>> length field may vary across segments. However, in practice this is
>>> the only on the final segment and only in cases where the total length
>>> is not a multiple of the MSS. If we could detect cases where those
>>> conditions are met, we could continue to use TSO with the UDP checksum
>>> field pre-populated. A possible step even further would be to break
>>> off the final segment into a separate packet to make things conform if
>>> necessary. This would avoid a performance regression and I think make
>>> this more palatable to a lot of people.
>>
>> I think Tom and I had discussed this possibility a bit at netconf.
>> The GSO logic is something I planned on looking at over the next
>> several weeks as I suspect there is probably room for improvement
>> there.
>
> That sounds great.
>
>>>> I also haven't investigated the effect this will have on OVS.  However I
>>>> suspect the impact should be minimal as the worst case scenario should be
>>>> that Tx checksumming will become enabled by default which should be
>>>> consistent with the existing behavior for IPv6.
>>>
>>> I don't think that it should cause any problems.
>>
>> Good to hear.
>>
>> Do you know if OVS has some way to control the VXLAN configuration so
>> that it could disable Tx checksums?  If so that would probably be a
>> good way to address the 40G issues assuming someone is running an
>> environment hat had nothing but NICs that can support the TSO and Rx
>> checksum on inner headers.
>
> Yes - OVS can control tx checksums on a per-endpoint basis (actually,
> rx checksum present requirements as well though it's not exposed to
> the user at the moment). If you had the information then you could
> optimize what to use in an environment of, say, hypervisors and
> hardware switches.
>
> However, it's certainly possible that you have a mixed set of NICs
> such as an encap aware NIC on the transmit side and non-aware on the
> receive side. In that case, both possible checksum settings penalize
> somebody: off (lose GRO on receiver), on (lose TSO on sender assuming
> no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's
> important to be able to use encap TSO with local checksum to avoid
> these bad tradeoffs, not to mention being cleaner.

By "local checksum" do you mean LCO? Seems like we should be able to
get that to work with NETIF_F_GSO_TUNNEL_CSUM.

Tom

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-20  0:14       ` Tom Herbert
@ 2016-02-20  2:18         ` Jesse Gross
  2016-02-20 19:51           ` Tom Herbert
  0 siblings, 1 reply; 47+ messages in thread
From: Jesse Gross @ 2016-02-20  2:18 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alex Duyck, Linux Kernel Network Developers, David Miller,
	Alexander Duyck

On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <jesse@kernel.org> wrote:
>> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote:
>>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote:
>>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote:
>>>>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>>>>> tunnels by default.  This makes the behavior consistent with how we were
>>>>> handling this for IPv6.  In addition I have updated the internal flags for
>>>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>>>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>>>>> IPv6.
>>>>>
>>>>> For most network devices this should be a net gain in terms of performance
>>>>> as having the outer header checksum present allows for devices to report
>>>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>>>>> to determine if the inner header checksum is valid.
>>>>>
>>>>> Below is some data I collected with ixgbe with an X540 that demonstrates
>>>>> this.  I located two PFs connected back to back in two different name
>>>>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>>>>> and one without.
>>>>>
>>>>> Recv   Send    Send                          Utilization
>>>>> Socket Socket  Message  Elapsed              Send
>>>>> Size   Size    Size     Time     Throughput  local
>>>>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>>>>
>>>>> noudpcsum:
>>>>>  87380  16384  16384    30.00      8898.67   12.80
>>>>> udpcsum:
>>>>>  87380  16384  16384    30.00      9088.47   5.69
>>>>>
>>>>> The one spot where this may cause a performance regression is if the
>>>>> environment contains devices that can parse the inner headers and a device
>>>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>>>>> the case of such a device we have to fall back to using GSO to segment the
>>>>> tunnel instead of TSO and as a result we may take a performance hit as seen
>>>>> below with i40e.
>>>>
>>>> Do you have any numbers from 40G links? Obviously, at 10G the links
>>>> are basically saturated and while I can see a difference in the
>>>> utilization rate, I suspect that the change will be much more apparent
>>>> at higher speeds.
>>>
>>> Unfortunately I don't have any true 40G links to test with.  The
>>> closest I can get is to run PF to VF on an i40e.  Running that I have
>>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
>>> difference being related to the fact that we are having to
>>> allocate/free more skbs and make more trips through the
>>> i40e_lan_xmit_frame function resulting in more descriptors.
>>
>> OK, I guess that is more or less in line with what I would expect off
>> the top my head. There is a reasonably significant drop in the worst
>> case.
>>
>>>> I'm concerned about the drop in performance for devices that currently
>>>> support offloads (almost none of which expose
>>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>>>> care most about tunnel performance are the ones that already have
>>>> these NICs and will be the most impacted by the drop.
>>>
>>> The problem is being able to transmit fast is kind of pointless if the
>>> receiving end cannot handle it.  We hadn't gotten around to really
>>> getting the Rx checksum bits working until the 3.18 kernel which I
>>> don't suspect many people are running so at this point messing with
>>> the TSO bits isn't really making much of a difference.  Then on top of
>>> that most devices have certain limitations on how many ports they can
>>> handle and such.  I know the i40e is supposed to support something
>>> like 10 port numbers, but the fm10k and ixgbe are limited to one port
>>> as I recall.  So this whole thing is already really brittle as it is.
>>> My goal with this change is to make the behavior more consistent
>>> across the board.
>>
>> That's true to some degree but there are certainly plenty of cases
>> where TSO makes a difference - lower CPU usage, transmitting to
>> multiple receivers, people will upgrade their kernels, etc. It's
>> clearly good to make things more consistent but hopefully not by
>> reducing existing performance. :)
>>
>>>> My hope is that we can continue to use TSO on devices that only
>>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>>>> length field may vary across segments. However, in practice this is
>>>> the only on the final segment and only in cases where the total length
>>>> is not a multiple of the MSS. If we could detect cases where those
>>>> conditions are met, we could continue to use TSO with the UDP checksum
>>>> field pre-populated. A possible step even further would be to break
>>>> off the final segment into a separate packet to make things conform if
>>>> necessary. This would avoid a performance regression and I think make
>>>> this more palatable to a lot of people.
>>>
>>> I think Tom and I had discussed this possibility a bit at netconf.
>>> The GSO logic is something I planned on looking at over the next
>>> several weeks as I suspect there is probably room for improvement
>>> there.
>>
>> That sounds great.
>>
>>>>> I also haven't investigated the effect this will have on OVS.  However I
>>>>> suspect the impact should be minimal as the worst case scenario should be
>>>>> that Tx checksumming will become enabled by default which should be
>>>>> consistent with the existing behavior for IPv6.
>>>>
>>>> I don't think that it should cause any problems.
>>>
>>> Good to hear.
>>>
>>> Do you know if OVS has some way to control the VXLAN configuration so
>>> that it could disable Tx checksums?  If so that would probably be a
>>> good way to address the 40G issues assuming someone is running an
>>> environment hat had nothing but NICs that can support the TSO and Rx
>>> checksum on inner headers.
>>
>> Yes - OVS can control tx checksums on a per-endpoint basis (actually,
>> rx checksum present requirements as well though it's not exposed to
>> the user at the moment). If you had the information then you could
>> optimize what to use in an environment of, say, hypervisors and
>> hardware switches.
>>
>> However, it's certainly possible that you have a mixed set of NICs
>> such as an encap aware NIC on the transmit side and non-aware on the
>> receive side. In that case, both possible checksum settings penalize
>> somebody: off (lose GRO on receiver), on (lose TSO on sender assuming
>> no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's
>> important to be able to use encap TSO with local checksum to avoid
>> these bad tradeoffs, not to mention being cleaner.
>
> By "local checksum" do you mean LCO?

Yes, that's what I meant.

> Seems like we should be able to
> get that to work with NETIF_F_GSO_TUNNEL_CSUM.

I assume you mean NETIF_F_GSO_TUNNEL (no _CSUM)?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-20  2:18         ` Jesse Gross
@ 2016-02-20 19:51           ` Tom Herbert
  2016-02-23  3:31             ` Jesse Gross
  2016-03-11 19:20             ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree
  0 siblings, 2 replies; 47+ messages in thread
From: Tom Herbert @ 2016-02-20 19:51 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Alex Duyck, Linux Kernel Network Developers, David Miller,
	Alexander Duyck

On Fri, Feb 19, 2016 at 6:18 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <jesse@kernel.org> wrote:
>>> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote:
>>>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote:
>>>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote:
>>>>>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>>>>>> tunnels by default.  This makes the behavior consistent with how we were
>>>>>> handling this for IPv6.  In addition I have updated the internal flags for
>>>>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>>>>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>>>>>> IPv6.
>>>>>>
>>>>>> For most network devices this should be a net gain in terms of performance
>>>>>> as having the outer header checksum present allows for devices to report
>>>>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>>>>>> to determine if the inner header checksum is valid.
>>>>>>
>>>>>> Below is some data I collected with ixgbe with an X540 that demonstrates
>>>>>> this.  I located two PFs connected back to back in two different name
>>>>>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>>>>>> and one without.
>>>>>>
>>>>>> Recv   Send    Send                          Utilization
>>>>>> Socket Socket  Message  Elapsed              Send
>>>>>> Size   Size    Size     Time     Throughput  local
>>>>>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>>>>>
>>>>>> noudpcsum:
>>>>>>  87380  16384  16384    30.00      8898.67   12.80
>>>>>> udpcsum:
>>>>>>  87380  16384  16384    30.00      9088.47   5.69
>>>>>>
>>>>>> The one spot where this may cause a performance regression is if the
>>>>>> environment contains devices that can parse the inner headers and a device
>>>>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>>>>>> the case of such a device we have to fall back to using GSO to segment the
>>>>>> tunnel instead of TSO and as a result we may take a performance hit as seen
>>>>>> below with i40e.
>>>>>
>>>>> Do you have any numbers from 40G links? Obviously, at 10G the links
>>>>> are basically saturated and while I can see a difference in the
>>>>> utilization rate, I suspect that the change will be much more apparent
>>>>> at higher speeds.
>>>>
>>>> Unfortunately I don't have any true 40G links to test with.  The
>>>> closest I can get is to run PF to VF on an i40e.  Running that I have
>>>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
>>>> difference being related to the fact that we are having to
>>>> allocate/free more skbs and make more trips through the
>>>> i40e_lan_xmit_frame function resulting in more descriptors.
>>>
>>> OK, I guess that is more or less in line with what I would expect off
>>> the top my head. There is a reasonably significant drop in the worst
>>> case.
>>>
>>>>> I'm concerned about the drop in performance for devices that currently
>>>>> support offloads (almost none of which expose
>>>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>>>>> care most about tunnel performance are the ones that already have
>>>>> these NICs and will be the most impacted by the drop.
>>>>
>>>> The problem is being able to transmit fast is kind of pointless if the
>>>> receiving end cannot handle it.  We hadn't gotten around to really
>>>> getting the Rx checksum bits working until the 3.18 kernel which I
>>>> don't suspect many people are running so at this point messing with
>>>> the TSO bits isn't really making much of a difference.  Then on top of
>>>> that most devices have certain limitations on how many ports they can
>>>> handle and such.  I know the i40e is supposed to support something
>>>> like 10 port numbers, but the fm10k and ixgbe are limited to one port
>>>> as I recall.  So this whole thing is already really brittle as it is.
>>>> My goal with this change is to make the behavior more consistent
>>>> across the board.
>>>
>>> That's true to some degree but there are certainly plenty of cases
>>> where TSO makes a difference - lower CPU usage, transmitting to
>>> multiple receivers, people will upgrade their kernels, etc. It's
>>> clearly good to make things more consistent but hopefully not by
>>> reducing existing performance. :)
>>>
>>>>> My hope is that we can continue to use TSO on devices that only
>>>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>>>>> length field may vary across segments. However, in practice this is
>>>>> the only on the final segment and only in cases where the total length
>>>>> is not a multiple of the MSS. If we could detect cases where those
>>>>> conditions are met, we could continue to use TSO with the UDP checksum
>>>>> field pre-populated. A possible step even further would be to break
>>>>> off the final segment into a separate packet to make things conform if
>>>>> necessary. This would avoid a performance regression and I think make
>>>>> this more palatable to a lot of people.
>>>>
>>>> I think Tom and I had discussed this possibility a bit at netconf.
>>>> The GSO logic is something I planned on looking at over the next
>>>> several weeks as I suspect there is probably room for improvement
>>>> there.
>>>
>>> That sounds great.
>>>
>>>>>> I also haven't investigated the effect this will have on OVS.  However I
>>>>>> suspect the impact should be minimal as the worst case scenario should be
>>>>>> that Tx checksumming will become enabled by default which should be
>>>>>> consistent with the existing behavior for IPv6.
>>>>>
>>>>> I don't think that it should cause any problems.
>>>>
>>>> Good to hear.
>>>>
>>>> Do you know if OVS has some way to control the VXLAN configuration so
>>>> that it could disable Tx checksums?  If so that would probably be a
>>>> good way to address the 40G issues assuming someone is running an
>>>> environment hat had nothing but NICs that can support the TSO and Rx
>>>> checksum on inner headers.
>>>
>>> Yes - OVS can control tx checksums on a per-endpoint basis (actually,
>>> rx checksum present requirements as well though it's not exposed to
>>> the user at the moment). If you had the information then you could
>>> optimize what to use in an environment of, say, hypervisors and
>>> hardware switches.
>>>
>>> However, it's certainly possible that you have a mixed set of NICs
>>> such as an encap aware NIC on the transmit side and non-aware on the
>>> receive side. In that case, both possible checksum settings penalize
>>> somebody: off (lose GRO on receiver), on (lose TSO on sender assuming
>>> no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's
>>> important to be able to use encap TSO with local checksum to avoid
>>> these bad tradeoffs, not to mention being cleaner.
>>
>> By "local checksum" do you mean LCO?
>
> Yes, that's what I meant.
>
Right. To use LCO with TSO we would need to ensure that all packets
are the same size so that the UDP length field and thus checksum are
constant for all created segments. But this property this would also
make any payload lengths in headers constant for all packets so that
the only fields that need be set per generated packet would be the TCP
sequence number and checksum. This simplifying assumption could be
used to make a very protocol-generic GSO/TSO (up to the transport
header)!

Conceptually, a device would just need to know the start of the
packet, the offset of the transport header, and the size of each
segment. Any bits from the start of the packet to the beginning of the
transport header are just copied to each segment, so any combination
of encapsulation/network protocols is  supported as long as they are
constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for
needing TSO support).

If we are able to do this then GSO could be a lot simpler and more
extensible. We should be able to eliminate all the GSO flags for GRE,
IPIP, SIT, UDP, checksum variants, shouldn't need to distinguish
between TCPv4 and TCPv6, and wouldn't need to disallow nested
encapsulations. The inner headers in the skbuf might also be removed.
GSO for SCTP or FCOE still needs a little thought, we'd need to
consider the possibility of needing both a CRC and checksum in a
single packet.

Tom

>> Seems like we should be able to
>> get that to work with NETIF_F_GSO_TUNNEL_CSUM.
>
> I assume you mean NETIF_F_GSO_TUNNEL (no _CSUM)?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck
                   ` (2 preceding siblings ...)
  2016-02-19 21:53 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum " Jesse Gross
@ 2016-02-22  3:06 ` David Miller
  3 siblings, 0 replies; 47+ messages in thread
From: David Miller @ 2016-02-22  3:06 UTC (permalink / raw)
  To: aduyck; +Cc: netdev, alexander.duyck

From: Alexander Duyck <aduyck@mirantis.com>
Date: Fri, 19 Feb 2016 11:26:17 -0800

> This patch series makes it so that we enable the outer Tx checksum
> for IPv4 tunnels by default.

Series applied, thanks Alex.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-20 19:51           ` Tom Herbert
@ 2016-02-23  3:31             ` Jesse Gross
  2016-02-23 15:18               ` Edward Cree
  2016-03-11 19:20             ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree
  1 sibling, 1 reply; 47+ messages in thread
From: Jesse Gross @ 2016-02-23  3:31 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alex Duyck, Linux Kernel Network Developers, David Miller,
	Alexander Duyck

On Sat, Feb 20, 2016 at 11:51 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Fri, Feb 19, 2016 at 6:18 PM, Jesse Gross <jesse@kernel.org> wrote:
>> On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <jesse@kernel.org> wrote:
>>>> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote:
>>>>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote:
>>>>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote:
>>>>>>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>>>>>>> tunnels by default.  This makes the behavior consistent with how we were
>>>>>>> handling this for IPv6.  In addition I have updated the internal flags for
>>>>>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>>>>>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>>>>>>> IPv6.
>>>>>>>
>>>>>>> For most network devices this should be a net gain in terms of performance
>>>>>>> as having the outer header checksum present allows for devices to report
>>>>>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>>>>>>> to determine if the inner header checksum is valid.
>>>>>>>
>>>>>>> Below is some data I collected with ixgbe with an X540 that demonstrates
>>>>>>> this.  I located two PFs connected back to back in two different name
>>>>>>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>>>>>>> and one without.
>>>>>>>
>>>>>>> Recv   Send    Send                          Utilization
>>>>>>> Socket Socket  Message  Elapsed              Send
>>>>>>> Size   Size    Size     Time     Throughput  local
>>>>>>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>>>>>>
>>>>>>> noudpcsum:
>>>>>>>  87380  16384  16384    30.00      8898.67   12.80
>>>>>>> udpcsum:
>>>>>>>  87380  16384  16384    30.00      9088.47   5.69
>>>>>>>
>>>>>>> The one spot where this may cause a performance regression is if the
>>>>>>> environment contains devices that can parse the inner headers and a device
>>>>>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>>>>>>> the case of such a device we have to fall back to using GSO to segment the
>>>>>>> tunnel instead of TSO and as a result we may take a performance hit as seen
>>>>>>> below with i40e.
>>>>>>
>>>>>> Do you have any numbers from 40G links? Obviously, at 10G the links
>>>>>> are basically saturated and while I can see a difference in the
>>>>>> utilization rate, I suspect that the change will be much more apparent
>>>>>> at higher speeds.
>>>>>
>>>>> Unfortunately I don't have any true 40G links to test with.  The
>>>>> closest I can get is to run PF to VF on an i40e.  Running that I have
>>>>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
>>>>> difference being related to the fact that we are having to
>>>>> allocate/free more skbs and make more trips through the
>>>>> i40e_lan_xmit_frame function resulting in more descriptors.
>>>>
>>>> OK, I guess that is more or less in line with what I would expect off
>>>> the top my head. There is a reasonably significant drop in the worst
>>>> case.
>>>>
>>>>>> I'm concerned about the drop in performance for devices that currently
>>>>>> support offloads (almost none of which expose
>>>>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>>>>>> care most about tunnel performance are the ones that already have
>>>>>> these NICs and will be the most impacted by the drop.
>>>>>
>>>>> The problem is being able to transmit fast is kind of pointless if the
>>>>> receiving end cannot handle it.  We hadn't gotten around to really
>>>>> getting the Rx checksum bits working until the 3.18 kernel which I
>>>>> don't suspect many people are running so at this point messing with
>>>>> the TSO bits isn't really making much of a difference.  Then on top of
>>>>> that most devices have certain limitations on how many ports they can
>>>>> handle and such.  I know the i40e is supposed to support something
>>>>> like 10 port numbers, but the fm10k and ixgbe are limited to one port
>>>>> as I recall.  So this whole thing is already really brittle as it is.
>>>>> My goal with this change is to make the behavior more consistent
>>>>> across the board.
>>>>
>>>> That's true to some degree but there are certainly plenty of cases
>>>> where TSO makes a difference - lower CPU usage, transmitting to
>>>> multiple receivers, people will upgrade their kernels, etc. It's
>>>> clearly good to make things more consistent but hopefully not by
>>>> reducing existing performance. :)
>>>>
>>>>>> My hope is that we can continue to use TSO on devices that only
>>>>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>>>>>> length field may vary across segments. However, in practice this is
>>>>>> the only on the final segment and only in cases where the total length
>>>>>> is not a multiple of the MSS. If we could detect cases where those
>>>>>> conditions are met, we could continue to use TSO with the UDP checksum
>>>>>> field pre-populated. A possible step even further would be to break
>>>>>> off the final segment into a separate packet to make things conform if
>>>>>> necessary. This would avoid a performance regression and I think make
>>>>>> this more palatable to a lot of people.
>>>>>
>>>>> I think Tom and I had discussed this possibility a bit at netconf.
>>>>> The GSO logic is something I planned on looking at over the next
>>>>> several weeks as I suspect there is probably room for improvement
>>>>> there.
>>>>
>>>> That sounds great.
>>>>
>>>>>>> I also haven't investigated the effect this will have on OVS.  However I
>>>>>>> suspect the impact should be minimal as the worst case scenario should be
>>>>>>> that Tx checksumming will become enabled by default which should be
>>>>>>> consistent with the existing behavior for IPv6.
>>>>>>
>>>>>> I don't think that it should cause any problems.
>>>>>
>>>>> Good to hear.
>>>>>
>>>>> Do you know if OVS has some way to control the VXLAN configuration so
>>>>> that it could disable Tx checksums?  If so that would probably be a
>>>>> good way to address the 40G issues assuming someone is running an
>>>>> environment hat had nothing but NICs that can support the TSO and Rx
>>>>> checksum on inner headers.
>>>>
>>>> Yes - OVS can control tx checksums on a per-endpoint basis (actually,
>>>> rx checksum present requirements as well though it's not exposed to
>>>> the user at the moment). If you had the information then you could
>>>> optimize what to use in an environment of, say, hypervisors and
>>>> hardware switches.
>>>>
>>>> However, it's certainly possible that you have a mixed set of NICs
>>>> such as an encap aware NIC on the transmit side and non-aware on the
>>>> receive side. In that case, both possible checksum settings penalize
>>>> somebody: off (lose GRO on receiver), on (lose TSO on sender assuming
>>>> no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's
>>>> important to be able to use encap TSO with local checksum to avoid
>>>> these bad tradeoffs, not to mention being cleaner.
>>>
>>> By "local checksum" do you mean LCO?
>>
>> Yes, that's what I meant.
>>
> Right. To use LCO with TSO we would need to ensure that all packets
> are the same size so that the UDP length field and thus checksum are
> constant for all created segments. But this property this would also
> make any payload lengths in headers constant for all packets so that
> the only fields that need be set per generated packet would be the TCP
> sequence number and checksum. This simplifying assumption could be
> used to make a very protocol-generic GSO/TSO (up to the transport
> header)!
>
> Conceptually, a device would just need to know the start of the
> packet, the offset of the transport header, and the size of each
> segment. Any bits from the start of the packet to the beginning of the
> transport header are just copied to each segment, so any combination
> of encapsulation/network protocols is  supported as long as they are
> constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for
> needing TSO support).
>
> If we are able to do this then GSO could be a lot simpler and more
> extensible. We should be able to eliminate all the GSO flags for GRE,
> IPIP, SIT, UDP, checksum variants, shouldn't need to distinguish
> between TCPv4 and TCPv6, and wouldn't need to disallow nested
> encapsulations. The inner headers in the skbuf might also be removed.
> GSO for SCTP or FCOE still needs a little thought, we'd need to
> consider the possibility of needing both a CRC and checksum in a
> single packet.

Yes, I think this is definitely a good direction to go in general. At
that point, the main distinguishing feature of TSO support in NICs
would basically be the depth into the packet that the card is capable
of manipulating the L4 header. I assume that the NICs that do
encapsulation offloads would be able to handle the same depth when
they are not doing encapsulation and I know that some NICs (such as
ixgbe) don't explicitly support TSO with encapsulation but can handle
headers deeper in the packet than they current expose.

The only issue that I see is that making TSO completely unaware of
outer headers will likely cause performance regressions in some cases.
Imagine if we have an incoming TCP stream with incrementing IP IDs
that we aggregate through GRO and forward. Today's TSO would be able
to recreate the stream by incrementing the ID as new segments are
created. However, if the outgoing NIC is truly only dealing with the
L4 header then it wouldn't be able to do this.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23  3:31             ` Jesse Gross
@ 2016-02-23 15:18               ` Edward Cree
  2016-02-23 16:47                 ` Tom Herbert
  0 siblings, 1 reply; 47+ messages in thread
From: Edward Cree @ 2016-02-23 15:18 UTC (permalink / raw)
  To: Jesse Gross, Tom Herbert
  Cc: Alex Duyck, Linux Kernel Network Developers, David Miller,
	Alexander Duyck

On 23/02/16 03:31, Jesse Gross wrote:
> The only issue that I see is that making TSO completely unaware of
> outer headers will likely cause performance regressions in some cases.
> Imagine if we have an incoming TCP stream with incrementing IP IDs
> that we aggregate through GRO and forward. Today's TSO would be able
> to recreate the stream by incrementing the ID as new segments are
> created. However, if the outgoing NIC is truly only dealing with the
> L4 header then it wouldn't be able to do this.
Perhaps TSO should force setting the DF bit, so that the IP ID can be
ignored.  After all, if your network is going to cause fragmentation and
reassembly, your performance will probably be bad enough that TSO won't
help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
Arguably, as soon as we perform GRO on traffic to be forwarded, we've
already violated the end-to-end principle (there are always imaginable
situations in which a different packet stream comes out than went in),
so it doesn't really matter if we go on to change the network layer
parameters in this way - it's not really the same IP datagram any more
so it's OK for its identification to change.
And of course this problem doesn't present itself for IPv6 :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 15:18               ` Edward Cree
@ 2016-02-23 16:47                 ` Tom Herbert
  2016-02-23 17:20                   ` Rick Jones
                                     ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Tom Herbert @ 2016-02-23 16:47 UTC (permalink / raw)
  To: Edward Cree
  Cc: Jesse Gross, Alex Duyck, Linux Kernel Network Developers,
	David Miller, Alexander Duyck

On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 23/02/16 03:31, Jesse Gross wrote:
>> The only issue that I see is that making TSO completely unaware of
>> outer headers will likely cause performance regressions in some cases.
>> Imagine if we have an incoming TCP stream with incrementing IP IDs
>> that we aggregate through GRO and forward. Today's TSO would be able
>> to recreate the stream by incrementing the ID as new segments are
>> created. However, if the outgoing NIC is truly only dealing with the
>> L4 header then it wouldn't be able to do this.
> Perhaps TSO should force setting the DF bit, so that the IP ID can be
> ignored.  After all, if your network is going to cause fragmentation and
> reassembly, your performance will probably be bad enough that TSO won't
> help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
> Arguably, as soon as we perform GRO on traffic to be forwarded, we've
> already violated the end-to-end principle (there are always imaginable
> situations in which a different packet stream comes out than went in),
> so it doesn't really matter if we go on to change the network layer
> parameters in this way - it's not really the same IP datagram any more
> so it's OK for its identification to change.
> And of course this problem doesn't present itself for IPv6 :)

Right, GRO should probably not coalesce packets with non-zero IP
identifiers due to the loss of information. Besides that, RFC6848 says
the IP identifier should only be set for fragmentation anyway so there
shouldn't be any issue and really no need for HW TSO (or LRO) to
support that.

We need to do increment IP identifier in UFO, but I only see one
device (neterion) that advertises NETIF_F_UFO-- honestly, removing
that feature might be another good simplification!

Tom

> --
> -Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 16:47                 ` Tom Herbert
@ 2016-02-23 17:20                   ` Rick Jones
  2016-02-23 17:38                     ` Edward Cree
  2016-02-23 17:31                   ` Jesse Gross
  2016-02-25 20:14                   ` David Miller
  2 siblings, 1 reply; 47+ messages in thread
From: Rick Jones @ 2016-02-23 17:20 UTC (permalink / raw)
  To: Tom Herbert, Edward Cree
  Cc: Jesse Gross, Alex Duyck, Linux Kernel Network Developers,
	David Miller, Alexander Duyck

On 02/23/2016 08:47 AM, Tom Herbert wrote:
> Right, GRO should probably not coalesce packets with non-zero IP
> identifiers due to the loss of information. Besides that, RFC6848 says
> the IP identifier should only be set for fragmentation anyway so there
> shouldn't be any issue and really no need for HW TSO (or LRO) to
> support that.

You sure that is RFC 6848 "Specifying Civic Address Extensions in the 
Presence Information Data Format Location Object (PIDF-LO)" ?

In whichever RFC that may be, is it a SHOULD or a MUST, and just how 
many "other" stacks might be setting a non-zero IP ID on fragments with 
DF set?

rick jones

> We need to do increment IP identifier in UFO, but I only see one
> device (neterion) that advertises NETIF_F_UFO-- honestly, removing
> that feature might be another good simplification!
>
> Tom
>
>> --
>> -Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 16:47                 ` Tom Herbert
  2016-02-23 17:20                   ` Rick Jones
@ 2016-02-23 17:31                   ` Jesse Gross
  2016-02-23 17:42                     ` Tom Herbert
  2016-02-23 18:24                     ` David Miller
  2016-02-25 20:14                   ` David Miller
  2 siblings, 2 replies; 47+ messages in thread
From: Jesse Gross @ 2016-02-23 17:31 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Edward Cree, Alex Duyck, Linux Kernel Network Developers,
	David Miller, Alexander Duyck

On Tue, Feb 23, 2016 at 8:47 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree <ecree@solarflare.com> wrote:
>> On 23/02/16 03:31, Jesse Gross wrote:
>>> The only issue that I see is that making TSO completely unaware of
>>> outer headers will likely cause performance regressions in some cases.
>>> Imagine if we have an incoming TCP stream with incrementing IP IDs
>>> that we aggregate through GRO and forward. Today's TSO would be able
>>> to recreate the stream by incrementing the ID as new segments are
>>> created. However, if the outgoing NIC is truly only dealing with the
>>> L4 header then it wouldn't be able to do this.
>> Perhaps TSO should force setting the DF bit, so that the IP ID can be
>> ignored.  After all, if your network is going to cause fragmentation and
>> reassembly, your performance will probably be bad enough that TSO won't
>> help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
>> Arguably, as soon as we perform GRO on traffic to be forwarded, we've
>> already violated the end-to-end principle (there are always imaginable
>> situations in which a different packet stream comes out than went in),
>> so it doesn't really matter if we go on to change the network layer
>> parameters in this way - it's not really the same IP datagram any more
>> so it's OK for its identification to change.
>> And of course this problem doesn't present itself for IPv6 :)
>
> Right, GRO should probably not coalesce packets with non-zero IP
> identifiers due to the loss of information. Besides that, RFC6848 says
> the IP identifier should only be set for fragmentation anyway so there
> shouldn't be any issue and really no need for HW TSO (or LRO) to
> support that.

Most OSs (including Linux with connected TCP sockets) use non-zero IP
IDs so requiring this would effectively disable GRO.

I think the practical way to go about this is to introduce a new GSO
type for L4-only offload. There are some existing types that we could
immediately convert and kill off with no impact (such as GRE) and some
new protocols that would come for free (such as MPLS) so it would be a
net win. Once the infrastructure is in, it will be easier to evaluate
what else can be simplified on a case by case basis. (i.e. Even
UDP_TUNNEL will have some potential adverse impact from this compared
to explicit support since we'd need to break off the last segment from
a TSO burst where the size isn't an even multiple of the MSS. I guess
the impact is probably small but it would be good to know.)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 17:20                   ` Rick Jones
@ 2016-02-23 17:38                     ` Edward Cree
  2016-02-23 18:08                       ` David Miller
  2016-02-23 18:11                       ` Tom Herbert
  0 siblings, 2 replies; 47+ messages in thread
From: Edward Cree @ 2016-02-23 17:38 UTC (permalink / raw)
  To: Rick Jones, Tom Herbert
  Cc: Jesse Gross, Alex Duyck, Linux Kernel Network Developers,
	David Miller, Alexander Duyck

On 23/02/16 17:20, Rick Jones wrote:
> On 02/23/2016 08:47 AM, Tom Herbert wrote:
>> Right, GRO should probably not coalesce packets with non-zero IP
>> identifiers due to the loss of information. Besides that, RFC6848 says
>> the IP identifier should only be set for fragmentation anyway so there
>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>> support that.
>
> You sure that is RFC 6848 "Specifying Civic Address Extensions in the Presence Information Data Format Location Object (PIDF-LO)" ?
PossiblyRFC 6864 "Updated Specification of the IPv4 ID Field".
> In whichever RFC that may be, is it a SHOULD or a MUST, and just how many "other" stacks might be setting a non-zero IP ID on fragments with DF set?
"The IPv4 ID field MUST NOT be used for purposes other than fragmentation
 and reassembly."(§4.1)
"Originating sources MAY set the IPv4 ID field of atomic datagrams to any
 value."(§4.1)
"All devices that examine IPv4 headers MUST ignore the IPv4 ID field of
 atomic datagrams."(§4.1)
Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4).

So it's OK to coalesce packets with different identifiers, as long as they
have DFset (and aren't fragmented already).  Also, the RFC takes pains to
point out that it "does not reserve any IPv4 ID values, including 0, as
distinguished" (§4.1), so one cannot rely on the ID always being zero.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 17:31                   ` Jesse Gross
@ 2016-02-23 17:42                     ` Tom Herbert
  2016-02-23 18:18                       ` Alexander Duyck
  2016-02-23 18:26                       ` David Miller
  2016-02-23 18:24                     ` David Miller
  1 sibling, 2 replies; 47+ messages in thread
From: Tom Herbert @ 2016-02-23 17:42 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Edward Cree, Alex Duyck, Linux Kernel Network Developers,
	David Miller, Alexander Duyck

On Tue, Feb 23, 2016 at 9:31 AM, Jesse Gross <jesse@kernel.org> wrote:
> On Tue, Feb 23, 2016 at 8:47 AM, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> On 23/02/16 03:31, Jesse Gross wrote:
>>>> The only issue that I see is that making TSO completely unaware of
>>>> outer headers will likely cause performance regressions in some cases.
>>>> Imagine if we have an incoming TCP stream with incrementing IP IDs
>>>> that we aggregate through GRO and forward. Today's TSO would be able
>>>> to recreate the stream by incrementing the ID as new segments are
>>>> created. However, if the outgoing NIC is truly only dealing with the
>>>> L4 header then it wouldn't be able to do this.
>>> Perhaps TSO should force setting the DF bit, so that the IP ID can be
>>> ignored.  After all, if your network is going to cause fragmentation and
>>> reassembly, your performance will probably be bad enough that TSO won't
>>> help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
>>> Arguably, as soon as we perform GRO on traffic to be forwarded, we've
>>> already violated the end-to-end principle (there are always imaginable
>>> situations in which a different packet stream comes out than went in),
>>> so it doesn't really matter if we go on to change the network layer
>>> parameters in this way - it's not really the same IP datagram any more
>>> so it's OK for its identification to change.
>>> And of course this problem doesn't present itself for IPv6 :)
>>
>> Right, GRO should probably not coalesce packets with non-zero IP
>> identifiers due to the loss of information. Besides that, RFC6848 says
>> the IP identifier should only be set for fragmentation anyway so there
>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>> support that.
>
> Most OSs (including Linux with connected TCP sockets) use non-zero IP
> IDs so requiring this would effectively disable GRO.
>
> I think the practical way to go about this is to introduce a new GSO
> type for L4-only offload. There are some existing types that we could
> immediately convert and kill off with no impact (such as GRE) and some
> new protocols that would come for free (such as MPLS) so it would be a
> net win. Once the infrastructure is in, it will be easier to evaluate
> what else can be simplified on a case by case basis. (i.e. Even
> UDP_TUNNEL will have some potential adverse impact from this compared
> to explicit support since we'd need to break off the last segment from
> a TSO burst where the size isn't an even multiple of the MSS. I guess
> the impact is probably small but it would be good to know.)

Why not just fix the stack to conform to RFC6864? As Edward pointed
out we lose the actual IP ID's in GRO anyway, so attempts to set them
in GSO may be wildly incorrect from the source point of view-- even in
that case were probably better off changing the IP identifier to zero
(okay since we're already breaking the E2E model anyway :-) ).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 17:38                     ` Edward Cree
@ 2016-02-23 18:08                       ` David Miller
  2016-02-23 20:20                         ` Edward Cree
  2016-02-23 18:11                       ` Tom Herbert
  1 sibling, 1 reply; 47+ messages in thread
From: David Miller @ 2016-02-23 18:08 UTC (permalink / raw)
  To: ecree; +Cc: rick.jones2, tom, jesse, aduyck, netdev, alexander.duyck

From: Edward Cree <ecree@solarflare.com>
Date: Tue, 23 Feb 2016 17:38:28 +0000

> On 23/02/16 17:20, Rick Jones wrote:
>> On 02/23/2016 08:47 AM, Tom Herbert wrote:
>>> Right, GRO should probably not coalesce packets with non-zero IP
>>> identifiers due to the loss of information. Besides that, RFC6848 says
>>> the IP identifier should only be set for fragmentation anyway so there
>>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>>> support that.
>>
>> You sure that is RFC 6848 "Specifying Civic Address Extensions in the Presence Information Data Format Location Object (PIDF-LO)" ?
> PossiblyRFC 6864 "Updated Specification of the IPv4 ID Field".
>> In whichever RFC that may be, is it a SHOULD or a MUST, and just how many "other" stacks might be setting a non-zero IP ID on fragments with DF set?
> "The IPv4 ID field MUST NOT be used for purposes other than fragmentation
>  and reassembly."(§4.1)
> "Originating sources MAY set the IPv4 ID field of atomic datagrams to any
>  value."(§4.1)
> "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of
>  atomic datagrams."(§4.1)
> Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4).
> 
> So it's OK to coalesce packets with different identifiers, as long as they
> have DFset (and aren't fragmented already).  Also, the RFC takes pains to
> point out that it "does not reserve any IPv4 ID values, including 0, as
> distinguished" (§4.1), so one cannot rely on the ID always being zero.

Just a reminder that a very long time ago we tried setting the IP ID
field to zero for DF packets, and this broke SLHC because that expects
a monotonically increasing IP ID field.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 17:38                     ` Edward Cree
  2016-02-23 18:08                       ` David Miller
@ 2016-02-23 18:11                       ` Tom Herbert
  1 sibling, 0 replies; 47+ messages in thread
From: Tom Herbert @ 2016-02-23 18:11 UTC (permalink / raw)
  To: Edward Cree
  Cc: Rick Jones, Jesse Gross, Alex Duyck,
	Linux Kernel Network Developers, David Miller, Alexander Duyck

On Tue, Feb 23, 2016 at 9:38 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 23/02/16 17:20, Rick Jones wrote:
>> On 02/23/2016 08:47 AM, Tom Herbert wrote:
>>> Right, GRO should probably not coalesce packets with non-zero IP
>>> identifiers due to the loss of information. Besides that, RFC6848 says
>>> the IP identifier should only be set for fragmentation anyway so there
>>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>>> support that.
>>
>> You sure that is RFC 6848 "Specifying Civic Address Extensions in the Presence Information Data Format Location Object (PIDF-LO)" ?
> PossiblyRFC 6864 "Updated Specification of the IPv4 ID Field".

Yes RFC6864.

>> In whichever RFC that may be, is it a SHOULD or a MUST, and just how many "other" stacks might be setting a non-zero IP ID on fragments with DF set?
> "The IPv4 ID field MUST NOT be used for purposes other than fragmentation
>  and reassembly."(§4.1)
> "Originating sources MAY set the IPv4 ID field of atomic datagrams to any
>  value."(§4.1)
> "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of
>  atomic datagrams."(§4.1)
> Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4).
>
> So it's OK to coalesce packets with different identifiers, as long as they
> have DFset (and aren't fragmented already).  Also, the RFC takes pains to
> point out that it "does not reserve any IPv4 ID values, including 0, as
> distinguished" (§4.1), so one cannot rely on the ID always being zero.

Right, receive side is straightforward, just ignore IP IDs. Transmit
is more interesting. Operative code is in ip_select_ident_segs. The
comment as to why non-zero IDs are sent is:

          /* This is only to work around buggy Windows95/2000
           * VJ compression implementations.  If the ID field
           * does not change, they drop every other packet in
           * a TCP stream using header compression.
           */

> --
> -Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 17:42                     ` Tom Herbert
@ 2016-02-23 18:18                       ` Alexander Duyck
  2016-02-23 18:26                       ` David Miller
  1 sibling, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2016-02-23 18:18 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesse Gross, Edward Cree, Alex Duyck,
	Linux Kernel Network Developers, David Miller

On Tue, Feb 23, 2016 at 9:42 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Feb 23, 2016 at 9:31 AM, Jesse Gross <jesse@kernel.org> wrote:
>> On Tue, Feb 23, 2016 at 8:47 AM, Tom Herbert <tom@herbertland.com> wrote:
>>> On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree <ecree@solarflare.com> wrote:
>>>> On 23/02/16 03:31, Jesse Gross wrote:
>>>>> The only issue that I see is that making TSO completely unaware of
>>>>> outer headers will likely cause performance regressions in some cases.
>>>>> Imagine if we have an incoming TCP stream with incrementing IP IDs
>>>>> that we aggregate through GRO and forward. Today's TSO would be able
>>>>> to recreate the stream by incrementing the ID as new segments are
>>>>> created. However, if the outgoing NIC is truly only dealing with the
>>>>> L4 header then it wouldn't be able to do this.
>>>> Perhaps TSO should force setting the DF bit, so that the IP ID can be
>>>> ignored.  After all, if your network is going to cause fragmentation and
>>>> reassembly, your performance will probably be bad enough that TSO won't
>>>> help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
>>>> Arguably, as soon as we perform GRO on traffic to be forwarded, we've
>>>> already violated the end-to-end principle (there are always imaginable
>>>> situations in which a different packet stream comes out than went in),
>>>> so it doesn't really matter if we go on to change the network layer
>>>> parameters in this way - it's not really the same IP datagram any more
>>>> so it's OK for its identification to change.
>>>> And of course this problem doesn't present itself for IPv6 :)
>>>
>>> Right, GRO should probably not coalesce packets with non-zero IP
>>> identifiers due to the loss of information. Besides that, RFC6848 says
>>> the IP identifier should only be set for fragmentation anyway so there
>>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>>> support that.
>>
>> Most OSs (including Linux with connected TCP sockets) use non-zero IP
>> IDs so requiring this would effectively disable GRO.
>>
>> I think the practical way to go about this is to introduce a new GSO
>> type for L4-only offload. There are some existing types that we could
>> immediately convert and kill off with no impact (such as GRE) and some
>> new protocols that would come for free (such as MPLS) so it would be a
>> net win. Once the infrastructure is in, it will be easier to evaluate
>> what else can be simplified on a case by case basis. (i.e. Even
>> UDP_TUNNEL will have some potential adverse impact from this compared
>> to explicit support since we'd need to break off the last segment from
>> a TSO burst where the size isn't an even multiple of the MSS. I guess
>> the impact is probably small but it would be good to know.)
>
> Why not just fix the stack to conform to RFC6864? As Edward pointed
> out we lose the actual IP ID's in GRO anyway, so attempts to set them
> in GSO may be wildly incorrect from the source point of view-- even in
> that case were probably better off changing the IP identifier to zero
> (okay since we're already breaking the E2E model anyway :-) ).

The wording of RFC6864 seems to imply that we can ignore the IP ID
field in the case of DF being set and the MF and fragment offset bits
being cleared.  It states we can use an arbitrary value for the IP ID
on "atomic" frames so we can probably just leave it at whatever the
initial value is for the frame to be segmented, not need to force it
to 0.

The problem as I see it is that we will need to update GRO so that it
is willing to accept frames with an inner IP ID that is not
incrementing for atomic frames before we can really get into the GSO
side of the equation.  From what I can tell it looks like currently it
doesn't honor that and requires IP ID to increment in order to
coalesce frames.

I will look into trying to setup TSO for these devices like I did
NETIF_F_HW_CSUM for the Intel parts.  Basically we will leave the
outer IP header and the inner transport header to be handled by the
hardware, and then we can compute the length and checksum for the UDP
header and inner IP header.  That way we can deal with things like
VLAN tags that need to be inserted before the outer network header
while maintaining the IP ID for the outer IP header as well since most
devices seem to handle that correctly.

- Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 17:31                   ` Jesse Gross
  2016-02-23 17:42                     ` Tom Herbert
@ 2016-02-23 18:24                     ` David Miller
  2016-02-24  9:58                       ` David Laight
  1 sibling, 1 reply; 47+ messages in thread
From: David Miller @ 2016-02-23 18:24 UTC (permalink / raw)
  To: jesse; +Cc: tom, ecree, aduyck, netdev, alexander.duyck

From: Jesse Gross <jesse@kernel.org>
Date: Tue, 23 Feb 2016 09:31:09 -0800

> Most OSs (including Linux with connected TCP sockets) use non-zero IP
> IDs so requiring this would effectively disable GRO.

+1

Any OS that wants to work with SLHC, as I mentioned, has to emit
monotonically increasing IP ID values in all packets, even those with
DF set.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 17:42                     ` Tom Herbert
  2016-02-23 18:18                       ` Alexander Duyck
@ 2016-02-23 18:26                       ` David Miller
  2016-02-23 18:32                         ` Tom Herbert
  1 sibling, 1 reply; 47+ messages in thread
From: David Miller @ 2016-02-23 18:26 UTC (permalink / raw)
  To: tom; +Cc: jesse, ecree, aduyck, netdev, alexander.duyck

From: Tom Herbert <tom@herbertland.com>
Date: Tue, 23 Feb 2016 09:42:00 -0800

> Why not just fix the stack to conform to RFC6864? As Edward pointed
> out we lose the actual IP ID's in GRO anyway, so attempts to set them
> in GSO may be wildly incorrect from the source point of view-- even in
> that case were probably better off changing the IP identifier to zero
> (okay since we're already breaking the E2E model anyway :-) ).

Tom, you can't, you'll break TCP header compression schemes which
expect a monotonically increasing IP ID value.

We tried setting the IP ID to zero for frames with DF set, it broke
things.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 18:26                       ` David Miller
@ 2016-02-23 18:32                         ` Tom Herbert
  0 siblings, 0 replies; 47+ messages in thread
From: Tom Herbert @ 2016-02-23 18:32 UTC (permalink / raw)
  To: David Miller
  Cc: Jesse Gross, Edward Cree, Alex Duyck,
	Linux Kernel Network Developers, Alexander Duyck

On Tue, Feb 23, 2016 at 10:26 AM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Tue, 23 Feb 2016 09:42:00 -0800
>
>> Why not just fix the stack to conform to RFC6864? As Edward pointed
>> out we lose the actual IP ID's in GRO anyway, so attempts to set them
>> in GSO may be wildly incorrect from the source point of view-- even in
>> that case were probably better off changing the IP identifier to zero
>> (okay since we're already breaking the E2E model anyway :-) ).
>
> Tom, you can't, you'll break TCP header compression schemes which
> expect a monotonically increasing IP ID value.
>
> We tried setting the IP ID to zero for frames with DF set, it broke
> things.

All the more reason to move to IPv6 ;-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 18:08                       ` David Miller
@ 2016-02-23 20:20                         ` Edward Cree
  2016-02-23 23:11                           ` David Miller
  2016-02-24  0:53                           ` Tom Herbert
  0 siblings, 2 replies; 47+ messages in thread
From: Edward Cree @ 2016-02-23 20:20 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, tom, jesse, aduyck, netdev, alexander.duyck

On 23/02/16 18:08, David Miller wrote:
> From: Edward Cree <ecree@solarflare.com>
> Date: Tue, 23 Feb 2016 17:38:28 +0000
>
>> "The IPv4 ID field MUST NOT be used for purposes other than fragmentation
>>  and reassembly."(§4.1)
>> "Originating sources MAY set the IPv4 ID field of atomic datagrams to any
>>  value."(§4.1)
>> "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of
>>  atomic datagrams."(§4.1)
>> Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4).
>>
>> So it's OK to coalesce packets with different identifiers, as long as they
>> have DFset (and aren't fragmented already).  Also, the RFC takes pains to
>> point out that it "does not reserve any IPv4 ID values, including 0, as
>> distinguished" (§4.1), so one cannot rely on the ID always being zero.
> Just a reminder that a very long time ago we tried setting the IP ID
> field to zero for DF packets, and this broke SLHC because that expects
> a monotonically increasing IP ID field.
If I'm reading it right, Linux's SLHC implementation can cope [1] with
unchanging IP IDs (or IDs that do something other than increment), which it
encodes with the NEW_I bit.  RFC1144 supports this, remarking "note that it
may be zero or negative" (§3.2.3) of the ID delta.  So VJ implementations that
can't handle it really are buggy, not just exhibiting a difference of opinion.

So it seems like the problem case is where some sort of SLIP gateway lies
between a Linux system (say, a webserver sending with TSO) and a buggy client;
in that case, even if the client can't be fixed it seems like the gateway
could be (either to send NEW_I or just lie about the incoming IP IDs and claim
they're incrementing - the latter is entirely safe for a DF packet).

[ I really hope we can figure out a way not to change IP IDs, because I think
an inverted version of Tom's generic TSO could also work as a generic h/w GRO
accelerator.  In its simplest form, the hardware just remembers the previous
packet, then gives us the length of the common prefix.  If that ends four bytes
into a TCP header, then the packet is a candidate for GRO; otherwise not.  I
haven't worked out all the details yet, but it's clear that non-constant IP IDs
would cause problems.  (If it's only the innermost IP ID that's changing, we
can probably still cope, but the performance gain will be less and the
implementation could start to get ugly.) ]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 20:20                         ` Edward Cree
@ 2016-02-23 23:11                           ` David Miller
  2016-02-24  0:53                           ` Tom Herbert
  1 sibling, 0 replies; 47+ messages in thread
From: David Miller @ 2016-02-23 23:11 UTC (permalink / raw)
  To: ecree; +Cc: rick.jones2, tom, jesse, aduyck, netdev, alexander.duyck

From: Edward Cree <ecree@solarflare.com>
Date: Tue, 23 Feb 2016 20:20:50 +0000

> So VJ implementations that can't handle it really are buggy, not
> just exhibiting a difference of opinion.

Buggy or not, they exist, and we have to cope with them.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 20:20                         ` Edward Cree
  2016-02-23 23:11                           ` David Miller
@ 2016-02-24  0:53                           ` Tom Herbert
  2016-02-24 17:30                             ` Edward Cree
  1 sibling, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-02-24  0:53 UTC (permalink / raw)
  To: Edward Cree
  Cc: David Miller, Rick Jones, Jesse Gross, Alex Duyck,
	Linux Kernel Network Developers, Alexander Duyck

> [ I really hope we can figure out a way not to change IP IDs, because I think
> an inverted version of Tom's generic TSO could also work as a generic h/w GRO
> accelerator.  In its simplest form, the hardware just remembers the previous
> packet, then gives us the length of the common prefix.  If that ends four bytes
> into a TCP header, then the packet is a candidate for GRO; otherwise not.  I
> haven't worked out all the details yet, but it's clear that non-constant IP IDs
> would cause problems.  (If it's only the innermost IP ID that's changing, we
> can probably still cope, but the performance gain will be less and the
> implementation could start to get ugly.) ]

That's an interesting idea. This should work in IPv6 now and nearly
all encapsulation protocols (GRE w/ csum doesn't work this way for
instance), the logic to just match to a GRO state would be as simple
as comparing packet hashes and then do a memcmp over the headers. For
IPv4 maybe do a masked compare-- we still need to validate the header
checksum but that is easily checked without checksumming over the
whole header by just adding in the diff in the dynamic fields (i.e.
something like C_new == C_Old + (IP_ID_new - IP_ID_old)

Tom

>
> --
> -Ed
>
> [1] http://lxr.free-electrons.com/source/drivers/net/slip/slhc.c#L425

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 18:24                     ` David Miller
@ 2016-02-24  9:58                       ` David Laight
  2016-02-24 15:41                         ` David Miller
  0 siblings, 1 reply; 47+ messages in thread
From: David Laight @ 2016-02-24  9:58 UTC (permalink / raw)
  To: 'David Miller', jesse; +Cc: tom, ecree, aduyck, netdev, alexander.duyck

From: David Miller
> Sent: 23 February 2016 18:25
> 
> From: Jesse Gross <jesse@kernel.org>
> Date: Tue, 23 Feb 2016 09:31:09 -0800
> 
> > Most OSs (including Linux with connected TCP sockets) use non-zero IP
> > IDs so requiring this would effectively disable GRO.
> 
> +1
> 
> Any OS that wants to work with SLHC, as I mentioned, has to emit
> monotonically increasing IP ID values in all packets, even those with
> DF set.

Doesn't that leak a lot of info about the sending system?
ISTR one OS deliberately randomising the IP ID values in order
to avoid giving out information about the number of packets being sent.

	David

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-24  9:58                       ` David Laight
@ 2016-02-24 15:41                         ` David Miller
  0 siblings, 0 replies; 47+ messages in thread
From: David Miller @ 2016-02-24 15:41 UTC (permalink / raw)
  To: David.Laight; +Cc: jesse, tom, ecree, aduyck, netdev, alexander.duyck

From: David Laight <David.Laight@ACULAB.COM>
Date: Wed, 24 Feb 2016 09:58:03 +0000

> From: David Miller
>> Sent: 23 February 2016 18:25
>> 
>> From: Jesse Gross <jesse@kernel.org>
>> Date: Tue, 23 Feb 2016 09:31:09 -0800
>> 
>> > Most OSs (including Linux with connected TCP sockets) use non-zero IP
>> > IDs so requiring this would effectively disable GRO.
>> 
>> +1
>> 
>> Any OS that wants to work with SLHC, as I mentioned, has to emit
>> monotonically increasing IP ID values in all packets, even those with
>> DF set.
> 
> Doesn't that leak a lot of info about the sending system?
> ISTR one OS deliberately randomising the IP ID values in order
> to avoid giving out information about the number of packets being sent.

The ID generater is per-flow, therefore I don't think this is an
issue.

And if it is an issue, then it exists for fragmented traffic on
every machine on the planet.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-24  0:53                           ` Tom Herbert
@ 2016-02-24 17:30                             ` Edward Cree
  0 siblings, 0 replies; 47+ messages in thread
From: Edward Cree @ 2016-02-24 17:30 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Rick Jones, Jesse Gross, Alex Duyck,
	Linux Kernel Network Developers, Alexander Duyck

On 24/02/16 00:53, Tom Herbert wrote:
> That's an interesting idea. This should work in IPv6 now and nearly
> all encapsulation protocols (GRE w/ csum doesn't work this way for
> instance)
You mean GRE with sequence numbers?  csum should be fine, it'sjust the
usual LCO setup (i.e. only depends on headers).

-Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
  2016-02-23 16:47                 ` Tom Herbert
  2016-02-23 17:20                   ` Rick Jones
  2016-02-23 17:31                   ` Jesse Gross
@ 2016-02-25 20:14                   ` David Miller
  2 siblings, 0 replies; 47+ messages in thread
From: David Miller @ 2016-02-25 20:14 UTC (permalink / raw)
  To: tom; +Cc: ecree, jesse, aduyck, netdev, alexander.duyck

From: Tom Herbert <tom@herbertland.com>
Date: Tue, 23 Feb 2016 08:47:30 -0800

> Right, GRO should probably not coalesce packets with non-zero IP
> identifiers due to the loss of information.

If they are monotonically increasing, which is the only case worth
caring about, it absolutely should.

Because that can be done without any loss of information.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-02-20 19:51           ` Tom Herbert
  2016-02-23  3:31             ` Jesse Gross
@ 2016-03-11 19:20             ` Edward Cree
  2016-03-11 19:57               ` Tom Herbert
  2016-03-11 20:22               ` David Miller
  1 sibling, 2 replies; 47+ messages in thread
From: Edward Cree @ 2016-03-11 19:20 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers

On 20/02/16 19:51, Tom Herbert wrote:
> Right. To use LCO with TSO we would need to ensure that all packets
> are the same size so that the UDP length field and thus checksum are
> constant for all created segments. But this property this would also
> make any payload lengths in headers constant for all packets so that
> the only fields that need be set per generated packet would be the TCP
> sequence number and checksum. This simplifying assumption could be
> used to make a very protocol-generic GSO/TSO (up to the transport
> header)!
>
> Conceptually, a device would just need to know the start of the
> packet, the offset of the transport header, and the size of each
> segment. Any bits from the start of the packet to the beginning of the
> transport header are just copied to each segment, so any combination
> of encapsulation/network protocols is  supported as long as they are
> constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for
> needing TSO support).
Tom,

Are you planning to / working on implementing this?  If not, I might have a
crack at it; I've talked to our firmware guys and (provisionally) we think
we can support it in current sfc hardware.
Or were there any blocking problems raised in the thread?  My understanding
of the IP ID issue was that it only matters for the inner frame, because
the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
have missed something.

-Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-03-11 19:20             ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree
@ 2016-03-11 19:57               ` Tom Herbert
  2016-03-11 19:59                 ` Edward Cree
  2016-03-11 20:22               ` David Miller
  1 sibling, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-03-11 19:57 UTC (permalink / raw)
  To: Edward Cree; +Cc: Linux Kernel Network Developers

On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 20/02/16 19:51, Tom Herbert wrote:
>> Right. To use LCO with TSO we would need to ensure that all packets
>> are the same size so that the UDP length field and thus checksum are
>> constant for all created segments. But this property this would also
>> make any payload lengths in headers constant for all packets so that
>> the only fields that need be set per generated packet would be the TCP
>> sequence number and checksum. This simplifying assumption could be
>> used to make a very protocol-generic GSO/TSO (up to the transport
>> header)!
>>
>> Conceptually, a device would just need to know the start of the
>> packet, the offset of the transport header, and the size of each
>> segment. Any bits from the start of the packet to the beginning of the
>> transport header are just copied to each segment, so any combination
>> of encapsulation/network protocols is  supported as long as they are
>> constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for
>> needing TSO support).
> Tom,
>
> Are you planning to / working on implementing this?  If not, I might have a
> crack at it; I've talked to our firmware guys and (provisionally) we think
> we can support it in current sfc hardware.
> Or were there any blocking problems raised in the thread?  My understanding
> of the IP ID issue was that it only matters for the inner frame, because
> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
> have missed something.
>
Right, then the interface would need to just include the offset of the
IP ID. But doesn't this break using LCO with GSO though-- i.e. the
outer checksum and inner checksum still need to be updated per packet
so we need to tell device where outer checksum(s) is.

Thanks,
Tom

> -Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-03-11 19:57               ` Tom Herbert
@ 2016-03-11 19:59                 ` Edward Cree
  2016-03-11 20:16                   ` Tom Herbert
  0 siblings, 1 reply; 47+ messages in thread
From: Edward Cree @ 2016-03-11 19:59 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers

On 11/03/16 19:57, Tom Herbert wrote:
> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote:
>> Tom,
>>
>> Are you planning to / working on implementing this?  If not, I might have a
>> crack at it; I've talked to our firmware guys and (provisionally) we think
>> we can support it in current sfc hardware.
>> Or were there any blocking problems raised in the thread?  My understanding
>> of the IP ID issue was that it only matters for the inner frame, because
>> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
>> have missed something.
>>
> Right, then the interface would need to just include the offset of the
> IP ID. But doesn't this break using LCO with GSO though-- i.e. the
> outer checksum and inner checksum still need to be updated per packet
> so we need to tell device where outer checksum(s) is.
No, outer checksum shouldn't change: IP ID is protected by inner IP header
checksum, which device will edit.  No?

-Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-03-11 19:59                 ` Edward Cree
@ 2016-03-11 20:16                   ` Tom Herbert
  2016-03-11 20:24                     ` Edward Cree
  0 siblings, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-03-11 20:16 UTC (permalink / raw)
  To: Edward Cree; +Cc: Linux Kernel Network Developers

On Fri, Mar 11, 2016 at 11:59 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 11/03/16 19:57, Tom Herbert wrote:
>> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> Tom,
>>>
>>> Are you planning to / working on implementing this?  If not, I might have a
>>> crack at it; I've talked to our firmware guys and (provisionally) we think
>>> we can support it in current sfc hardware.
>>> Or were there any blocking problems raised in the thread?  My understanding
>>> of the IP ID issue was that it only matters for the inner frame, because
>>> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
>>> have missed something.
>>>
>> Right, then the interface would need to just include the offset of the
>> IP ID. But doesn't this break using LCO with GSO though-- i.e. the
>> outer checksum and inner checksum still need to be updated per packet
>> so we need to tell device where outer checksum(s) is.
> No, outer checksum shouldn't change: IP ID is protected by inner IP header
> checksum, which device will edit.  No?

Right, the interface would probably still need offset to the IPv4 hdr?

>
> -Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO
  2016-03-11 19:20             ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree
  2016-03-11 19:57               ` Tom Herbert
@ 2016-03-11 20:22               ` David Miller
  1 sibling, 0 replies; 47+ messages in thread
From: David Miller @ 2016-03-11 20:22 UTC (permalink / raw)
  To: ecree; +Cc: tom, netdev

From: Edward Cree <ecree@solarflare.com>
Date: Fri, 11 Mar 2016 19:20:41 +0000

> Are you planning to / working on implementing this?  If not, I might have a
> crack at it; I've talked to our firmware guys and (provisionally) we think
> we can support it in current sfc hardware.
> Or were there any blocking problems raised in the thread?  My understanding
> of the IP ID issue was that it only matters for the inner frame, because
> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
> have missed something.

I've thought a lot more about the IP ID issue, and I am now starting
to learn towards allowing it to be set to zero for "DF" packets.

Considering what we gain in return, not working optimally with an
ancient SLIP header compression implementation is a small loss.

Any other opinions?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-03-11 20:16                   ` Tom Herbert
@ 2016-03-11 20:24                     ` Edward Cree
  2016-03-11 21:09                       ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Edward Cree @ 2016-03-11 20:24 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers

On 11/03/16 20:16, Tom Herbert wrote:
> On Fri, Mar 11, 2016 at 11:59 AM, Edward Cree <ecree@solarflare.com> wrote:
>> On 11/03/16 19:57, Tom Herbert wrote:
>>> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote:
>>>> Tom,
>>>>
>>>> Are you planning to / working on implementing this?  If not, I might have a
>>>> crack at it; I've talked to our firmware guys and (provisionally) we think
>>>> we can support it in current sfc hardware.
>>>> Or were there any blocking problems raised in the thread?  My understanding
>>>> of the IP ID issue was that it only matters for the inner frame, because
>>>> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
>>>> have missed something.
>>>>
>>> Right, then the interface would need to just include the offset of the
>>> IP ID. But doesn't this break using LCO with GSO though-- i.e. the
>>> outer checksum and inner checksum still need to be updated per packet
>>> so we need to tell device where outer checksum(s) is.
>> No, outer checksum shouldn't change: IP ID is protected by inner IP header
>> checksum, which device will edit.  No?
> Right, the interface would probably still need offset to the IPv4 hdr?
Yes; I'm assuming the interface could just be "offset to inner IP header",
and the hardware knows well enough what IP and TCP headers look like that
it can figure out the rest (including skipping over options if e.g. ihl>5).
So, do you want to try and implement it or shall I?

-Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-03-11 20:24                     ` Edward Cree
@ 2016-03-11 21:09                       ` Alexander Duyck
  2016-03-11 21:29                         ` Edward Cree
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2016-03-11 21:09 UTC (permalink / raw)
  To: Edward Cree; +Cc: Tom Herbert, Linux Kernel Network Developers

On Fri, Mar 11, 2016 at 12:24 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 11/03/16 20:16, Tom Herbert wrote:
>> On Fri, Mar 11, 2016 at 11:59 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> On 11/03/16 19:57, Tom Herbert wrote:
>>>> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote:
>>>>> Tom,
>>>>>
>>>>> Are you planning to / working on implementing this?  If not, I might have a
>>>>> crack at it; I've talked to our firmware guys and (provisionally) we think
>>>>> we can support it in current sfc hardware.
>>>>> Or were there any blocking problems raised in the thread?  My understanding
>>>>> of the IP ID issue was that it only matters for the inner frame, because
>>>>> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
>>>>> have missed something.
>>>>>
>>>> Right, then the interface would need to just include the offset of the
>>>> IP ID. But doesn't this break using LCO with GSO though-- i.e. the
>>>> outer checksum and inner checksum still need to be updated per packet
>>>> so we need to tell device where outer checksum(s) is.
>>> No, outer checksum shouldn't change: IP ID is protected by inner IP header
>>> checksum, which device will edit.  No?
>> Right, the interface would probably still need offset to the IPv4 hdr?
> Yes; I'm assuming the interface could just be "offset to inner IP header",
> and the hardware knows well enough what IP and TCP headers look like that
> it can figure out the rest (including skipping over options if e.g. ihl>5).
> So, do you want to try and implement it or shall I?

I've already started looking into this and was waiting for feedback
from Dave about the IPv4 ID issue which is looks like he is okay with
a static ID value as long as the DF bit is set.

The only real issue with the "generic" TSO is that it isn't going to
be so generic.  We have different devices that will support different
levels of stuff.  For example the ixgbe drivers will need to treat the
outer tunnel header as one giant L2 header.  As a result we will need
to populate all the fields in the outer header including the outer IP
ID, checksum, udp->len, and UDP or GRE checksum if requested.  For
i40e I think this gets a bit simpler as they already handle the outer
IPv4 ID and checksum.  I think there we may need to only populate the
checksum for it to work out correctly.  As such I may look at coming
up with a number of functions so that we can mix and match based on
what is needed in order to assemble a partially segmented frame.

The other issue I am working on at the moment to enable all this is to
fix the differents between csum_tcpudp_magic and csum_ipv6_magic in
terms of handling packet lengths greater than 65535.  Currently we are
messing up the checksum in relation to IPv6 since we are using the
truncated uh->len value.  I'll be submitting some patches later today
that will hopefully get that fixed and that in turn should make the
rest of the segmentation work easier.

- Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-03-11 21:09                       ` Alexander Duyck
@ 2016-03-11 21:29                         ` Edward Cree
  2016-03-11 22:31                           ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Edward Cree @ 2016-03-11 21:29 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Tom Herbert, Linux Kernel Network Developers

On 11/03/16 21:09, Alexander Duyck wrote:
> The only real issue with the "generic" TSO is that it isn't going to
> be so generic.  We have different devices that will support different
> levels of stuff.  For example the ixgbe drivers will need to treat the
> outer tunnel header as one giant L2 header.  As a result we will need
> to populate all the fields in the outer header including the outer IP
> ID, checksum, udp->len, and UDP or GRE checksum if requested.  For
> i40e I think this gets a bit simpler as they already handle the outer
> IPv4 ID and checksum.  I think there we may need to only populate the
> checksum for it to work out correctly.  As such I may look at coming
> up with a number of functions so that we can mix and match based on
> what is needed in order to assemble a partially segmented frame.
AIUI, the point of the design is that we _can_ populate everything,
because we're keeping lengths and outer IP ID fixed, so outer checksums
stay the same and the outer tunnel header _is_ just one giant L2 header
which is bit-for-bit identical for each generated segment.  So every
devicegets to just be dumb and treat it as opaque.
> The other issue I am working on at the moment to enable all this is to
> fix the differents between csum_tcpudp_magic and csum_ipv6_magic in
> terms of handling packet lengths greater than 65535.  Currently we are
> messing up the checksum in relation to IPv6 since we are using the
> truncated uh->len value.  I'll be submitting some patches later today
> that will hopefully get that fixed and that in turn should make the
> rest of the segmentation work easier.
Again, in the superpacket we want to calculate the checksum based on the
subsegment length, rather than the length of the superpacket.  The idea
is to create the header for an MSS-sized segment, then follow it with an
inner IP & TCP header, and n*MSS bytes of payload.  (This of course
produces a superpacket that's not what you'd send over a link with a 64k
MTU, unlike how we do it today.)
Then hw just chops up the payload, fixes up the inner headers, and glues
the "L2" header on each packet.

-Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-03-11 21:29                         ` Edward Cree
@ 2016-03-11 22:31                           ` Alexander Duyck
  2016-03-11 22:55                             ` Tom Herbert
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2016-03-11 22:31 UTC (permalink / raw)
  To: Edward Cree; +Cc: Tom Herbert, Linux Kernel Network Developers

On Fri, Mar 11, 2016 at 1:29 PM, Edward Cree <ecree@solarflare.com> wrote:
> On 11/03/16 21:09, Alexander Duyck wrote:
>> The only real issue with the "generic" TSO is that it isn't going to
>> be so generic.  We have different devices that will support different
>> levels of stuff.  For example the ixgbe drivers will need to treat the
>> outer tunnel header as one giant L2 header.  As a result we will need
>> to populate all the fields in the outer header including the outer IP
>> ID, checksum, udp->len, and UDP or GRE checksum if requested.  For
>> i40e I think this gets a bit simpler as they already handle the outer
>> IPv4 ID and checksum.  I think there we may need to only populate the
>> checksum for it to work out correctly.  As such I may look at coming
>> up with a number of functions so that we can mix and match based on
>> what is needed in order to assemble a partially segmented frame.
> AIUI, the point of the design is that we _can_ populate everything,
> because we're keeping lengths and outer IP ID fixed, so outer checksums
> stay the same and the outer tunnel header _is_ just one giant L2 header
> which is bit-for-bit identical for each generated segment.  So every
> devicegets to just be dumb and treat it as opaque.

This works so long as the device isn't trying to do anything like
insert VLAN tags.  Then I think we might have an issue since we don't
want to confuse the device and have it trying to insert the tag on the
inner frame's Ethernet header.

I suspect we may have differing levels of "dumb" that we have to deal
with.  That is all I am saying.  By default we could just populate all
of the length and checksum fields in the outer header, we would just
have to be consistent about what is provided then.  In addition there
will be the matter of sorting out the IP ID bits.  For example some of
the i40e parts support tunnel offloads, but not tunnel offloads with
checksums enabled.  I suspect those parts will end up wanting to
handle the outer IP header and UDP length values.  As a result there
trying to do a "dumb" send may result in us really messing up the IP
ID values if we don't take steps to make it a bit smarter.

>> The other issue I am working on at the moment to enable all this is to
>> fix the differents between csum_tcpudp_magic and csum_ipv6_magic in
>> terms of handling packet lengths greater than 65535.  Currently we are
>> messing up the checksum in relation to IPv6 since we are using the
>> truncated uh->len value.  I'll be submitting some patches later today
>> that will hopefully get that fixed and that in turn should make the
>> rest of the segmentation work easier.
> Again, in the superpacket we want to calculate the checksum based on the
> subsegment length, rather than the length of the superpacket.  The idea
> is to create the header for an MSS-sized segment, then follow it with an
> inner IP & TCP header, and n*MSS bytes of payload.  (This of course
> produces a superpacket that's not what you'd send over a link with a 64k
> MTU, unlike how we do it today.)

The question is at what point do we do the chopping.  Should we be
doing this in the drivers or somewhere higher in the stack like we do
for standard GSO segmentation.  I would think we would need to add
another bit that says we can do GSO with custom outer headers since I
can see VLANs being a possible issue otherwise.

> Then hw just chops up the payload, fixes up the inner headers, and glues
> the "L2" header on each packet.

Yea, it sounds really straight forward and easy.  It isn't till you
start digging into the actual code that it gets a bit hairy.

What this effectively is is another form of TSO where each driver will
want to do things a little differently.  Alot of it has to do with the
fact that this is kind of a nasty hack that we are trying to add since
many devices won't like the fact that we are lying about the size of
our actual L2 header so things like VLAN tag insertion are going to
end up blowing back on us.

Really my preference in the case of ixgbe would have been to let the
hardware update the outer IP header and the inner TCP header and then
do the UDP and inner IP header as the static headers.  That way we
could still theoretically support fragmentation on the outer headers
which last I knew is a very real possibility since the DF bit is not
set on the outer headers for VXLAN I believe.

- Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-03-11 22:31                           ` Alexander Duyck
@ 2016-03-11 22:55                             ` Tom Herbert
  2016-03-12  5:40                               ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-03-11 22:55 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Edward Cree, Linux Kernel Network Developers

On Fri, Mar 11, 2016 at 2:31 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Fri, Mar 11, 2016 at 1:29 PM, Edward Cree <ecree@solarflare.com> wrote:
>> On 11/03/16 21:09, Alexander Duyck wrote:
>>> The only real issue with the "generic" TSO is that it isn't going to
>>> be so generic.  We have different devices that will support different
>>> levels of stuff.  For example the ixgbe drivers will need to treat the
>>> outer tunnel header as one giant L2 header.  As a result we will need
>>> to populate all the fields in the outer header including the outer IP
>>> ID, checksum, udp->len, and UDP or GRE checksum if requested.  For
>>> i40e I think this gets a bit simpler as they already handle the outer
>>> IPv4 ID and checksum.  I think there we may need to only populate the
>>> checksum for it to work out correctly.  As such I may look at coming
>>> up with a number of functions so that we can mix and match based on
>>> what is needed in order to assemble a partially segmented frame.
>> AIUI, the point of the design is that we _can_ populate everything,
>> because we're keeping lengths and outer IP ID fixed, so outer checksums
>> stay the same and the outer tunnel header _is_ just one giant L2 header
>> which is bit-for-bit identical for each generated segment.  So every
>> devicegets to just be dumb and treat it as opaque.
>
> This works so long as the device isn't trying to do anything like
> insert VLAN tags.  Then I think we might have an issue since we don't
> want to confuse the device and have it trying to insert the tag on the
> inner frame's Ethernet header.
>
In Edward's giant L2 header mode, couldn't VLAN tags just be part of that?

> I suspect we may have differing levels of "dumb" that we have to deal
> with.  That is all I am saying.  By default we could just populate all
> of the length and checksum fields in the outer header, we would just
> have to be consistent about what is provided then.  In addition there
> will be the matter of sorting out the IP ID bits.  For example some of
> the i40e parts support tunnel offloads, but not tunnel offloads with
> checksums enabled.  I suspect those parts will end up wanting to
> handle the outer IP header and UDP length values.  As a result there
> trying to do a "dumb" send may result in us really messing up the IP
> ID values if we don't take steps to make it a bit smarter.
>
>>> The other issue I am working on at the moment to enable all this is to
>>> fix the differents between csum_tcpudp_magic and csum_ipv6_magic in
>>> terms of handling packet lengths greater than 65535.  Currently we are
>>> messing up the checksum in relation to IPv6 since we are using the
>>> truncated uh->len value.  I'll be submitting some patches later today
>>> that will hopefully get that fixed and that in turn should make the
>>> rest of the segmentation work easier.
>> Again, in the superpacket we want to calculate the checksum based on the
>> subsegment length, rather than the length of the superpacket.  The idea
>> is to create the header for an MSS-sized segment, then follow it with an
>> inner IP & TCP header, and n*MSS bytes of payload.  (This of course
>> produces a superpacket that's not what you'd send over a link with a 64k
>> MTU, unlike how we do it today.)
>
> The question is at what point do we do the chopping.  Should we be
> doing this in the drivers or somewhere higher in the stack like we do
> for standard GSO segmentation.  I would think we would need to add
> another bit that says we can do GSO with custom outer headers since I
> can see VLANs being a possible issue otherwise.
>
>> Then hw just chops up the payload, fixes up the inner headers, and glues
>> the "L2" header on each packet.
>
> Yea, it sounds really straight forward and easy.  It isn't till you
> start digging into the actual code that it gets a bit hairy.
>
> What this effectively is is another form of TSO where each driver will
> want to do things a little differently.  Alot of it has to do with the
> fact that this is kind of a nasty hack that we are trying to add since
> many devices won't like the fact that we are lying about the size of
> our actual L2 header so things like VLAN tag insertion are going to
> end up blowing back on us.
>
Right, the point is that we're trying to get out of the model where
every driver/device implements TSO differently, supports ad hoc
protocols, etc. Do you see any other common invasive technique that we
need to deal with other than VLAN insertion and IP ID?

> Really my preference in the case of ixgbe would have been to let the
> hardware update the outer IP header and the inner TCP header and then
> do the UDP and inner IP header as the static headers.  That way we
> could still theoretically support fragmentation on the outer headers
> which last I knew is a very real possibility since the DF bit is not
> set on the outer headers for VXLAN I believe.
>
> - Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)
  2016-03-11 22:55                             ` Tom Herbert
@ 2016-03-12  5:40                               ` Alexander Duyck
  2016-03-14 10:26                                 ` Generic TSO Edward Cree
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2016-03-12  5:40 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Edward Cree, Linux Kernel Network Developers

On Fri, Mar 11, 2016 at 2:55 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Fri, Mar 11, 2016 at 2:31 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Fri, Mar 11, 2016 at 1:29 PM, Edward Cree <ecree@solarflare.com> wrote:
>>> On 11/03/16 21:09, Alexander Duyck wrote:
>>>> The only real issue with the "generic" TSO is that it isn't going to
>>>> be so generic.  We have different devices that will support different
>>>> levels of stuff.  For example the ixgbe drivers will need to treat the
>>>> outer tunnel header as one giant L2 header.  As a result we will need
>>>> to populate all the fields in the outer header including the outer IP
>>>> ID, checksum, udp->len, and UDP or GRE checksum if requested.  For
>>>> i40e I think this gets a bit simpler as they already handle the outer
>>>> IPv4 ID and checksum.  I think there we may need to only populate the
>>>> checksum for it to work out correctly.  As such I may look at coming
>>>> up with a number of functions so that we can mix and match based on
>>>> what is needed in order to assemble a partially segmented frame.
>>> AIUI, the point of the design is that we _can_ populate everything,
>>> because we're keeping lengths and outer IP ID fixed, so outer checksums
>>> stay the same and the outer tunnel header _is_ just one giant L2 header
>>> which is bit-for-bit identical for each generated segment.  So every
>>> devicegets to just be dumb and treat it as opaque.
>>
>> This works so long as the device isn't trying to do anything like
>> insert VLAN tags.  Then I think we might have an issue since we don't
>> want to confuse the device and have it trying to insert the tag on the
>> inner frame's Ethernet header.
>>
> In Edward's giant L2 header mode, couldn't VLAN tags just be part of that?

The problem is things like VFs which aren't allowed to insert their
own tags.  Having them try to lie about where the network header
actually starts may trigger things like anti-spoof events.

>> I suspect we may have differing levels of "dumb" that we have to deal
>> with.  That is all I am saying.  By default we could just populate all
>> of the length and checksum fields in the outer header, we would just
>> have to be consistent about what is provided then.  In addition there
>> will be the matter of sorting out the IP ID bits.  For example some of
>> the i40e parts support tunnel offloads, but not tunnel offloads with
>> checksums enabled.  I suspect those parts will end up wanting to
>> handle the outer IP header and UDP length values.  As a result there
>> trying to do a "dumb" send may result in us really messing up the IP
>> ID values if we don't take steps to make it a bit smarter.
>>
>>>> The other issue I am working on at the moment to enable all this is to
>>>> fix the differents between csum_tcpudp_magic and csum_ipv6_magic in
>>>> terms of handling packet lengths greater than 65535.  Currently we are
>>>> messing up the checksum in relation to IPv6 since we are using the
>>>> truncated uh->len value.  I'll be submitting some patches later today
>>>> that will hopefully get that fixed and that in turn should make the
>>>> rest of the segmentation work easier.
>>> Again, in the superpacket we want to calculate the checksum based on the
>>> subsegment length, rather than the length of the superpacket.  The idea
>>> is to create the header for an MSS-sized segment, then follow it with an
>>> inner IP & TCP header, and n*MSS bytes of payload.  (This of course
>>> produces a superpacket that's not what you'd send over a link with a 64k
>>> MTU, unlike how we do it today.)
>>
>> The question is at what point do we do the chopping.  Should we be
>> doing this in the drivers or somewhere higher in the stack like we do
>> for standard GSO segmentation.  I would think we would need to add
>> another bit that says we can do GSO with custom outer headers since I
>> can see VLANs being a possible issue otherwise.
>>
>>> Then hw just chops up the payload, fixes up the inner headers, and glues
>>> the "L2" header on each packet.
>>
>> Yea, it sounds really straight forward and easy.  It isn't till you
>> start digging into the actual code that it gets a bit hairy.
>>
>> What this effectively is is another form of TSO where each driver will
>> want to do things a little differently.  Alot of it has to do with the
>> fact that this is kind of a nasty hack that we are trying to add since
>> many devices won't like the fact that we are lying about the size of
>> our actual L2 header so things like VLAN tag insertion are going to
>> end up blowing back on us.
>>
> Right, the point is that we're trying to get out of the model where
> every driver/device implements TSO differently, supports ad hoc
> protocols, etc. Do you see any other common invasive technique that we
> need to deal with other than VLAN insertion and IP ID?

Well that is the thing.  Before we can actually start tinkering with
the outer header we probably need to make sure we set the DF bit and
that it would be honored on the outer headers for IPv4.  I don't
believe any of the tunnels are currently doing that so repeating the
IP ID would be the worst possible scenario until that is resolved
since VXLAN tunneled frames can be fragmented while TCP frames cannot
so we really shouldn't be repeating IP IDs for the outer headers.

- Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO
  2016-03-12  5:40                               ` Alexander Duyck
@ 2016-03-14 10:26                                 ` Edward Cree
  2016-03-14 10:32                                   ` Edward Cree
  0 siblings, 1 reply; 47+ messages in thread
From: Edward Cree @ 2016-03-14 10:26 UTC (permalink / raw)
  To: Alexander Duyck, Tom Herbert; +Cc: Linux Kernel Network Developers

On 12/03/16 05:40, Alexander Duyck wrote:
> Well that is the thing.  Before we can actually start tinkering with
> the outer header we probably need to make sure we set the DF bit and
> that it would be honored on the outer headers for IPv4.  I don't
> believe any of the tunnels are currently doing that so repeating the
> IP ID would be the worst possible scenario until that is resolved
> since VXLAN tunneled frames can be fragmented while TCP frames cannot
> so we really shouldn't be repeating IP IDs for the outer headers.
So how do we progress with that?  I'm presuming it's not as simple as
just patching the tunnel drivers to set DF if the inner packet has it,
as that could break existing setups.  (I've heard that "but they're
already broken anyway" is not usually an acceptable argument.)  Some
sort of configuration option on the tunnel (like we do with udpcsum)?

Fortunately, with the design I'm currently planning on, a tunnel
driver could just set a flag in the SKB to say "unsafe for generic-
TSO", and we'd just send out the first segment normally and fall
back to regular software segmentation.

-Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO
  2016-03-14 10:26                                 ` Generic TSO Edward Cree
@ 2016-03-14 10:32                                   ` Edward Cree
  2016-03-14 15:59                                     ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Edward Cree @ 2016-03-14 10:32 UTC (permalink / raw)
  To: Alexander Duyck, Tom Herbert; +Cc: Linux Kernel Network Developers

On 14/03/16 10:26, Edward Cree wrote:
> On 12/03/16 05:40, Alexander Duyck wrote:
>> Well that is the thing.  Before we can actually start tinkering with
>> the outer header we probably need to make sure we set the DF bit and
>> that it would be honored on the outer headers for IPv4.  I don't
>> believe any of the tunnels are currently doing that so repeating the
>> IP ID would be the worst possible scenario until that is resolved
>> since VXLAN tunneled frames can be fragmented while TCP frames cannot
>> so we really shouldn't be repeating IP IDs for the outer headers.
> So how do we progress with that?  I'm presuming it's not as simple as
> just patching the tunnel drivers to set DF if the inner packet has it,
> as that could break existing setups.  (I've heard that "but they're
> already broken anyway" is not usually an acceptable argument.)  Some
> sort of configuration option on the tunnel (like we do with udpcsum)?
...and immediately I find out it already exists.  (I guess I should have
looked there first!)
>From drivers/net/vxlan.c:2001:
> else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
>             df = htons(IP_DF);

-Ed

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Generic TSO
  2016-03-14 10:32                                   ` Edward Cree
@ 2016-03-14 15:59                                     ` Alexander Duyck
  0 siblings, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2016-03-14 15:59 UTC (permalink / raw)
  To: Edward Cree; +Cc: Tom Herbert, Linux Kernel Network Developers

On Mon, Mar 14, 2016 at 3:32 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 14/03/16 10:26, Edward Cree wrote:
>> On 12/03/16 05:40, Alexander Duyck wrote:
>>> Well that is the thing.  Before we can actually start tinkering with
>>> the outer header we probably need to make sure we set the DF bit and
>>> that it would be honored on the outer headers for IPv4.  I don't
>>> believe any of the tunnels are currently doing that so repeating the
>>> IP ID would be the worst possible scenario until that is resolved
>>> since VXLAN tunneled frames can be fragmented while TCP frames cannot
>>> so we really shouldn't be repeating IP IDs for the outer headers.
>> So how do we progress with that?  I'm presuming it's not as simple as
>> just patching the tunnel drivers to set DF if the inner packet has it,
>> as that could break existing setups.  (I've heard that "but they're
>> already broken anyway" is not usually an acceptable argument.)  Some
>> sort of configuration option on the tunnel (like we do with udpcsum)?
> ...and immediately I find out it already exists.  (I guess I should have
> looked there first!)
> From drivers/net/vxlan.c:2001:
>> else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
>>             df = htons(IP_DF);

I'm still not a fan of trying to freeze the outer IP header.  I think
it should be the one that should have the IP ID increment while the
inner IP header be the one that is frozen.  Maybe that is where we can
differ per device.  I would be okay with the outer tunnel headers and
inner IP header being frozen on ixgbe which will be needed in order to
compute outer UDP checksum anyway.  Then we could leave it up to the
driver's discretion as to if the outer header has the IP ID that
increments or the inner header.

- Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2016-03-14 15:59 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck
2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck
2016-02-19 20:28   ` Tom Herbert
2016-02-19 19:26 ` [net-next PATCH 2/2] VXLAN: " Alexander Duyck
2016-02-19 20:27   ` Tom Herbert
2016-02-19 21:36     ` Jesse Gross
2016-02-19 21:53 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum " Jesse Gross
2016-02-19 23:10   ` Alex Duyck
2016-02-20  0:08     ` Jesse Gross
2016-02-20  0:14       ` Tom Herbert
2016-02-20  2:18         ` Jesse Gross
2016-02-20 19:51           ` Tom Herbert
2016-02-23  3:31             ` Jesse Gross
2016-02-23 15:18               ` Edward Cree
2016-02-23 16:47                 ` Tom Herbert
2016-02-23 17:20                   ` Rick Jones
2016-02-23 17:38                     ` Edward Cree
2016-02-23 18:08                       ` David Miller
2016-02-23 20:20                         ` Edward Cree
2016-02-23 23:11                           ` David Miller
2016-02-24  0:53                           ` Tom Herbert
2016-02-24 17:30                             ` Edward Cree
2016-02-23 18:11                       ` Tom Herbert
2016-02-23 17:31                   ` Jesse Gross
2016-02-23 17:42                     ` Tom Herbert
2016-02-23 18:18                       ` Alexander Duyck
2016-02-23 18:26                       ` David Miller
2016-02-23 18:32                         ` Tom Herbert
2016-02-23 18:24                     ` David Miller
2016-02-24  9:58                       ` David Laight
2016-02-24 15:41                         ` David Miller
2016-02-25 20:14                   ` David Miller
2016-03-11 19:20             ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree
2016-03-11 19:57               ` Tom Herbert
2016-03-11 19:59                 ` Edward Cree
2016-03-11 20:16                   ` Tom Herbert
2016-03-11 20:24                     ` Edward Cree
2016-03-11 21:09                       ` Alexander Duyck
2016-03-11 21:29                         ` Edward Cree
2016-03-11 22:31                           ` Alexander Duyck
2016-03-11 22:55                             ` Tom Herbert
2016-03-12  5:40                               ` Alexander Duyck
2016-03-14 10:26                                 ` Generic TSO Edward Cree
2016-03-14 10:32                                   ` Edward Cree
2016-03-14 15:59                                     ` Alexander Duyck
2016-03-11 20:22               ` David Miller
2016-02-22  3:06 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.