* [PATCH net-next 0/8] tou: Transports over UDP - part I
@ 2016-06-16 17:51 Tom Herbert
  2016-06-16 17:51 ` [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag Tom Herbert
                   ` (11 more replies)
  0 siblings, 12 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 17:51 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

Transports over UDP is intended to encapsulate TCP and other transport
protocols directly and securely in UDP.

The goal of this work is twofold:

1) Allow applications to run their own transport layer stack (i.e. from
   userspace). This eliminates dependencies on the OS (e.g. it solves a
   major dependency issue for Facebook on clients).

2) Make transport layer headers (all of L4) invisible to the network
   so that intermediate devices can't take intrusive actions at L4. This
   will be enforced when DTLS is in use.

Note that #1 is really about running a transport stack in userspace
applications in clients, not necessarily servers. For servers we
intend to modify the kernel stack in order to leverage the existing
implementation for building scalable servers (hence these patches).

This is described in more detail in the Internet Draft:
https://tools.ietf.org/html/draft-herbert-transports-over-udp-00

In Part I we implement a straightforward encapsulation of TCP in GUE.
This implements the basic mechanics of TOU encapsulation for TCP;
however, it does not yet implement the IP addressing interactions, so
it is not yet robust in the presence of NAT. TOU is enabled per socket
with a new socket option. This implementation includes GSO, GRO, and
RCO support.
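
For illustration, configuring GUE encapsulation on a TCP socket from
userspace would look roughly like the sketch below. This is illustrative
only: GUE_PORT is a placeholder for whatever UDP port the peer listens
on, and IP_TOU_ENCAP is defined locally since it only exists in the
patched uapi headers.

	#include <stdio.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <linux/if_tunnel.h>	/* struct tou_encap, TUNNEL_ENCAP_GUE */

	#ifndef IP_TOU_ENCAP
	#define IP_TOU_ENCAP 51		/* from the patched include/uapi/linux/in.h */
	#endif

	#define GUE_PORT 6080		/* placeholder destination UDP port */

	int main(void)
	{
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		struct tou_encap te = {
			.type  = TUNNEL_ENCAP_GUE,
			.flags = 0,
			.sport = 0,	/* 0: let the stack pick a flow hash based source port */
			.dport = htons(GUE_PORT),
		};

		if (setsockopt(fd, IPPROTO_IP, IP_TOU_ENCAP, &te, sizeof(te)) < 0)
			perror("setsockopt IP_TOU_ENCAP");

		/* connect() and use the socket as usual; TCP segments are now
		 * carried in GUE over UDP on transmit.
		 */
		return 0;
	}

Setting te.type to TUNNEL_ENCAP_NONE clears the encapsulation again. For
IPv6 sockets the equivalent option is IPV6_TOU_ENCAP at the IPPROTO_IPV6
level.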

These patches also establish the baseline performance of TOU
and isolate the performance cost of UDP encapsulation. Performance
results are below.

Tested: Various cases of TOU with IPv4, IPv6 using TCP_STREAM and
TCP_RR. Also, tested IPIP for comparing TOU encapsulation to IP
tunneling.

    - IPv6 native
      1 TCP_STREAM
	8394 tps
      200 TCP_RR
	1726825 tps
	100/177/361 90/95/99% latencies

    - IPv6 TOU with RCO
      1 TCP_STREAM
	7410 tps
      200 TCP_RR
	1445603 tps
	121/211/395 90/95/99% latencies

    - IPv4 native
      1 TCP_STREAM
	8525 tps
      200 TCP_RR
	1826729 tps
	94/166/345 90/95/99% latencies

    - IPv4 TOU with RCO
      1 TCP_STREAM
	7624 tps
      200 TCP_RR
	1599642 tps
	108/190/377 90/95/99% latencies

    - IPIP with GUE and RCO
      1 TCP_STREAM
	5092 tps
      200 TCP_RR
	1276716 tps
	137/237/445 90/95/99% latencies


Tom Herbert (8):
  net: Change SKB_GSO_DODGY to be a tx_flag
  fou: Change ip_tunnel_encap to take net argument
  tou: Base infrastructure for Transport over UDP
  ipv4: Support TOU
  tcp: Support for TOU
  ipv6: Support TOU
  tcp6: Support for TOU
  tou: Support for GSO

 drivers/net/xen-netfront.c       |   2 +-
 include/linux/netdev_features.h  |   4 +-
 include/linux/netdevice.h        |   2 +-
 include/linux/skbuff.h           |   6 +-
 include/linux/virtio_net.h       |   2 +-
 include/net/fou.h                |   4 +-
 include/net/inet_sock.h          |   2 +
 include/net/ip6_tunnel.h         |   5 +-
 include/net/ip_tunnels.h         |   6 +-
 include/net/udp.h                |   2 +
 include/uapi/linux/if_tunnel.h   |  10 +++
 include/uapi/linux/in.h          |   1 +
 include/uapi/linux/in6.h         |   1 +
 net/core/dev.c                   |   2 +-
 net/core/skbuff.c                |   2 +-
 net/ipv4/Makefile                |   3 +-
 net/ipv4/af_inet.c               |   4 +
 net/ipv4/fou.c                   |  20 ++---
 net/ipv4/ip_output.c             |  44 +++++++++--
 net/ipv4/ip_sockglue.c           |   7 ++
 net/ipv4/tcp_ipv4.c              |   9 ++-
 net/ipv4/tou.c                   | 140 +++++++++++++++++++++++++++++++++
 net/ipv4/udp_offload.c           | 163 +++++++++++++++++++++++++++++++++++++--
 net/ipv6/fou6.c                  |   8 +-
 net/ipv6/inet6_connection_sock.c |  61 +++++++++++++--
 net/ipv6/ipv6_sockglue.c         |   7 ++
 net/ipv6/tcp_ipv6.c              |  11 +--
 net/ipv6/udp_offload.c           | 128 +++++++++++++++---------------
 net/packet/af_packet.c           |   2 +-
 29 files changed, 538 insertions(+), 120 deletions(-)
 create mode 100644 net/ipv4/tou.c

-- 
2.8.0.rc2


* [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
@ 2016-06-16 17:51 ` Tom Herbert
  2016-06-16 18:58   ` Alexander Duyck
  2016-06-16 17:51 ` [PATCH net-next 2/8] fou: Change ip_tunnel_encap to take net argument Tom Herbert
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 17:51 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

This replaces gso_type SKB_GSO_DODGY with a new tx_flag named
SKBTX_UNTRUSTED_SOURCE. This more generically describes the skb
being created from an untrusted source as a characteristic of an skbuff.
This also frees up one gso_type flag bit.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 drivers/net/xen-netfront.c      | 2 +-
 include/linux/netdev_features.h | 3 ++-
 include/linux/netdevice.h       | 1 -
 include/linux/skbuff.h          | 6 ++++--
 include/linux/virtio_net.h      | 2 +-
 net/core/dev.c                  | 2 +-
 net/core/skbuff.c               | 2 +-
 net/packet/af_packet.c          | 2 +-
 8 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 96ccd4e..6f5ae17 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -854,7 +854,7 @@ static int xennet_set_skb_gso(struct sk_buff *skb,
 		SKB_GSO_TCPV6;
 
 	/* Header must be checked, and gso_segs computed. */
-	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+	skb_shinfo(skb)->tx_flags |= SKBTX_UNTRUSTED_SOURCE;
 	skb_shinfo(skb)->gso_segs = 0;
 
 	return 0;
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 9c6c8ef..ab15c6a 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -37,7 +37,7 @@ enum {
 	NETIF_F_TSO_BIT			/* ... TCPv4 segmentation */
 		= NETIF_F_GSO_SHIFT,
 	NETIF_F_UFO_BIT,		/* ... UDPv4 fragmentation */
-	NETIF_F_GSO_ROBUST_BIT,		/* ... ->SKB_GSO_DODGY */
+	NETIF_F_GSO_RSVD,		/* ... Reserved */
 	NETIF_F_TSO_ECN_BIT,		/* ... TCP ECN support */
 	NETIF_F_TSO_MANGLEID_BIT,	/* ... IPV4 ID mangling allowed */
 	NETIF_F_TSO6_BIT,		/* ... TCPv6 segmentation */
@@ -57,6 +57,7 @@ enum {
 	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
 		NETIF_F_GSO_SCTP_BIT,
 
+	NETIF_F_GSO_ROBUST_BIT,		/* ... ->SKBTX_UNTRUSTED_SOURCE */
 	NETIF_F_FCOE_CRC_BIT,		/* FCoE CRC32 */
 	NETIF_F_SCTP_CRC_BIT,		/* SCTP checksum offload */
 	NETIF_F_FCOE_MTU_BIT,		/* Supports max FCoE MTU, 2158 bytes*/
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 890158e..5969028 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4023,7 +4023,6 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	/* check flags correspondence */
 	BUILD_BUG_ON(SKB_GSO_TCPV4   != (NETIF_F_TSO >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP     != (NETIF_F_UFO >> NETIF_F_GSO_SHIFT));
-	BUILD_BUG_ON(SKB_GSO_DODGY   != (NETIF_F_GSO_ROBUST >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TCP_ECN != (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TCP_FIXEDID != (NETIF_F_TSO_MANGLEID >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TCPV6   != (NETIF_F_TSO6 >> NETIF_F_GSO_SHIFT));
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dc0fca7..be34e06 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -387,6 +387,9 @@ enum {
 
 	/* generate software time stamp when entering packet scheduling */
 	SKBTX_SCHED_TSTAMP = 1 << 6,
+
+	/* skb created from untrusted source */
+	SKBTX_UNTRUSTED_SOURCE = 1 << 7,
 };
 
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
@@ -460,8 +463,7 @@ enum {
 	SKB_GSO_TCPV4 = 1 << 0,
 	SKB_GSO_UDP = 1 << 1,
 
-	/* This indicates the skb is from an untrusted source. */
-	SKB_GSO_DODGY = 1 << 2,
+	SKB_GSO_RSVD = 1 << 2,
 
 	/* This indicates the tcp segment has CWR set. */
 	SKB_GSO_TCP_ECN = 1 << 3,
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1c912f8..5814c8e 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -47,7 +47,7 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 		skb_shinfo(skb)->gso_type = gso_type;
 
 		/* Header must be checked, and gso_segs computed. */
-		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+		skb_shinfo(skb)->tx_flags |= SKBTX_UNTRUSTED_SOURCE;
 		skb_shinfo(skb)->gso_segs = 0;
 	}
 
diff --git a/net/core/dev.c b/net/core/dev.c
index b148357..3d73640 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3057,7 +3057,7 @@ static void qdisc_pkt_len_init(struct sk_buff *skb)
 		else
 			hdr_len += sizeof(struct udphdr);
 
-		if (shinfo->gso_type & SKB_GSO_DODGY)
+		if (skb_shinfo(skb)->tx_flags & SKBTX_UNTRUSTED_SOURCE)
 			gso_segs = DIV_ROUND_UP(skb->len - hdr_len,
 						shinfo->gso_size);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e7ec6d3..2126b88 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3305,11 +3305,11 @@ perform_csum_check:
 
 		/* Update type to add partial and then remove dodgy if set */
 		type |= SKB_GSO_PARTIAL;
-		type &= ~SKB_GSO_DODGY;
 
 		/* Update GSO info and prepare to start updating headers on
 		 * our way back down the stack of protocols.
 		 */
+		skb_shinfo(segs)->tx_flags &= ~SKBTX_UNTRUSTED_SOURCE;
 		skb_shinfo(segs)->gso_size = skb_shinfo(head_skb)->gso_size;
 		skb_shinfo(segs)->gso_segs = partial_segs;
 		skb_shinfo(segs)->gso_type = type;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d1f3b9e..a8f75bd 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2468,7 +2468,7 @@ static int packet_snd_vnet_gso(struct sk_buff *skb,
 	skb_shinfo(skb)->gso_type = vnet_hdr->gso_type;
 
 	/* Header must be checked, and gso_segs computed. */
-	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+	skb_shinfo(skb)->tx_flags |= SKBTX_UNTRUSTED_SOURCE;
 	skb_shinfo(skb)->gso_segs = 0;
 	return 0;
 }
-- 
2.8.0.rc2


* [PATCH net-next 2/8] fou: Change ip_tunnel_encap to take net argument
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
  2016-06-16 17:51 ` [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag Tom Herbert
@ 2016-06-16 17:51 ` Tom Herbert
  2016-06-16 17:51 ` [PATCH net-next 3/8] tou: Base infrastructure for Transport over UDP Tom Herbert
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 17:51 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

Add a struct net argument to the ip_tunnel_encap function and pass it
into the backend build_header functions. This gives the build_header
functions the correct netns so they don't have to fish for it in the
skbuff.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/net/fou.h        |  4 ++--
 include/net/ip6_tunnel.h |  5 +++--
 include/net/ip_tunnels.h |  5 +++--
 net/ipv4/fou.c           | 18 ++++++++----------
 net/ipv6/fou6.c          |  8 ++++----
 5 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/include/net/fou.h b/include/net/fou.h
index f5cc691..172995e 100644
--- a/include/net/fou.h
+++ b/include/net/fou.h
@@ -12,8 +12,8 @@ size_t fou_encap_hlen(struct ip_tunnel_encap *e);
 size_t gue_encap_hlen(struct ip_tunnel_encap *e);
 
 int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-		       u8 *protocol, __be16 *sport, int type);
+		       u8 *protocol, __be16 *sport, int type, struct net *net);
 int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-		       u8 *protocol, __be16 *sport, int type);
+		       u8 *protocol, __be16 *sport, int type, struct net *net);
 
 #endif
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 43a5a0e..4d2a39a 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -60,7 +60,7 @@ struct ip6_tnl {
 struct ip6_tnl_encap_ops {
 	size_t (*encap_hlen)(struct ip_tunnel_encap *e);
 	int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
-			    u8 *protocol, struct flowi6 *fl6);
+			    u8 *protocol, struct flowi6 *fl6, struct net *net);
 };
 
 #ifdef CONFIG_INET
@@ -110,7 +110,8 @@ static inline int ip6_tnl_encap(struct sk_buff *skb, struct ip6_tnl *t,
 	rcu_read_lock();
 	ops = rcu_dereference(ip6tun_encaps[t->encap.type]);
 	if (likely(ops && ops->build_header))
-		ret = ops->build_header(skb, &t->encap, protocol, fl6);
+		ret = ops->build_header(skb, &t->encap, protocol, fl6,
+					dev_net(t->dev));
 	rcu_read_unlock();
 
 	return ret;
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 9222678..7594132 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -258,7 +258,7 @@ void ip_tunnel_setup(struct net_device *dev, int net_id);
 struct ip_tunnel_encap_ops {
 	size_t (*encap_hlen)(struct ip_tunnel_encap *e);
 	int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
-			    u8 *protocol, struct flowi4 *fl4);
+			    u8 *protocol, struct flowi4 *fl4, struct net *net);
 };
 
 #define MAX_IPTUN_ENCAP_OPS 8
@@ -309,7 +309,8 @@ static inline int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t,
 	rcu_read_lock();
 	ops = rcu_dereference(iptun_encaps[t->encap.type]);
 	if (likely(ops && ops->build_header))
-		ret = ops->build_header(skb, &t->encap, protocol, fl4);
+		ret = ops->build_header(skb, &t->encap, protocol, fl4,
+					dev_net(t->dev));
 	rcu_read_unlock();
 
 	return ret;
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 321d57f..9cd9168 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -875,7 +875,7 @@ static void fou_build_udp(struct sk_buff *skb, struct ip_tunnel_encap *e,
 }
 
 int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-		       u8 *protocol, __be16 *sport, int type)
+		       u8 *protocol, __be16 *sport, int type, struct net *net)
 {
 	int err;
 
@@ -883,22 +883,21 @@ int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 	if (err)
 		return err;
 
-	*sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
-						skb, 0, 0, false);
+	*sport = e->sport ? : udp_flow_src_port(net, skb, 0, 0, false);
 
 	return 0;
 }
 EXPORT_SYMBOL(__fou_build_header);
 
 int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-		     u8 *protocol, struct flowi4 *fl4)
+		     u8 *protocol, struct flowi4 *fl4, struct net *net)
 {
 	int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
 						       SKB_GSO_UDP_TUNNEL;
 	__be16 sport;
 	int err;
 
-	err = __fou_build_header(skb, e, protocol, &sport, type);
+	err = __fou_build_header(skb, e, protocol, &sport, type, net);
 	if (err)
 		return err;
 
@@ -909,7 +908,7 @@ int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 EXPORT_SYMBOL(fou_build_header);
 
 int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-		       u8 *protocol, __be16 *sport, int type)
+		       u8 *protocol, __be16 *sport, int type, struct net *net)
 {
 	struct guehdr *guehdr;
 	size_t hdrlen, optlen = 0;
@@ -931,8 +930,7 @@ int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 		return err;
 
 	/* Get source port (based on flow hash) before skb_push */
-	*sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
-						skb, 0, 0, false);
+	*sport = e->sport ? : udp_flow_src_port(net, skb, 0, 0, false);
 
 	hdrlen = sizeof(struct guehdr) + optlen;
 
@@ -982,14 +980,14 @@ int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 EXPORT_SYMBOL(__gue_build_header);
 
 int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-		     u8 *protocol, struct flowi4 *fl4)
+		     u8 *protocol, struct flowi4 *fl4, struct net *net)
 {
 	int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
 						       SKB_GSO_UDP_TUNNEL;
 	__be16 sport;
 	int err;
 
-	err = __gue_build_header(skb, e, protocol, &sport, type);
+	err = __gue_build_header(skb, e, protocol, &sport, type, net);
 	if (err)
 		return err;
 
diff --git a/net/ipv6/fou6.c b/net/ipv6/fou6.c
index 9ea249b..e38a7b1 100644
--- a/net/ipv6/fou6.c
+++ b/net/ipv6/fou6.c
@@ -34,14 +34,14 @@ static void fou6_build_udp(struct sk_buff *skb, struct ip_tunnel_encap *e,
 }
 
 int fou6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-		      u8 *protocol, struct flowi6 *fl6)
+		      u8 *protocol, struct flowi6 *fl6, struct net *net)
 {
 	__be16 sport;
 	int err;
 	int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ?
 		SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
 
-	err = __fou_build_header(skb, e, protocol, &sport, type);
+	err = __fou_build_header(skb, e, protocol, &sport, type, net);
 	if (err)
 		return err;
 
@@ -52,14 +52,14 @@ int fou6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 EXPORT_SYMBOL(fou6_build_header);
 
 int gue6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-		      u8 *protocol, struct flowi6 *fl6)
+		      u8 *protocol, struct flowi6 *fl6, struct net *net)
 {
 	__be16 sport;
 	int err;
 	int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ?
 		SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
 
-	err = __gue_build_header(skb, e, protocol, &sport, type);
+	err = __gue_build_header(skb, e, protocol, &sport, type, net);
 	if (err)
 		return err;
 
-- 
2.8.0.rc2


* [PATCH net-next 3/8] tou: Base infrastructure for Transport over UDP
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
  2016-06-16 17:51 ` [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag Tom Herbert
  2016-06-16 17:51 ` [PATCH net-next 2/8] fou: Change ip_tunnel_encap to take net argument Tom Herbert
@ 2016-06-16 17:51 ` Tom Herbert
  2016-06-16 17:51 ` [PATCH net-next 4/8] ipv4: Support TOU Tom Herbert
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 17:51 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

Add tou.c, which implements the common setsockopt functionality. This
includes initialization and the argument structure for the setsockopt.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/net/inet_sock.h        |   2 +
 include/net/ip_tunnels.h       |   1 +
 include/uapi/linux/if_tunnel.h |  10 +++
 net/ipv4/Makefile              |   3 +-
 net/ipv4/af_inet.c             |   4 ++
 net/ipv4/tou.c                 | 140 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 159 insertions(+), 1 deletion(-)
 create mode 100644 net/ipv4/tou.c

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 012b1f9..d39f383 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -167,6 +167,7 @@ struct rtable;
  * @uc_index - Unicast outgoing device index
  * @mc_index - Multicast device index
  * @mc_list - Group array
+ * @tou_encap - Transports over UDP encapsulation
  * @cork - info to build ip hdr on each ip frag while socket is corked
  */
 struct inet_sock {
@@ -209,6 +210,7 @@ struct inet_sock {
 	__be32			mc_addr;
 	struct ip_mc_socklist __rcu	*mc_list;
 	struct inet_cork_full	cork;
+	struct ip_tunnel_encap	*tou_encap;
 };
 
 #define IPCORK_OPT	1	/* ip-options has been held in ipcork.opt */
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 7594132..e7ec9eb 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -88,6 +88,7 @@ struct ip_tunnel_encap {
 	u16			flags;
 	__be16			sport;
 	__be16			dport;
+	struct rcu_head		rcu_head;
 };
 
 struct ip_tunnel_prl_entry {
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 1046f55..d0415ae 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -71,6 +71,16 @@ enum tunnel_encap_types {
 #define TUNNEL_ENCAP_FLAG_CSUM6		(1<<1)
 #define TUNNEL_ENCAP_FLAG_REMCSUM	(1<<2)
 
+/* Structure for Transport Over UDP (TOU) encapsulation. This is used in
+ * setsockopt of inet sockets.
+ */
+struct tou_encap {
+	__u16			type; /* enum tunnel_encap_types */
+	__u16			flags;
+	__be16			sport;
+	__be16			dport;
+};
+
 /* SIT-mode i_flags */
 #define	SIT_ISATAP	0x0001
 
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 24629b6..c4349e2 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tou.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index d39e9e4..9a49376 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -120,6 +120,7 @@
 #include <linux/mroute.h>
 #endif
 #include <net/l3mdev.h>
+#include <net/tou.h>
 
 
 /* The inetsw table contains everything that inet_create needs to
@@ -1830,6 +1831,9 @@ static int __init inet_init(void)
 	/* Add UDP-Lite (RFC 3828) */
 	udplite4_register();
 
+	/* Set TOU slab cache (Transport layer encapsulation over UDP) */
+	tou_init();
+
 	ping_init();
 
 	/*
diff --git a/net/ipv4/tou.c b/net/ipv4/tou.c
new file mode 100644
index 0000000..ef9999f
--- /dev/null
+++ b/net/ipv4/tou.c
@@ -0,0 +1,140 @@
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/socket.h>
+#include <linux/skbuff.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <net/genetlink.h>
+#include <net/gue.h>
+#include <net/ip.h>
+#include <net/protocol.h>
+#include <net/udp.h>
+#include <net/udp_tunnel.h>
+#include <net/xfrm.h>
+#include <net/tou.h>
+#include <net/ip6_tunnel.h>
+#include <uapi/linux/fou.h>
+#include <uapi/linux/genetlink.h>
+
+static struct kmem_cache *tou_cachep __read_mostly;
+
+static void tou_encap_rcu_free(struct rcu_head *head)
+{
+	struct ip_tunnel_encap *e = container_of(head, struct ip_tunnel_encap,
+						 rcu_head);
+
+	kmem_cache_free(tou_cachep, e);
+}
+
+int tou_encap_setsockopt(struct sock *sk, char __user *optval, int optlen,
+			 bool is_ipv6)
+{
+	struct tou_encap te;
+	struct ip_tunnel_encap encap;
+	struct inet_sock *inet = inet_sk(sk);
+	struct ip_tunnel_encap *e = inet->tou_encap;
+	int hlen = 0, old_hlen = 0;
+
+	if (optlen < sizeof(te))
+		return -EINVAL;
+
+	if (copy_from_user(&te, optval, sizeof(te)))
+		return -EFAULT;
+
+	if (e) {
+		old_hlen = is_ipv6 ? ip6_encap_hlen(e) : ip_encap_hlen(e);
+		if (unlikely(old_hlen < 0))
+			return -EINVAL;
+	}
+
+	if (te.type == TUNNEL_ENCAP_NONE) {
+		if (e) {
+			if (unlikely(old_hlen < 0))
+				return -EINVAL;
+
+			rcu_assign_pointer(inet->tou_encap, NULL);
+			call_rcu(&e->rcu_head, tou_encap_rcu_free);
+
+			goto adjust_ext_hdr;
+		} else {
+			return 0;
+		}
+	}
+
+	memset(&encap, 0, sizeof(encap));
+	encap.type = te.type;
+	encap.sport = te.sport;
+	encap.dport = te.dport;
+	encap.flags = te.flags;
+
+	hlen = is_ipv6 ? ip6_encap_hlen(e) : ip_encap_hlen(e);
+	if (hlen < 0)
+		return hlen;
+
+	if (!e) {
+		e = kmem_cache_alloc(tou_cachep, GFP_KERNEL);
+		if (!e)
+			return -ENOMEM;
+		rcu_assign_pointer(inet->tou_encap, e);
+	}
+
+	*e = encap;
+
+adjust_ext_hdr:
+	if (inet->is_icsk) {
+		struct inet_connection_sock *icsk = inet_csk(sk);
+
+		/* For a connected socket add the overhead of encapsulation
+		 * (specifically the difference between the new encapsulation
+		 * and the old one, if present) into the external header length
+		 * and adjust the MSS.
+		 */
+		icsk->icsk_ext_hdr_len += (hlen - old_hlen);
+		icsk->icsk_sync_mss(sk, icsk->icsk_pmtu_cookie);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(tou_encap_setsockopt);
+
+int tou_encap_getsockopt(struct sock *sk, char __user *optval,
+			 int len, int __user *optlen, bool is_ipv6)
+{
+	struct tou_encap te;
+	struct inet_sock *inet = inet_sk(sk);
+	struct ip_tunnel_encap *e = inet->tou_encap;
+
+	if (len < sizeof(te))
+		return -EINVAL;
+
+	len = sizeof(te);
+
+	memset(&te, 0, sizeof(te));
+
+	if (!e) {
+		te.type = TUNNEL_ENCAP_NONE;
+	} else {
+		te.type = e->type;
+		te.sport = e->sport;
+		te.dport = e->dport;
+		te.flags = e->flags;
+	}
+
+	if (put_user(len, optlen))
+		return -EFAULT;
+
+	if (copy_to_user(optval, &te, len))
+		return -EFAULT;
+
+	return 0;
+}
+EXPORT_SYMBOL(tou_encap_getsockopt);
+
+void __init tou_init(void)
+{
+	tou_cachep = kmem_cache_create("tou_cache",
+				       sizeof(struct ip_tunnel_encap), 0,
+				       SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+}
-- 
2.8.0.rc2


* [PATCH net-next 4/8] ipv4: Support TOU
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
                   ` (2 preceding siblings ...)
  2016-06-16 17:51 ` [PATCH net-next 3/8] tou: Base infrastructure for Transport over UDP Tom Herbert
@ 2016-06-16 17:51 ` Tom Herbert
  2016-06-16 17:51 ` [PATCH net-next 5/8] tcp: Support for TOU Tom Herbert
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 17:51 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

Use the tou_encap structure in inet_sock: in the transmit path
(ip_queue_xmit), check if encapsulation is enabled and call the
build_header op if it is. Add the IP_TOU_ENCAP setsockopt for IPv4
sockets.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/uapi/linux/in.h |  1 +
 net/ipv4/ip_output.c    | 42 ++++++++++++++++++++++++++++++++++++------
 net/ipv4/ip_sockglue.c  |  7 +++++++
 net/ipv4/tou.c          |  2 +-
 4 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index eaf9491..9827bff 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -152,6 +152,7 @@ struct in_addr {
 #define MCAST_MSFILTER			48
 #define IP_MULTICAST_ALL		49
 #define IP_UNICAST_IF			50
+#define IP_TOU_ENCAP			51
 
 #define MCAST_EXCLUDE	0
 #define MCAST_INCLUDE	1
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index cbac493..11cf4de 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -78,6 +78,7 @@
 #include <linux/netfilter_bridge.h>
 #include <linux/netlink.h>
 #include <linux/tcp.h>
+#include <net/ip_tunnels.h>
 
 static int
 ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
@@ -382,11 +383,38 @@ int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
 	struct rtable *rt;
 	struct iphdr *iph;
 	int res;
+	__be16 dport, sport;
+	u8 protocol = sk->sk_protocol;
+	struct ip_tunnel_encap *e;
 
 	/* Skip all of this if the packet is already routed,
 	 * f.e. by something like SCTP.
 	 */
 	rcu_read_lock();
+
+	e = rcu_dereference(inet->tou_encap);
+	if (e) {
+		const struct ip_tunnel_encap_ops *ops;
+
+		/* Transport layer protocol over UDP encapsulation */
+		dport = e->dport;
+		sport = e->sport;
+		ops = rcu_dereference(iptun_encaps[e->type]);
+		if (likely(ops && ops->build_header)) {
+			res = ops->build_header(skb, e, &protocol,
+						(struct flowi4 *)fl,
+						sock_net(sk));
+			if (res < 0)
+				goto fail;
+		} else {
+			res = -EINVAL;
+			goto fail;
+		}
+	} else {
+		dport = inet->inet_dport;
+		sport = inet->inet_sport;
+	}
+
 	inet_opt = rcu_dereference(inet->inet_opt);
 	fl4 = &fl->u.ip4;
 	rt = skb_rtable(skb);
@@ -409,9 +437,9 @@ int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
 		 */
 		rt = ip_route_output_ports(net, fl4, sk,
 					   daddr, inet->inet_saddr,
-					   inet->inet_dport,
-					   inet->inet_sport,
-					   sk->sk_protocol,
+					   dport,
+					   sport,
+					   protocol,
 					   RT_CONN_FLAGS(sk),
 					   sk->sk_bound_dev_if);
 		if (IS_ERR(rt))
@@ -434,7 +462,7 @@ packet_routed:
 	else
 		iph->frag_off = 0;
 	iph->ttl      = ip_select_ttl(inet, &rt->dst);
-	iph->protocol = sk->sk_protocol;
+	iph->protocol = protocol;
 	ip_copy_addrs(iph, fl4);
 
 	/* Transport layer set skb->h.foo itself. */
@@ -456,10 +484,12 @@ packet_routed:
 	return res;
 
 no_route:
-	rcu_read_unlock();
 	IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	res = -EHOSTUNREACH;
+fail:
+	rcu_read_unlock();
 	kfree_skb(skb);
-	return -EHOSTUNREACH;
+	return res;
 }
 EXPORT_SYMBOL(ip_queue_xmit);
 
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 71a52f4d..0c9d3f0 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -42,6 +42,7 @@
 #include <net/transp_v6.h>
 #endif
 #include <net/ip_fib.h>
+#include <net/tou.h>
 
 #include <linux/errqueue.h>
 #include <asm/uaccess.h>
@@ -1162,6 +1163,10 @@ mc_msf_out:
 		inet->min_ttl = val;
 		break;
 
+	case IP_TOU_ENCAP:
+		err = tou_encap_setsockopt(sk, optval, optlen, false);
+		break;
+
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -1493,6 +1498,8 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
 	case IP_MINTTL:
 		val = inet->min_ttl;
 		break;
+	case IP_TOU_ENCAP:
+		return tou_encap_getsockopt(sk, optval, len, optlen, false);
 	default:
 		release_sock(sk);
 		return -ENOPROTOOPT;
diff --git a/net/ipv4/tou.c b/net/ipv4/tou.c
index ef9999f..d3069a8 100644
--- a/net/ipv4/tou.c
+++ b/net/ipv4/tou.c
@@ -69,7 +69,7 @@ int tou_encap_setsockopt(struct sock *sk, char __user *optval, int optlen,
 	encap.dport = te.dport;
 	encap.flags = te.flags;
 
-	hlen = is_ipv6 ? ip6_encap_hlen(e) : ip_encap_hlen(e);
+	hlen = is_ipv6 ? ip6_encap_hlen(&encap) : ip_encap_hlen(&encap);
 	if (hlen < 0)
 		return hlen;
 
-- 
2.8.0.rc2


* [PATCH net-next 5/8] tcp: Support for TOU
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
                   ` (3 preceding siblings ...)
  2016-06-16 17:51 ` [PATCH net-next 4/8] ipv4: Support TOU Tom Herbert
@ 2016-06-16 17:51 ` Tom Herbert
  2016-06-16 17:52 ` [PATCH net-next 6/8] ipv6: Support TOU Tom Herbert
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 17:51 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

Need to adjust the MSS to account for encapsulation overhead. This is
done by adding the encapsulation header length into icsk_ext_hdr_len.
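
For example, assuming plain GUE encapsulation with no options (an 8 byte
UDP header plus a 4 byte GUE header, 12 bytes total), on an IPv4 path
with a 1500 byte MTU the MSS drops from 1500 - 20 - 20 = 1460 bytes to
1500 - 20 - 12 - 20 = 1448 bytes once the encapsulation overhead is
reflected in icsk_ext_hdr_len.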

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 net/ipv4/tcp_ipv4.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3708de2..c344f667 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -74,6 +74,7 @@
 #include <net/xfrm.h>
 #include <net/secure_seq.h>
 #include <net/busy_poll.h>
+#include <net/tou.h>
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
@@ -205,9 +206,9 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	inet->inet_dport = usin->sin_port;
 	sk_daddr_set(sk, daddr);
 
-	inet_csk(sk)->icsk_ext_hdr_len = 0;
+	inet_csk(sk)->icsk_ext_hdr_len = tou_hdr_len(sk);
 	if (inet_opt)
-		inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
+		inet_csk(sk)->icsk_ext_hdr_len += inet_opt->opt.optlen;
 
 	tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;
 
@@ -1296,9 +1297,9 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
 	newinet->mc_index     = inet_iif(skb);
 	newinet->mc_ttl	      = ip_hdr(skb)->ttl;
 	newinet->rcv_tos      = ip_hdr(skb)->tos;
-	inet_csk(newsk)->icsk_ext_hdr_len = 0;
+	inet_csk(newsk)->icsk_ext_hdr_len = tou_hdr_len(sk);
 	if (inet_opt)
-		inet_csk(newsk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
+		inet_csk(newsk)->icsk_ext_hdr_len += inet_opt->opt.optlen;
 	newinet->inet_id = newtp->write_seq ^ jiffies;
 
 	if (!dst) {
-- 
2.8.0.rc2


* [PATCH net-next 6/8] ipv6: Support TOU
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
                   ` (4 preceding siblings ...)
  2016-06-16 17:51 ` [PATCH net-next 5/8] tcp: Support for TOU Tom Herbert
@ 2016-06-16 17:52 ` Tom Herbert
  2016-06-16 17:52 ` [PATCH net-next 7/8] tcp6: Support for TOU Tom Herbert
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 17:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

In the transmit path (inet6_csk_xmit), check if encapsulation is enabled
and call the build_header op if it is. Add the IPV6_TOU_ENCAP setsockopt
for IPv6 sockets.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/uapi/linux/in6.h         |  1 +
 net/ipv6/inet6_connection_sock.c | 58 ++++++++++++++++++++++++++++++++++------
 net/ipv6/ipv6_sockglue.c         |  7 +++++
 3 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/in6.h b/include/uapi/linux/in6.h
index 318a482..9a610c3 100644
--- a/include/uapi/linux/in6.h
+++ b/include/uapi/linux/in6.h
@@ -282,6 +282,7 @@ struct in6_flowlabel_req {
 #define IPV6_RECVORIGDSTADDR    IPV6_ORIGDSTADDR
 #define IPV6_TRANSPARENT        75
 #define IPV6_UNICAST_IF         76
+#define IPV6_TOU_ENCAP		77
 
 /*
  * Multicast Routing:
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 532c3ef..6c971bc 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -24,6 +24,7 @@
 #include <net/inet_ecn.h>
 #include <net/inet_hashtables.h>
 #include <net/ip6_route.h>
+#include <net/ip6_tunnel.h>
 #include <net/sock.h>
 #include <net/inet6_connection_sock.h>
 #include <net/sock_reuseport.h>
@@ -118,13 +119,11 @@ struct dst_entry *__inet6_csk_dst_check(struct sock *sk, u32 cookie)
 	return __sk_dst_check(sk, cookie);
 }
 
-static struct dst_entry *inet6_csk_route_socket(struct sock *sk,
-						struct flowi6 *fl6)
+static void inet6_csk_fill_flowi6(struct sock *sk,
+				  struct flowi6 *fl6)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct ipv6_pinfo *np = inet6_sk(sk);
-	struct in6_addr *final_p, final;
-	struct dst_entry *dst;
 
 	memset(fl6, 0, sizeof(*fl6));
 	fl6->flowi6_proto = sk->sk_protocol;
@@ -137,6 +136,14 @@ static struct dst_entry *inet6_csk_route_socket(struct sock *sk,
 	fl6->fl6_sport = inet->inet_sport;
 	fl6->fl6_dport = inet->inet_dport;
 	security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
+}
+
+static struct dst_entry *inet6_csk_route_socket(struct sock *sk,
+						struct flowi6 *fl6)
+{
+	struct ipv6_pinfo *np = inet6_sk(sk);
+	struct in6_addr *final_p, final;
+	struct dst_entry *dst;
 
 	rcu_read_lock();
 	final_p = fl6_update_dst(fl6, rcu_dereference(np->opt), &final);
@@ -154,20 +161,48 @@ static struct dst_entry *inet6_csk_route_socket(struct sock *sk,
 
 int inet6_csk_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl_unused)
 {
+	struct inet_sock *inet = inet_sk(sk);
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct flowi6 fl6;
 	struct dst_entry *dst;
 	int res;
+	u8 protocol = sk->sk_protocol;
+	struct ip_tunnel_encap *e;
+
+	inet6_csk_fill_flowi6(sk, &fl6);
+
+	rcu_read_lock();
+
+	e = rcu_dereference(inet->tou_encap);
+	if (e) {
+		const struct ip6_tnl_encap_ops *ops;
+
+		/* Transport layer protocol over UDP encapsulation */
+		ops = rcu_dereference(ip6tun_encaps[e->type]);
+		if (likely(ops && ops->build_header)) {
+			res = ops->build_header(skb, e, &protocol,
+						&fl6, sock_net(sk));
+			if (res < 0)
+				goto fail;
+		} else {
+			res = -EINVAL;
+			goto fail;
+		}
+
+		/* Change ports and protocol used for routing */
+		fl6.fl6_sport = e->sport;
+		fl6.fl6_dport = e->dport;
+		fl6.flowi6_proto = protocol;
+	}
 
 	dst = inet6_csk_route_socket(sk, &fl6);
 	if (IS_ERR(dst)) {
 		sk->sk_err_soft = -PTR_ERR(dst);
 		sk->sk_route_caps = 0;
-		kfree_skb(skb);
-		return PTR_ERR(dst);
+		res = PTR_ERR(dst);
+		goto fail;
 	}
 
-	rcu_read_lock();
 	skb_dst_set_noref(skb, dst);
 
 	/* Restore final destination back after routing done */
@@ -177,14 +212,21 @@ int inet6_csk_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl_unused
 		       np->tclass);
 	rcu_read_unlock();
 	return res;
+fail:
+	rcu_read_unlock();
+	kfree_skb(skb);
+	return res;
 }
 EXPORT_SYMBOL_GPL(inet6_csk_xmit);
 
 struct dst_entry *inet6_csk_update_pmtu(struct sock *sk, u32 mtu)
 {
 	struct flowi6 fl6;
-	struct dst_entry *dst = inet6_csk_route_socket(sk, &fl6);
+	struct dst_entry *dst;
+
+	inet6_csk_fill_flowi6(sk, &fl6);
 
+	dst = inet6_csk_route_socket(sk, &fl6);
 	if (IS_ERR(dst))
 		return NULL;
 	dst->ops->update_pmtu(dst, sk, NULL, mtu);
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index a9895e1..1697c0e 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -52,6 +52,7 @@
 #include <net/udplite.h>
 #include <net/xfrm.h>
 #include <net/compat.h>
+#include <net/tou.h>
 
 #include <asm/uaccess.h>
 
@@ -868,6 +869,9 @@ pref_skip_coa:
 		np->autoflowlabel = valbool;
 		retv = 0;
 		break;
+	case IPV6_TOU_ENCAP:
+		retv = tou_encap_setsockopt(sk, optval, optlen, true);
+		break;
 	}
 
 	release_sock(sk);
@@ -1310,6 +1314,9 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 		val = np->autoflowlabel;
 		break;
 
+	case IPV6_TOU_ENCAP:
+		return tou_encap_getsockopt(sk, optval, len, optlen, true);
+
 	default:
 		return -ENOPROTOOPT;
 	}
-- 
2.8.0.rc2


* [PATCH net-next 7/8] tcp6: Support for TOU
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
                   ` (5 preceding siblings ...)
  2016-06-16 17:52 ` [PATCH net-next 6/8] ipv6: Support TOU Tom Herbert
@ 2016-06-16 17:52 ` Tom Herbert
  2016-06-16 17:52 ` [PATCH net-next 8/8] tou: Support for GSO Tom Herbert
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 17:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

Need to adjust the MSS to account for encapsulation overhead. This is
done by adding the encapsulation header length into icsk_ext_hdr_len.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 net/ipv6/tcp_ipv6.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f36c2d0..eb67da7 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -62,6 +62,7 @@
 #include <net/inet_common.h>
 #include <net/secure_seq.h>
 #include <net/busy_poll.h>
+#include <net/tou.h>
 
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -210,7 +211,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 		err = tcp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin));
 
 		if (err) {
-			icsk->icsk_ext_hdr_len = exthdrlen;
+			icsk->icsk_ext_hdr_len = tou_hdr_len(sk) + exthdrlen;
 			icsk->icsk_af_ops = &ipv6_specific;
 			sk->sk_backlog_rcv = tcp_v6_do_rcv;
 #ifdef CONFIG_TCP_MD5SIG
@@ -262,9 +263,9 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	    ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr))
 		tcp_fetch_timewait_stamp(sk, dst);
 
-	icsk->icsk_ext_hdr_len = 0;
+	icsk->icsk_ext_hdr_len = tou_hdr_len(sk);
 	if (opt)
-		icsk->icsk_ext_hdr_len = opt->opt_flen +
+		icsk->icsk_ext_hdr_len += opt->opt_flen +
 					 opt->opt_nflen;
 
 	tp->rx_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
@@ -1114,9 +1115,9 @@ static struct sock *tcp_v6_syn_recv_sock(const struct sock *sk, struct sk_buff *
 		opt = ipv6_dup_options(newsk, opt);
 		RCU_INIT_POINTER(newnp->opt, opt);
 	}
-	inet_csk(newsk)->icsk_ext_hdr_len = 0;
+	inet_csk(newsk)->icsk_ext_hdr_len = tou_hdr_len(sk);
 	if (opt)
-		inet_csk(newsk)->icsk_ext_hdr_len = opt->opt_nflen +
+		inet_csk(newsk)->icsk_ext_hdr_len += opt->opt_nflen +
 						    opt->opt_flen;
 
 	tcp_ca_openreq_child(newsk, dst);
-- 
2.8.0.rc2


* [PATCH net-next 8/8] tou: Support for GSO
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
                   ` (6 preceding siblings ...)
  2016-06-16 17:52 ` [PATCH net-next 7/8] tcp6: Support for TOU Tom Herbert
@ 2016-06-16 17:52 ` Tom Herbert
  2016-06-16 18:10 ` [PATCH net-next 0/8] tou: Transports over UDP - part I Rick Jones
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 17:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

Add SKB_GSO_TOU. In udp4_ufo_fragment and udp6_ufo_fragment check for
SKB_GSO_TOU; if it is set, call skb_udp_tou_segment. skb_udp_tou_segment
is very similar to skb_udp_tunnel_segment except that we only need to
deal with the L4 headers: the encapsulation length is just the distance
from the outer UDP header (skb->transport_header) to the encapsulated
transport header (skb->inner_transport_header).

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/netdev_features.h  |   3 +-
 include/linux/netdevice.h        |   1 +
 include/linux/skbuff.h           |   2 +-
 include/net/udp.h                |   2 +
 net/ipv4/fou.c                   |   2 +
 net/ipv4/ip_output.c             |   2 +
 net/ipv4/udp_offload.c           | 163 +++++++++++++++++++++++++++++++++++++--
 net/ipv6/inet6_connection_sock.c |   3 +
 net/ipv6/udp_offload.c           | 128 +++++++++++++++---------------
 9 files changed, 237 insertions(+), 69 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index ab15c6a..ffc4e0a 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -37,7 +37,7 @@ enum {
 	NETIF_F_TSO_BIT			/* ... TCPv4 segmentation */
 		= NETIF_F_GSO_SHIFT,
 	NETIF_F_UFO_BIT,		/* ... UDPv4 fragmentation */
-	NETIF_F_GSO_RSVD,		/* ... Reserved */
+	NETIF_F_GSO_TOU_BIT,		/* ... Transports over UDP */
 	NETIF_F_TSO_ECN_BIT,		/* ... TCP ECN support */
 	NETIF_F_TSO_MANGLEID_BIT,	/* ... IPV4 ID mangling allowed */
 	NETIF_F_TSO6_BIT,		/* ... TCPv6 segmentation */
@@ -131,6 +131,7 @@ enum {
 #define NETIF_F_GSO_PARTIAL	 __NETIF_F(GSO_PARTIAL)
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_GSO_SCTP	__NETIF_F(GSO_SCTP)
+#define NETIF_F_GSO_TOU		__NETIF_F(GSO_TOU)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX	__NETIF_F(HW_VLAN_STAG_RX)
 #define NETIF_F_HW_VLAN_STAG_TX	__NETIF_F(HW_VLAN_STAG_TX)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5969028..624d169 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4023,6 +4023,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	/* check flags correspondence */
 	BUILD_BUG_ON(SKB_GSO_TCPV4   != (NETIF_F_TSO >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP     != (NETIF_F_UFO >> NETIF_F_GSO_SHIFT));
+	BUILD_BUG_ON(SKB_GSO_TOU     != (NETIF_F_GSO_TOU >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TCP_ECN != (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TCP_FIXEDID != (NETIF_F_TSO_MANGLEID >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TCPV6   != (NETIF_F_TSO6 >> NETIF_F_GSO_SHIFT));
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index be34e06..9f85a7d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -463,7 +463,7 @@ enum {
 	SKB_GSO_TCPV4 = 1 << 0,
 	SKB_GSO_UDP = 1 << 1,
 
-	SKB_GSO_RSVD = 1 << 2,
+	SKB_GSO_TOU = 1 << 2,
 
 	/* This indicates the tcp segment has CWR set. */
 	SKB_GSO_TCP_ECN = 1 << 3,
diff --git a/include/net/udp.h b/include/net/udp.h
index 8894d71..48b767f 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -262,6 +262,8 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait);
 struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 				       netdev_features_t features,
 				       bool is_ipv6);
+struct sk_buff *skb_udp_tou_segment(struct sk_buff *skb,
+				    netdev_features_t features, bool is_ipv6);
 int udp_lib_getsockopt(struct sock *sk, int level, int optname,
 		       char __user *optval, int __user *optlen);
 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 9cd9168..3cdc060 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -435,6 +435,8 @@ next_proto:
 	/* Flag this frame as already having an outer encap header */
 	NAPI_GRO_CB(skb)->is_fou = 1;
 
+	skb_set_transport_header(skb, skb_gro_offset(skb));
+
 	rcu_read_lock();
 	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[proto]);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 11cf4de..090cede 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -410,6 +410,8 @@ int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
 			res = -EINVAL;
 			goto fail;
 		}
+		skb_shinfo(skb)->gso_type |= SKB_GSO_TOU;
+		skb_set_inner_ipproto(skb, sk->sk_protocol);
 	} else {
 		dport = inet->inet_dport;
 		sport = inet->inet_sport;
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 81f253b..8e56a21 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -184,6 +184,155 @@ out_unlock:
 }
 EXPORT_SYMBOL(skb_udp_tunnel_segment);
 
+/* __skb_udp_tou_segment
+ *
+ * Handle segmentation of TOU (Transports over UDP). Note that this
+ * is very similar to __skb_udp_tunnel_segment, however here we don't need
+ * to deal with MAC or network layers. Everything is done based on transport
+ * headers only.
+ */
+static struct sk_buff *__skb_udp_tou_segment(struct sk_buff *skb,
+	netdev_features_t features,
+	struct sk_buff *(*gso_inner_segment)(struct sk_buff *skb,
+					     netdev_features_t features),
+	bool is_ipv6)
+{
+	int tnl_hlen = skb_inner_transport_header(skb) -
+		       skb_transport_header(skb);
+	bool remcsum, need_csum, offload_csum, ufo;
+	struct sk_buff *segs = ERR_PTR(-EINVAL);
+	struct udphdr *uh = udp_hdr(skb);
+	__wsum partial;
+
+	if (unlikely(!pskb_may_pull(skb, tnl_hlen)))
+		goto out;
+
+	/* Adjust partial header checksum to negate old length.
+	 * We cannot rely on the value contained in uh->len as it is
+	 * possible that the actual value exceeds the boundaries of the
+	 * 16 bit length field due to the header being added outside of an
+	 * IP or IPv6 frame that was already limited to 64K - 1.
+	 */
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL)
+		partial = (__force __wsum)uh->len;
+	else
+		partial = (__force __wsum)htonl(skb->len);
+	partial = csum_sub(csum_unfold(uh->check), partial);
+
+	/* Setup inner skb. Only the transport header is relevant */
+	skb->encapsulation = 0;
+	SKB_GSO_CB(skb)->encap_level = 0;
+	__skb_pull(skb, tnl_hlen);
+	skb_reset_transport_header(skb);
+
+	need_csum = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM);
+	skb->encap_hdr_csum = need_csum;
+
+	remcsum = !!(skb_shinfo(skb)->gso_type & SKB_GSO_TUNNEL_REMCSUM);
+	skb->remcsum_offload = remcsum;
+
+	ufo = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP);
+
+	/* Try to offload checksum if possible */
+	offload_csum = !!(need_csum &&
+			  (skb->dev->features &
+			   (is_ipv6 ? (NETIF_F_HW_CSUM | NETIF_F_IPV6_CSUM) :
+				      (NETIF_F_HW_CSUM | NETIF_F_IP_CSUM))));
+
+	features &= skb->dev->hw_enc_features;
+
+	/* The only checksum offload we care about from here on out is the
+	 * outer one so strip the existing checksum feature flags and
+	 * instead set the flag based on our outer checksum offload value.
+	 */
+	if (remcsum || ufo) {
+		features &= ~NETIF_F_CSUM_MASK;
+		if (!need_csum || offload_csum)
+			features |= NETIF_F_HW_CSUM;
+	}
+
+	/* segment inner packet. */
+	segs = gso_inner_segment(skb, features);
+	if (IS_ERR_OR_NULL(segs)) {
+		skb->encapsulation = 1;
+		skb_push(skb, tnl_hlen);
+		skb_reset_transport_header(skb);
+
+		goto out;
+	}
+
+	skb = segs;
+	do {
+		unsigned int len;
+
+		if (remcsum)
+			skb->ip_summed = CHECKSUM_NONE;
+
+		/* Adjust transport header back to UDP header */
+
+		skb->transport_header -= tnl_hlen;
+		uh = udp_hdr(skb);
+		len = skb->len - ((unsigned char *)uh - skb->data);
+
+		/* If we are only performing partial GSO the inner header
+		 * will be using a length value equal to only one MSS sized
+		 * segment instead of the entire frame.
+		 */
+		if (skb_is_gso(skb)) {
+			uh->len = htons(skb_shinfo(skb)->gso_size +
+					SKB_GSO_CB(skb)->data_offset +
+					skb->head - (unsigned char *)uh);
+		} else {
+			uh->len = htons(len);
+		}
+
+		if (!need_csum)
+			continue;
+
+		uh->check = ~csum_fold(csum_add(partial,
+				       (__force __wsum)htonl(len)));
+
+		if (skb->encapsulation || !offload_csum) {
+			uh->check = gso_make_checksum(skb, ~uh->check);
+			if (uh->check == 0)
+				uh->check = CSUM_MANGLED_0;
+		} else {
+			skb->ip_summed = CHECKSUM_PARTIAL;
+			skb->csum_start = skb_transport_header(skb) - skb->head;
+			skb->csum_offset = offsetof(struct udphdr, check);
+		}
+	} while ((skb = skb->next));
+out:
+	return segs;
+}
+
+struct sk_buff *skb_udp_tou_segment(struct sk_buff *skb,
+				    netdev_features_t features,
+				    bool is_ipv6)
+{
+	const struct net_offload **offloads;
+	const struct net_offload *ops;
+	struct sk_buff *segs = ERR_PTR(-EINVAL);
+	struct sk_buff *(*gso_inner_segment)(struct sk_buff *skb,
+					     netdev_features_t features);
+
+	rcu_read_lock();
+
+	offloads = is_ipv6 ? inet6_offloads : inet_offloads;
+	ops = rcu_dereference(offloads[skb->inner_ipproto]);
+	if (!ops || !ops->callbacks.gso_segment)
+		goto out_unlock;
+	gso_inner_segment = ops->callbacks.gso_segment;
+
+	segs = __skb_udp_tou_segment(skb, features, gso_inner_segment, is_ipv6);
+
+out_unlock:
+	rcu_read_unlock();
+
+	return segs;
+}
+EXPORT_SYMBOL(skb_udp_tou_segment);
+
 static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 					 netdev_features_t features)
 {
@@ -193,11 +342,15 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 	struct udphdr *uh;
 	struct iphdr *iph;
 
-	if (skb->encapsulation &&
-	    (skb_shinfo(skb)->gso_type &
-	     (SKB_GSO_UDP_TUNNEL|SKB_GSO_UDP_TUNNEL_CSUM))) {
-		segs = skb_udp_tunnel_segment(skb, features, false);
-		goto out;
+	if (skb->encapsulation) {
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_TOU) {
+			segs = skb_udp_tou_segment(skb, features, false);
+			goto out;
+		} else if ((skb_shinfo(skb)->gso_type &
+		    (SKB_GSO_UDP_TUNNEL | SKB_GSO_UDP_TUNNEL_CSUM))) {
+			segs = skb_udp_tunnel_segment(skb, features, false);
+			goto out;
+		}
 	}
 
 	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 6c971bc..7b3978a 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -189,6 +189,9 @@ int inet6_csk_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl_unused
 			goto fail;
 		}
 
+		skb_shinfo(skb)->gso_type |= SKB_GSO_TOU;
+		skb_set_inner_ipproto(skb, sk->sk_protocol);
+
 		/* Changing ports and protocol to be routed */
 		fl6.fl6_sport = e->sport;
 		fl6.fl6_dport = e->dport;
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index ac858c4..b53486b 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -29,6 +29,8 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 	u8 frag_hdr_sz = sizeof(struct frag_hdr);
 	__wsum csum;
 	int tnl_hlen;
+	const struct ipv6hdr *ipv6h;
+	struct udphdr *uh;
 
 	mss = skb_shinfo(skb)->gso_size;
 	if (unlikely(skb->len <= mss))
@@ -47,74 +49,76 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 		goto out;
 	}
 
-	if (skb->encapsulation && skb_shinfo(skb)->gso_type &
-	    (SKB_GSO_UDP_TUNNEL|SKB_GSO_UDP_TUNNEL_CSUM))
-		segs = skb_udp_tunnel_segment(skb, features, true);
-	else {
-		const struct ipv6hdr *ipv6h;
-		struct udphdr *uh;
-
-		if (!pskb_may_pull(skb, sizeof(struct udphdr)))
+	if (skb->encapsulation) {
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_TOU) {
+			segs = skb_udp_tou_segment(skb, features, true);
+			goto out;
+		} else if (skb_shinfo(skb)->gso_type &
+			   (SKB_GSO_UDP_TUNNEL | SKB_GSO_UDP_TUNNEL_CSUM)) {
+			segs = skb_udp_tunnel_segment(skb, features, true);
 			goto out;
-
-		/* Do software UFO. Complete and fill in the UDP checksum as HW cannot
-		 * do checksum of UDP packets sent as multiple IP fragments.
-		 */
-
-		uh = udp_hdr(skb);
-		ipv6h = ipv6_hdr(skb);
-
-		uh->check = 0;
-		csum = skb_checksum(skb, 0, skb->len, 0);
-		uh->check = udp_v6_check(skb->len, &ipv6h->saddr,
-					  &ipv6h->daddr, csum);
-		if (uh->check == 0)
-			uh->check = CSUM_MANGLED_0;
-
-		skb->ip_summed = CHECKSUM_NONE;
-
-		/* If there is no outer header we can fake a checksum offload
-		 * due to the fact that we have already done the checksum in
-		 * software prior to segmenting the frame.
-		 */
-		if (!skb->encap_hdr_csum)
-			features |= NETIF_F_HW_CSUM;
-
-		/* Check if there is enough headroom to insert fragment header. */
-		tnl_hlen = skb_tnl_header_len(skb);
-		if (skb->mac_header < (tnl_hlen + frag_hdr_sz)) {
-			if (gso_pskb_expand_head(skb, tnl_hlen + frag_hdr_sz))
-				goto out;
 		}
+	}
 
-		/* Find the unfragmentable header and shift it left by frag_hdr_sz
-		 * bytes to insert fragment header.
-		 */
-		unfrag_ip6hlen = ip6_find_1stfragopt(skb, &prevhdr);
-		nexthdr = *prevhdr;
-		*prevhdr = NEXTHDR_FRAGMENT;
-		unfrag_len = (skb_network_header(skb) - skb_mac_header(skb)) +
-			     unfrag_ip6hlen + tnl_hlen;
-		packet_start = (u8 *) skb->head + SKB_GSO_CB(skb)->mac_offset;
-		memmove(packet_start-frag_hdr_sz, packet_start, unfrag_len);
-
-		SKB_GSO_CB(skb)->mac_offset -= frag_hdr_sz;
-		skb->mac_header -= frag_hdr_sz;
-		skb->network_header -= frag_hdr_sz;
-
-		fptr = (struct frag_hdr *)(skb_network_header(skb) + unfrag_ip6hlen);
-		fptr->nexthdr = nexthdr;
-		fptr->reserved = 0;
-		if (!skb_shinfo(skb)->ip6_frag_id)
-			ipv6_proxy_select_ident(dev_net(skb->dev), skb);
-		fptr->identification = skb_shinfo(skb)->ip6_frag_id;
+	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
+		goto out;
 
-		/* Fragment the skb. ipv6 header and the remaining fields of the
-		 * fragment header are updated in ipv6_gso_segment()
-		 */
-		segs = skb_segment(skb, features);
+	/* Do software UFO. Complete and fill in the UDP checksum as HW cannot
+	 * do checksum of UDP packets sent as multiple IP fragments.
+	 */
+
+	uh = udp_hdr(skb);
+	ipv6h = ipv6_hdr(skb);
+
+	uh->check = 0;
+	csum = skb_checksum(skb, 0, skb->len, 0);
+	uh->check = udp_v6_check(skb->len, &ipv6h->saddr,
+				  &ipv6h->daddr, csum);
+	if (uh->check == 0)
+		uh->check = CSUM_MANGLED_0;
+
+	skb->ip_summed = CHECKSUM_NONE;
+
+	/* If there is no outer header we can fake a checksum offload
+	 * due to the fact that we have already done the checksum in
+	 * software prior to segmenting the frame.
+	 */
+	if (!skb->encap_hdr_csum)
+		features |= NETIF_F_HW_CSUM;
+
+	/* Check if there is enough headroom to insert fragment header. */
+	tnl_hlen = skb_tnl_header_len(skb);
+	if (skb->mac_header < (tnl_hlen + frag_hdr_sz)) {
+		if (gso_pskb_expand_head(skb, tnl_hlen + frag_hdr_sz))
+			goto out;
 	}
 
+	/* Find the unfragmentable header and shift it left by frag_hdr_sz
+	 * bytes to insert fragment header.
+	 */
+	unfrag_ip6hlen = ip6_find_1stfragopt(skb, &prevhdr);
+	nexthdr = *prevhdr;
+	*prevhdr = NEXTHDR_FRAGMENT;
+	unfrag_len = (skb_network_header(skb) - skb_mac_header(skb)) +
+		     unfrag_ip6hlen + tnl_hlen;
+	packet_start = (u8 *)skb->head + SKB_GSO_CB(skb)->mac_offset;
+	memmove(packet_start - frag_hdr_sz, packet_start, unfrag_len);
+
+	SKB_GSO_CB(skb)->mac_offset -= frag_hdr_sz;
+	skb->mac_header -= frag_hdr_sz;
+	skb->network_header -= frag_hdr_sz;
+
+	fptr = (struct frag_hdr *)(skb_network_header(skb) + unfrag_ip6hlen);
+	fptr->nexthdr = nexthdr;
+	fptr->reserved = 0;
+	if (!skb_shinfo(skb)->ip6_frag_id)
+		ipv6_proxy_select_ident(dev_net(skb->dev), skb);
+	fptr->identification = skb_shinfo(skb)->ip6_frag_id;
+
+	/* Fragment the skb. ipv6 header and the remaining fields of the
+	 * fragment header are updated in ipv6_gso_segment()
+	 */
+	segs = skb_segment(skb, features);
 out:
 	return segs;
 }
-- 
2.8.0.rc2


* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
                   ` (7 preceding siblings ...)
  2016-06-16 17:52 ` [PATCH net-next 8/8] tou: Support for GSO Tom Herbert
@ 2016-06-16 18:10 ` Rick Jones
  2016-06-16 23:15 ` Hannes Frederic Sowa
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Rick Jones @ 2016-06-16 18:10 UTC (permalink / raw)
  To: Tom Herbert, davem, netdev; +Cc: kernel-team

On 06/16/2016 10:51 AM, Tom Herbert wrote:

> Note that #1 is really about running a transport stack in userspace
> applications in clients, not necessarily servers. For servers we
> intend to modified the kernel stack in order to leverage existing
> implementation for building scalable serves (hence these patches).

Only if there is a v2 for other reasons...  I assume that was meant to 
be "scalable servers."


> Tested: Various cases of TOU with IPv4, IPv6 using TCP_STREAM and
> TCP_RR. Also, tested IPIP for comparing TOU encapsulation to IP
> tunneling.
>
>      - IPv6 native
>        1 TCP_STREAM
> 	8394 tps

TPS for TCP_STREAM?  Is that Mbit/s?

>        200 TCP_RR
> 	1726825 tps
> 	100/177/361 90/95/99% latencies

To enhance the already good comprehensiveness of the numbers, a 1 TCP_RR 
showing the effect on latency rather than aggregate PPS would be 
goodness, as would a comparison of the service demands of the different 
single-stream results.

CPU and NIC models would provide excellent context for the numbers.

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag
  2016-06-16 17:51 ` [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag Tom Herbert
@ 2016-06-16 18:58   ` Alexander Duyck
  2016-06-16 20:18     ` Tom Herbert
  2016-06-17 22:33     ` Tom Herbert
  0 siblings, 2 replies; 42+ messages in thread
From: Alexander Duyck @ 2016-06-16 18:58 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Netdev, Kernel Team

On Thu, Jun 16, 2016 at 10:51 AM, Tom Herbert <tom@herbertland.com> wrote:
> This replaces gso_type SKB_GSO_DODGY with a new tx_flag named
> SKBTX_UNTRUSTED_SOURCE. This more generically describes the skb
> being created from an untrusted source as a characteristic of an skbuff.
> This also frees up one gso_type flag bit.
>
> Signed-off-by: Tom Herbert <tom@herbertland.com>

Instead of leaving this bit in the shared_info why not look at moving
it into the sk_buff itself?  It seems like this might be a better
candidate for something like that as a large part of what the dodgy
bit represents is that the header offsets are likely not set correctly
and need to be parsed out and updated.  It might make more sense to
place this in the slot just after remcsum_offload.  That way once all
the header offsets have been updated you could just update this one
bit to indicate that the header offsets stored in this sk_buff are
valid.

I also don't see where these changes address any changes needed to
skb_gso_ok in order to actually trigger the partial walk through the
GSO code.  You probably need to look at adding a statement there to do
a check for your untrusted source bit versus the GSO_ROBUST feature.
It probably doesn't need to be much, just something like tacking on a
"&& (!skb_is_untrustued(skb) || (features & NETIF_F_GSO_ROBUST)" to
the conditional statement.
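
A minimal sketch of that check, for illustration only -- this is not
from the posted patches, and skb_is_untrusted() is a hypothetical
helper for the proposed SKBTX_UNTRUSTED_SOURCE tx_flag:

	/* Sketch only (kernel context, linux/skbuff.h + linux/netdevice.h).
	 * skb_is_untrusted() is hypothetical, standing in for a test of the
	 * new SKBTX_UNTRUSTED_SOURCE bit in skb_shinfo()->tx_flags.
	 */
	static inline bool skb_is_untrusted(const struct sk_buff *skb)
	{
		return skb_shinfo(skb)->tx_flags & SKBTX_UNTRUSTED_SOURCE;
	}

	static inline bool skb_gso_ok(struct sk_buff *skb,
				      netdev_features_t features)
	{
		return net_gso_ok(features, skb_shinfo(skb)->gso_type) &&
		       (!skb_has_frag_list(skb) ||
			(features & NETIF_F_FRAGLIST)) &&
		       (!skb_is_untrusted(skb) ||
			(features & NETIF_F_GSO_ROBUST));
	}

With something like that in place, an untrusted skb would take the
software GSO/validation path unless the device advertises
NETIF_F_GSO_ROBUST.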

- Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag
  2016-06-16 18:58   ` Alexander Duyck
@ 2016-06-16 20:18     ` Tom Herbert
  2016-06-16 20:33       ` Alexander Duyck
  2016-06-17 22:33     ` Tom Herbert
  1 sibling, 1 reply; 42+ messages in thread
From: Tom Herbert @ 2016-06-16 20:18 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, Netdev, Kernel Team

On Thu, Jun 16, 2016 at 11:58 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Thu, Jun 16, 2016 at 10:51 AM, Tom Herbert <tom@herbertland.com> wrote:
>> This replaces gso_type SKB_GSO_DODGY with a new tx_flag named
>> SKBTX_UNTRUSTED_SOURCE. This more generically describes the skb
>> being created from an untrusted source as a characteristic of an skbuff.
>> This also frees up one gso_type flag bit.
>>
>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>
> Instead of leaving this bit in the shared_info why not look at moving
> it into the sk_buff itself?  It seems like this might be a better
> candidate for something like that as a large part of what the dodgy
> bit represents is that the header offsets are likely not set correctly
> and need to be parsed out and updated.  It might make more sense to
> place this in the slot just after remcsum_offload.  That way once all
> the header offsets have been updated you could just update this one
> bit to indicate that the header offsets stored in this sk_buff are
> valid.
>
> I also don't see where these changes address any changes needed to
> skb_gso_ok in order to actually trigger the partial walk through the
> GSO code.  You probably need to look at adding a statement there to do
> a check for your untrusted source bit versus the GSO_ROBUST feature.
> It probably doesn't need to be much, just something like tacking on a
> "&& (!skb_is_untrustued(skb) || (features & NETIF_F_GSO_ROBUST)" to
> the conditional statement.
>
All the places where SKB_GSO_DODGY was being checked should have been
replaced with SKBTX_UNTRUSTED_SOURCE.

> - Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag
  2016-06-16 20:18     ` Tom Herbert
@ 2016-06-16 20:33       ` Alexander Duyck
  0 siblings, 0 replies; 42+ messages in thread
From: Alexander Duyck @ 2016-06-16 20:33 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Netdev, Kernel Team

On Thu, Jun 16, 2016 at 1:18 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Thu, Jun 16, 2016 at 11:58 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Thu, Jun 16, 2016 at 10:51 AM, Tom Herbert <tom@herbertland.com> wrote:
>>> This replaces gso_type SKB_GSO_DODGY with a new tx_flag named
>>> SKBTX_UNTRUSTED_SOURCE. This more generically describes the skb
>>> being created from an untrusted source as a characteristic of an skbuff.
>>> This also frees up one gso_type flag bit.
>>>
>>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>>
>> Instead of leaving this bit in the shared_info why not look at moving
>> it into the sk_buff itself?  It seems like this might be a better
>> candidate for something like that as a large part of what the dodgy
>> bit represents is that the header offsets are likely not set correctly
>> and need to be parsed out and updated.  It might make more sense to
>> place this in the slot just after remcsum_offload.  That way once all
>> the header offsets have been updated you could just update this one
>> bit to indicate that the header offsets stored in this sk_buff are
>> valid.
>>
>> I also don't see where these changes address any changes needed to
>> skb_gso_ok in order to actually trigger the partial walk through the
>> GSO code.  You probably need to look at adding a statement there to do
>> a check for your untrusted source bit versus the GSO_ROBUST feature.
>> It probably doesn't need to be much, just something like tacking on a
>> "&& (!skb_is_untrustued(skb) || (features & NETIF_F_GSO_ROBUST)" to
>> the conditional statement.
>>
> All the places where SKB_GSO_DODGY was being checked should have been
> replaced with SKBTX_UNTRUSTED_SOURCE.

Yes and no.  So net_gso_ok used to be one of those places but you
moved the one bit and dropped the other, so it is no longer being
checked there.

That is why I mentioned looking at adding one additional check for the
untrusted bit and NETIF_F_GSO_ROBUST.  You could probably just tack it
on at the end of the features check at the end of net_gso_ok.

- Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
                   ` (8 preceding siblings ...)
  2016-06-16 18:10 ` [PATCH net-next 0/8] tou: Transports over UDP - part I Rick Jones
@ 2016-06-16 23:15 ` Hannes Frederic Sowa
  2016-06-17 16:51   ` Tom Herbert
  2016-06-18  3:09 ` David Miller
  2016-06-22 22:15 ` Richard Weinberger
  11 siblings, 1 reply; 42+ messages in thread
From: Hannes Frederic Sowa @ 2016-06-16 23:15 UTC (permalink / raw)
  To: Tom Herbert, davem, netdev; +Cc: kernel-team

On 16.06.2016 19:51, Tom Herbert wrote:
> Transports over UDP is intended to encapsulate TCP and other transport
> protocols directly and securely in UDP.
> 
> The goal of this work is twofold:
> 
> 1) Allow applications to run their own transport layer stack (i.e.from
>    userspace). This eliminates dependencies on the OS (e.g. solves a
>    major dependency issue for Facebook on clients).
> 
> 2) Make transport layer headers (all of L4) invisible to the network
>    so that they can't do intrusive actions at L4. This will be enforced
>    with DTLS in use.
> 
> Note that #1 is really about running a transport stack in userspace
> applications in clients, not necessarily servers. For servers we
> intend to modified the kernel stack in order to leverage existing
> implementation for building scalable serves (hence these patches).
> 
> This is described in more detail in the Internet Draft:
> https://tools.ietf.org/html/draft-herbert-transports-over-udp-00
> 
> In Part I we implement a straightforward encapsulation of TCP in GUE.
> The implements the basic mechanics of TOU encapsulation for TCP,
> however does not yet implement the IP addressing interactions so
> therefore so this is not robust to use in the presence of NAT.
> TOU is enabled per socket with a new socket option. This
> implementation includes GSO, GRO, and RCO support.
> 
> These patches also establish the baseline performance of TOU
> and isolate the performance cost of UDP encapsulation. Performance
> results are below.
> 
> Tested: Various cases of TOU with IPv4, IPv6 using TCP_STREAM and
> TCP_RR. Also, tested IPIP for comparing TOU encapsulation to IP
> tunneling.

Thinking about middleboxes again:

E.g. https://tools.ietf.org/html/rfc6347#section-4.2.3 states that DTLS
packets are not allowed to be fragmented. Because of this and
furthermore because of the impossibility of clamp-mss-to-pmtu to work
anymore, do you have any idea on how reliable this can work?

Or is your plan to use a smaller MSS on all paths by default?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-16 23:15 ` Hannes Frederic Sowa
@ 2016-06-17 16:51   ` Tom Herbert
  2016-06-21 16:56     ` Hannes Frederic Sowa
  0 siblings, 1 reply; 42+ messages in thread
From: Tom Herbert @ 2016-06-17 16:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team

On Thu, Jun 16, 2016 at 4:15 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 16.06.2016 19:51, Tom Herbert wrote:
>> Transports over UDP is intended to encapsulate TCP and other transport
>> protocols directly and securely in UDP.
>>
>> The goal of this work is twofold:
>>
>> 1) Allow applications to run their own transport layer stack (i.e.from
>>    userspace). This eliminates dependencies on the OS (e.g. solves a
>>    major dependency issue for Facebook on clients).
>>
>> 2) Make transport layer headers (all of L4) invisible to the network
>>    so that they can't do intrusive actions at L4. This will be enforced
>>    with DTLS in use.
>>
>> Note that #1 is really about running a transport stack in userspace
>> applications in clients, not necessarily servers. For servers we
>> intend to modified the kernel stack in order to leverage existing
>> implementation for building scalable serves (hence these patches).
>>
>> This is described in more detail in the Internet Draft:
>> https://tools.ietf.org/html/draft-herbert-transports-over-udp-00
>>
>> In Part I we implement a straightforward encapsulation of TCP in GUE.
>> The implements the basic mechanics of TOU encapsulation for TCP,
>> however does not yet implement the IP addressing interactions so
>> therefore so this is not robust to use in the presence of NAT.
>> TOU is enabled per socket with a new socket option. This
>> implementation includes GSO, GRO, and RCO support.
>>
>> These patches also establish the baseline performance of TOU
>> and isolate the performance cost of UDP encapsulation. Performance
>> results are below.
>>
>> Tested: Various cases of TOU with IPv4, IPv6 using TCP_STREAM and
>> TCP_RR. Also, tested IPIP for comparing TOU encapsulation to IP
>> tunneling.
>
> Thinking about middleboxes again:
>
> E.g. https://tools.ietf.org/html/rfc6347#section-4.2.3 states that DTLS
> packets are not allowed to be fragmented. Because of this and
> furthermore because of the impossibility of clamp-mss-to-pmtu to work
> anymore, do you have any idea on how reliable this can work?
>
> Or is your plan to use a smaller MSS on all paths by default?
>
Normal PMTU discovery mechanisms are applicable to prevent
fragmentation. The encapsulation overhead is accounted for in the MSS
(similar to the overhead of TCP options or IPv6 extension headers).
Besides that, RFC6347 describes how fragmentation should be avoided; it
does not explicitly forbid fragmentation, and no IP protocol can
outright forbid it. At most they could try to require that the DF bit
always be set, but that won't always be obeyed, e.g. when packets are
tunneled in the network.
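
For a rough illustration of that accounting (numbers assume a 1500-byte
MTU, no TCP options, and the 4-byte GUE base header; actual overhead
depends on the GUE options in use, e.g. RCO):

	native TCP/IPv6:  MSS = 1500 - 40 (IPv6) - 20 (TCP)               = 1440
	TCP in TOU/GUE:   MSS = 1500 - 40 (IPv6) - 8 (UDP) - 4 (GUE) - 20 = 1428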

Tom

> Thanks,
> Hannes
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag
  2016-06-16 18:58   ` Alexander Duyck
  2016-06-16 20:18     ` Tom Herbert
@ 2016-06-17 22:33     ` Tom Herbert
  1 sibling, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-17 22:33 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, Netdev, Kernel Team

On Thu, Jun 16, 2016 at 11:58 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Thu, Jun 16, 2016 at 10:51 AM, Tom Herbert <tom@herbertland.com> wrote:
>> This replaces gso_type SKB_GSO_DODGY with a new tx_flag named
>> SKBTX_UNTRUSTED_SOURCE. This more generically describes the skb
>> being created from an untrusted source as a characteristic of an skbuff.
>> This also frees up one gso_type flag bit.
>>
>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>
> Instead of leaving this bit in the shared_info why not look at moving
> it into the sk_buff itself?  It seems like this might be a better
> candidate for something like that as a large part of what the dodgy
> bit represents is that the header offsets are likely not set correctly
> and need to be parsed out and updated.  It might make more sense to
> place this in the slot just after remcsum_offload.  That way once all
> the header offsets have been updated you could just update this one
> bit to indicate that the header offsets stored in this sk_buff are
> valid.
>
I don't really understand what the point of SKB_GSO_DODGY is. Seems
like we should be verifying the correct values are set up front in the
vnet, not relying on the core stack to have to worry about this narrow
use case. Fields in the skbuff should be set correctly all the time in
the stack as an invariant, I think; if they're not correct, it is the
fault of the code setting the fields.

> I also don't see where these changes address any changes needed to
> skb_gso_ok in order to actually trigger the partial walk through the
> GSO code.  You probably need to look at adding a statement there to do
> a check for your untrusted source bit versus the GSO_ROBUST feature.
> It probably doesn't need to be much, just something like tacking on a
> "&& (!skb_is_untrustued(skb) || (features & NETIF_F_GSO_ROBUST)" to
> the conditional statement.
>
> - Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
                   ` (9 preceding siblings ...)
  2016-06-16 23:15 ` Hannes Frederic Sowa
@ 2016-06-18  3:09 ` David Miller
  2016-06-18  3:52   ` Tom Herbert
  2016-06-22 22:15 ` Richard Weinberger
  11 siblings, 1 reply; 42+ messages in thread
From: David Miller @ 2016-06-18  3:09 UTC (permalink / raw)
  To: tom; +Cc: netdev, kernel-team

From: Tom Herbert <tom@herbertland.com>
Date: Thu, 16 Jun 2016 10:51:54 -0700

> The goal of this work is twofold:
> 
> 1) Allow applications to run their own transport layer stack (i.e.from
>    userspace). This eliminates dependencies on the OS (e.g. solves a
>    major dependency issue for Facebook on clients).

Clients need to support TOU in their kernels, it's a similar kind of
dependency, but of course you only need to propagate it once rather
than once per desired stack change.

We also now have to debug against every single userland TCP
implementation someone can come up with, the barrier to entry is
insanely low with TOU.  Maybe this sounds great to you, but to me
it is quite terrifying

This has a monumental impact upon maintenance of our TCP stack.

I suspect a lot of bug reports will go straight to /dev/null once it
is clear that it is a TOU TCP connection.

For TCP the tight integration into the kernel is a benefit, because it
limits the number of variables at stake when trying to analyze
problems.

> 2) Make transport layer headers (all of L4) invisible to the network
>    so that they can't do intrusive actions at L4. This will be enforced
>    with DTLS in use.

This achievement is only met when DTLS is enabled, and it isn't enabled
by default.

This sounds really great on paper, however I find it hard to believe
that makers of middleware boxes are just going to throw their hands in
the air and say "oh well, we lose."

Rather, I think people are going to start adding rules to block TOU
tunnels entirely because they cannot inspect nor conditionally
filter/rewrite the contents.  This is even more likely if Joe Random
can so easily do their own userland TCP stack over TOU.

BTW, I have a question about the pseudo header checksum.  If the outer
IP addresses can change, due to NAT or whatever, and therefore the
session key is used for connection demuxing instead of the addresses,
why don't we also have to use the session key instead of the IP
addresses for the pseudo header used in checksum calculations?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-18  3:09 ` David Miller
@ 2016-06-18  3:52   ` Tom Herbert
  2016-06-19 20:15     ` Hajime Tazaki
                       ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-18  3:52 UTC (permalink / raw)
  To: David Miller; +Cc: Linux Kernel Network Developers, Kernel Team

> We also now have to debug against every single userland TCP
> implementation someone can come up with, the barrier to entry is
> insanely low with TOU.  Maybe this sounds great to you, but to me
> it is quite terrifying
>
No, it doesn't sound great, but the major problem we have is that
Android and to some extent iOS & Windows take a long time to update
the kernel, and it can take an _extremely_ long time if we need them
to actively enable features that are needed by applications. For
instance, TFO was put in the Linux several years ago, but it still
hasn't been enabled in Android and only fairly recently enabled in
iOS. If we (e.g. Facebook) implement a userspace stack in clients and
control the stack in our servers we can roll out a feature like that
in a couple of months. I don't see any way to fix this other than trying to
take control of our own destiny.  It seems like it's either this or
use something like QUIC which bypasses the kernel completely and
discards TCP-- we have far too much invested to commit to that
alternative at this point.

> This sounds really great on paper, however I find it hard to believe
> that makers of middleware boxes are just going to throw their hands in
> the air and say "oh well, we lose."
>
Happy Eyeballs is implied. There's pretty good data that the majority
(>90%) of the Internet will pass UDP without issue; QUIC is already
seeing good deployment, and there are several UDP-based protocols
productively deployed. There is also an effort in the IETF, PLUS, to
define a substrate layer that makes UDP-based transport protocols
palatable to middleboxes with some sort of signaling.

> Rather, I think people are going to start adding rules to block TOU
> tunnels entirely because they cannot inspect nor conditionally
> filter/rewrite the contents.  This is even more likely if Joe Random
> can so easily do their own userland TCP stack over TOU.
>
Unfortunately, encryption is the only proven solution to protocol
ossification. If the network doesn't see it, it can't ossify it.

> BTW, I have a question about the pseudo header checksum.  If the outer
> IP addresses can change, due to NAT or whatever, and therefore the
> session key is used for connection demuxing instead of the addresses,
> why don't we also have to use the session key instead of the IP
> addresses for the pseudo header used in checksum calculations?

Yes, the inner checksum will need to be handled differently in case of
NAT. Need to update the draft to take that into account.
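
For context, a sketch of the conventional checksum finalization
(illustration only, not from the patches; example_tcp6_csum() is a
made-up name, kernel context with net/ip6_checksum.h) showing why a
NAT rewrite of the addresses breaks an inner checksum computed this way:

	/* Illustration only: the checksum is seeded from a pseudo header
	 * over the source/destination addresses, so rewriting the outer
	 * addresses in flight invalidates it.
	 */
	static void example_tcp6_csum(struct tcphdr *th, int len,
				      const struct in6_addr *saddr,
				      const struct in6_addr *daddr)
	{
		th->check = 0;
		th->check = tcp_v6_check(len, saddr, daddr,
					 csum_partial(th, len, 0));
	}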

Thanks,
Tom

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-18  3:52   ` Tom Herbert
@ 2016-06-19 20:15     ` Hajime Tazaki
  2016-06-20  3:07     ` David Miller
  2016-06-21 17:11     ` Hannes Frederic Sowa
  2 siblings, 0 replies; 42+ messages in thread
From: Hajime Tazaki @ 2016-06-19 20:15 UTC (permalink / raw)
  To: tom; +Cc: davem, netdev, kernel-team


On Fri, 17 Jun 2016 20:52:55 -0700,
Tom Herbert wrote:
> 
> > We also now have to debug against every single userland TCP
> > implementation someone can come up with, the barrier to entry is
> > insanely low with TOU.  Maybe this sounds great to you, but to me
> > it is quite terrifying
> >
> No, it doesn't sound great, but the major problem we have is that
> Android and to some extent iOS & Windows take a long time to update
> the kernel, and it can take an _extremely_ long time if we need them
> to actively enable features that are needed by applications. For
> instance, TFO was put in the Linux several years ago, but it still
> hasn't been enabled in Android and only fairly recently enabled in
> iOS. 

This is exactly the same motivation that LibOS (now
joined with LKL) has - to have a network stack personality.
Without adding extra *layers* to the protocol headers,
an application can freely benefit from any protocol extensions
without updating its host kernel.

The performance is far lower than TOU at this stage (we also
have netperf benchmark results) but I'm positive we can improve
this.

So, I would say: why not LKL?

* LKL
https://lwn.net/Articles/662953/

-- Hajime

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-18  3:52   ` Tom Herbert
  2016-06-19 20:15     ` Hajime Tazaki
@ 2016-06-20  3:07     ` David Miller
  2016-06-20 15:13       ` Tom Herbert
  2016-06-21 17:11     ` Hannes Frederic Sowa
  2 siblings, 1 reply; 42+ messages in thread
From: David Miller @ 2016-06-20  3:07 UTC (permalink / raw)
  To: tom; +Cc: netdev, kernel-team

From: Tom Herbert <tom@herbertland.com>
Date: Fri, 17 Jun 2016 20:52:55 -0700

> For instance, TFO was put in the Linux several years ago, but it
> still hasn't been enabled in Android and only fairly recently
> enabled in iOS.

"Android decided to get locked into a really old kernel for 6+ years"
is not really a good argument, sorry.

We've already been hurt badly by the poor decisions the Android folks
have made wrt. the handling of their kernel.

Let's not make it worse by also making userland TCP stacks ubiquitous
as a side effect.

I've been assured several times that the Android situation will at
the very least improve.

And if it does improve and features do propagate more quickly, we'll
have taken on all of this risk and gambling unnecessarily.

Let's not route around the Android problem, but rather get them to
address it properly.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-20  3:07     ` David Miller
@ 2016-06-20 15:13       ` Tom Herbert
  2016-06-21  8:29         ` David Miller
  0 siblings, 1 reply; 42+ messages in thread
From: Tom Herbert @ 2016-06-20 15:13 UTC (permalink / raw)
  To: David Miller; +Cc: Linux Kernel Network Developers, Kernel Team

On Sun, Jun 19, 2016 at 8:07 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Fri, 17 Jun 2016 20:52:55 -0700
>
>> For instance, TFO was put in the Linux several years ago, but it
>> still hasn't been enabled in Android and only fairly recently
>> enabled in iOS.
>
> "Android decided to get locked into a really old kernel for 6+ years"
> is not really a good argument, sorry.
>
> We've already been hurt badly by the poor decisions the Android folks
> have made wrt. the handling of their kernel.
>
The problem is not just with Android. We are also very much dependent
on Windows, iOS, and a whole bunch of network device vendors and network
operators to run applications over the Internet. At least with Android we
have a path to get changes into the kernel pending the "next rebase".
Windows and iOS are black boxes; the only way we can get changes in is
to deal with their engineering directly (both have solid kernel
engineering, but they still set their own agendas). The situation is
worse with middleboxes. They are all over the place in what they are
doing, and many take a great deal of liberty in choosing which parts of
the standards to implement-- we have no way to directly influence them.

> Let's not make it worse by also making userland TCP stacks ubiquitous
> as a side effect.
>
> I've been assured several times that the Android situation will at
> the very least improve.
>
> And if it does improve and features do propagate more quickly, we'll
> have all of this risk and gambling unnecessarily.
>
> Let's not route around the Android problem, but rather get them to
> address it properly.

Routing around the problem is already being done. From
draft-tsvwg-quic-protocol-02:

"QUIC operates entirely in userspace, and is currently shipped to
users as a part of the Chromium browser, enabling rapid deployment and
experimentation.  As a userspace transport atop UDP, QUIC allows
innovations which have proven difficult to deploy with existing
protocols as they are hampered by legacy clients and middleboxes, or
by prolonged Operating System development and deployment cycles."

We can't ignore this. We can't just dismiss this as an impending
failure because it conflicts with our idea of what the architecture of
the Internet should be. We would be remiss if we didn't consider
solutions like QUIC or develop competitive alternatives like we're
doing with TOU.

Tom

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-20 15:13       ` Tom Herbert
@ 2016-06-21  8:29         ` David Miller
  2016-06-22  3:42           ` Jerry Chu
  0 siblings, 1 reply; 42+ messages in thread
From: David Miller @ 2016-06-21  8:29 UTC (permalink / raw)
  To: tom; +Cc: netdev, kernel-team

From: Tom Herbert <tom@herbertland.com>
Date: Mon, 20 Jun 2016 08:13:48 -0700

> Routing around the problem is already being done.

QUIC, a new protocol used for specific purposes and implemented in
userspace from the start is significantly different from making the
kernel's _TCP_ implementation bypassed into a userspace one just by
UDP encapsulating it.

That is a major and conscious change in our mentality.

The consequences are far and wide, and I'm having a very hard time
seeing the benefits you cite being larger than the negatives here.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-17 16:51   ` Tom Herbert
@ 2016-06-21 16:56     ` Hannes Frederic Sowa
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Frederic Sowa @ 2016-06-21 16:56 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team

On 17.06.2016 09:51, Tom Herbert wrote:
> On Thu, Jun 16, 2016 at 4:15 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> On 16.06.2016 19:51, Tom Herbert wrote:
>>> Transports over UDP is intended to encapsulate TCP and other transport
>>> protocols directly and securely in UDP.
>>>
>>> The goal of this work is twofold:
>>>
>>> 1) Allow applications to run their own transport layer stack (i.e.from
>>>    userspace). This eliminates dependencies on the OS (e.g. solves a
>>>    major dependency issue for Facebook on clients).
>>>
>>> 2) Make transport layer headers (all of L4) invisible to the network
>>>    so that they can't do intrusive actions at L4. This will be enforced
>>>    with DTLS in use.
>>>
>>> Note that #1 is really about running a transport stack in userspace
>>> applications in clients, not necessarily servers. For servers we
>>> intend to modified the kernel stack in order to leverage existing
>>> implementation for building scalable serves (hence these patches).
>>>
>>> This is described in more detail in the Internet Draft:
>>> https://tools.ietf.org/html/draft-herbert-transports-over-udp-00
>>>
>>> In Part I we implement a straightforward encapsulation of TCP in GUE.
>>> The implements the basic mechanics of TOU encapsulation for TCP,
>>> however does not yet implement the IP addressing interactions so
>>> therefore so this is not robust to use in the presence of NAT.
>>> TOU is enabled per socket with a new socket option. This
>>> implementation includes GSO, GRO, and RCO support.
>>>
>>> These patches also establish the baseline performance of TOU
>>> and isolate the performance cost of UDP encapsulation. Performance
>>> results are below.
>>>
>>> Tested: Various cases of TOU with IPv4, IPv6 using TCP_STREAM and
>>> TCP_RR. Also, tested IPIP for comparing TOU encapsulation to IP
>>> tunneling.
>>
>> Thinking about middleboxes again:
>>
>> E.g. https://tools.ietf.org/html/rfc6347#section-4.2.3 states that DTLS
>> packets are not allowed to be fragmented. Because of this and
>> furthermore because of the impossibility of clamp-mss-to-pmtu to work
>> anymore, do you have any idea on how reliable this can work?
>>
>> Or is your plan to use a smaller MSS on all paths by default?
>>
> Normal PMTU discovery mechanisms are applicable to prevent
> fragmentation. The encapsulation overhead is accounted for in the MSS
> (similar to the overhead of TCP options or IPv6 extension headers).
> Besides that, RFC6347 describes how fragmentation should be avoided; it
> does not explicitly forbid fragmentation, and no IP protocol can
> outright forbid it. At most they could try to require that the DF bit
> always be set, but that won't always be obeyed, e.g. when packets are
> tunneled in the network.

I agree, the specification is a bit unclear about what to do, but in terms
of not causing fragmentation it seems pretty clear to me:

"
   Each DTLS record MUST fit within a single datagram.  In order to
   avoid IP fragmentation, clients of the DTLS record layer SHOULD
   attempt to size records so that they fit within any PMTU estimates
   obtained from the record layer.
"

DTLS has invented its own fragmentation just to make sure that the
handshake actually doesn't depend on IP layer fragmentation.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-18  3:52   ` Tom Herbert
  2016-06-19 20:15     ` Hajime Tazaki
  2016-06-20  3:07     ` David Miller
@ 2016-06-21 17:11     ` Hannes Frederic Sowa
  2016-06-21 17:23       ` Tom Herbert
  2 siblings, 1 reply; 42+ messages in thread
From: Hannes Frederic Sowa @ 2016-06-21 17:11 UTC (permalink / raw)
  To: Tom Herbert, David Miller; +Cc: Linux Kernel Network Developers, Kernel Team

On 17.06.2016 20:52, Tom Herbert wrote:
> 
>> > Rather, I think people are going to start adding rules to block TOU
>> > tunnels entirely because they cannot inspect nor conditionally
>> > filter/rewrite the contents.  This is even more likely if Joe Random
>> > can so easily do their own userland TCP stack over TOU.
>> >
> Unfortunately, encryption is the only proven solution to protocol
> ossification. If the network doesn't see it, it can't ossify it.

DTLS still carries a lot of information, both in its handshake and in
the actual framing. The protocol is basically just TLS on top of
datagrams and as such implements connection establishment and teardown,
which middleboxes can certainly track. It will just be a matter of time
until middleboxes and security appliances are able to track those
connections; even if they cannot inspect the content, they can at least
see the certificates in clear text and as such also have the common
names and other addressing information at hand. The metadata may well
be trackable.

Because of replay protection you can actually infer the number of bytes
transferred, and someone could end up building congestion control on a
middlebox based on that, infer retransmissions, etc.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-21 17:11     ` Hannes Frederic Sowa
@ 2016-06-21 17:23       ` Tom Herbert
  0 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-21 17:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, Linux Kernel Network Developers, Kernel Team

On Tue, Jun 21, 2016 at 10:11 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 17.06.2016 20:52, Tom Herbert wrote:
>>
>>> > Rather, I think people are going to start adding rules to block TOU
>>> > tunnels entirely because they cannot inspect nor conditionally
>>> > filter/rewrite the contents.  This is even more likely if Joe Random
>>> > can so easily do their own userland TCP stack over TOU.
>>> >
>> Unfortunately, encryption is the only proven solution to protocol
>> ossification. If the network doesn't see it, it can't ossify it.
>
> DTLS still carries a lot of information, both in its handshake and in
> the actual framing. The protocol is basically just TLS on top of
> datagrams and as such implements connection establishment and teardown,
> which middleboxes can certainly track. It will just be a matter of time
> until middleboxes and security appliances are able to track those
> connections; even if they cannot inspect the content, they can at least
> see the certificates in clear text and as such also have the common
> names and other addressing information at hand. The metadata may well
> be trackable.
>
> Because of replay protection you can actually infer the number of bytes
> transferred, and someone could end up building congestion control on a
> middlebox based on that, infer retransmissions, etc.
>
Right, it's probably impossible to completely eliminate track-ability.
But hopefully we can keep the plain text information to the absolute
minimum needed to send the packet over the network and decrypt it at
the receiver.

One interesting characteristic of disassociated location is that we
could purposely try to manipulate ECMP so that every packet for a flow
takes a different path, so no single device (assuming multi-path) can
reconstruct the whole communication (kind of like spread spectrum for
the Internet). I imagine there might be some environments
where paranoids might want to do this.

Tom

> Bye,
> Hannes
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-21  8:29         ` David Miller
@ 2016-06-22  3:42           ` Jerry Chu
  2016-06-22  4:06             ` David Ahern
                               ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Jerry Chu @ 2016-06-22  3:42 UTC (permalink / raw)
  To: David Miller; +Cc: Tom Herbert, netdev, kernel-team

On Tue, Jun 21, 2016 at 1:29 AM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Mon, 20 Jun 2016 08:13:48 -0700
>
>> Routing around the problem is already being done.
>
> QUIC, a new protocol used for specific purposes and implemented in
> userspace from the start is significantly different from making the
> kernel's _TCP_ implementation bypassed into a userspace one just by
> UDP encapsulating it.
>
> That is a major and conscious change in our mentality.
>
> The consequences are far and wide, and I'm having a very hard time
> seeing the benefits you cite being larger than the negatives here.

I don't believe TOU will lead to a proliferation of TCP implementations in
the userland - getting a solid TCP implementation is hard. Yes any smart CS
student in the networking field can write one over a weekend, to get 3WHS
to work, and may even include graceful shutdown. But creating one from
scratch that is both high quality, compliant, highly inter-operable, and highly
performing is really hard. Just look at how much work folks on the list have
to continue to pour in to maintain the Linux TCP stack as the best on the
planet.

Yes TOU may lower the bar for random hacks by Joe Random. But I'd argue
no large organization would seriously consider or dare to deploy a TCP stack
with random hacks. I know we have a very high bar to pass at Google. This
should limit the impact of bad TCP stacks on the Internet. If we continue
to keep up and make timely improvements to the Linux TCP stack, and
better yet, to continue to improve technology like UML and LKL to make it
easy for folks to access great technologies in the Linux kernel stack and
deploy them in the userland, it will probably take away all the motivations
for people to do their own random hacks.

Best,

Jerry

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-22  3:42           ` Jerry Chu
@ 2016-06-22  4:06             ` David Ahern
  2016-06-22 19:24               ` David Miller
  2016-06-22 17:40             ` Tom Herbert
  2016-06-22 19:23             ` David Miller
  2 siblings, 1 reply; 42+ messages in thread
From: David Ahern @ 2016-06-22  4:06 UTC (permalink / raw)
  To: Jerry Chu, David Miller; +Cc: Tom Herbert, netdev, kernel-team

On 6/21/16 9:42 PM, Jerry Chu wrote:
> Yes TOU may lower the bar for random hacks by Joe Random. But I'd argue
> no large organization would seriously consider or dare to deploy a TCP stack
> with random hacks.

There are userspace network stacks that have been around for years and 
widely deployed on devices that basically use Linux as the boot OS.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-22  3:42           ` Jerry Chu
  2016-06-22  4:06             ` David Ahern
@ 2016-06-22 17:40             ` Tom Herbert
  2016-06-22 19:23             ` David Miller
  2 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-22 17:40 UTC (permalink / raw)
  To: Jerry Chu; +Cc: David Miller, netdev, Kernel Team

On Tue, Jun 21, 2016 at 8:42 PM, Jerry Chu <hkchu@google.com> wrote:
> On Tue, Jun 21, 2016 at 1:29 AM, David Miller <davem@davemloft.net> wrote:
>> From: Tom Herbert <tom@herbertland.com>
>> Date: Mon, 20 Jun 2016 08:13:48 -0700
>>
>>> Routing around the problem is already being done.
>>
>> QUIC, a new protocol used for specific purposes and implemented in
>> userspace from the start is significantly different from making the
>> kernel's _TCP_ implementation bypassed into a userspace one just by
>> UDP encapsulating it.
>>
>> That is a major and conscious change in our mentality.
>>
>> The consequences are far and wide, and I'm having a very hard time
>> seeing the benefits you cite being larger than the negatives here.
>
> I don't believe TOU will lead to a proliferation of TCP implementations in
> the userland - getting a solid TCP implementation is hard. Yes any smart CS
> student in the networking field can write one over a weekend, to get 3WHS
> to work, and may even include graceful shutdown. But creating one from
> scratch that is both high quality, compliant, highly inter-operable, and highly
> performing is really hard. Just look at how much work folks on the list have
> to continue to pour in to maintain the Linux TCP stack as the best on the
> planet.
>
> Yes TOU may lower the bar for random hacks by Joe Random. But I'd argue
> no large organization would seriously consider or dare to deploy a TCP stack
> with random hacks. I know we have a very high bar to pass at Google. This
> should limit the impact of bad TCP stacks on the Internet. If we continue
> to keep up and make timely improvements to the Linux TCP stack, and
> better yet, to continue to improve technology like UML and LKL to make it
> easy for folks to access great technologies in the Linux kernel stack and
> deploy them in the userland, it will probably take away all the motivations
> for people to do their own random hacks.
>
+1

A major point of TOU is precisely that we want to continue leveraging
the Linux stack to build scalable, flexible, robust servers. We are
only considering userspace-based transport protocols for end clients,
which are already "Joe Random" devices as far as servers are
concerned. The difference between TOU and "traditional" OS bypass
networking is that we are implementing only the transport layer
protocol in userspace (i.e. TCP, SCTP, DCCP, etc.), not anything
related to L2 and L3, which makes the solution feasible to deploy on
existing client OSes.
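
A minimal sketch of what that means on the client side (illustration
only, not from the patches; the GUE/TCP framing itself would come from
the userspace transport stack and is just an opaque buffer here, and
tou_client_send() is a made-up name):

	#include <stddef.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <unistd.h>

	/* Send one frame produced by a userspace transport stack.  The
	 * kernel only provides the UDP socket (L3 and below); everything
	 * at L4 lives in the application.
	 */
	static int tou_client_send(const struct sockaddr_in6 *server,
				   const void *frame, size_t len)
	{
		int fd = socket(AF_INET6, SOCK_DGRAM, 0);
		ssize_t n;

		if (fd < 0)
			return -1;
		n = sendto(fd, frame, len, 0,
			   (const struct sockaddr *)server, sizeof(*server));
		close(fd);
		return n < 0 ? -1 : 0;
	}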

Tom

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-22  3:42           ` Jerry Chu
  2016-06-22  4:06             ` David Ahern
  2016-06-22 17:40             ` Tom Herbert
@ 2016-06-22 19:23             ` David Miller
  2016-06-25 15:56               ` Tom Herbert
  2 siblings, 1 reply; 42+ messages in thread
From: David Miller @ 2016-06-22 19:23 UTC (permalink / raw)
  To: hkchu; +Cc: tom, netdev, kernel-team

From: Jerry Chu <hkchu@google.com>
Date: Tue, 21 Jun 2016 20:42:19 -0700

> I don't believe TOU will lead to a proliferation of TCP
> implementations in the userland - getting a solid TCP implementation
> is hard.

The fear isn't doing legitimate things.

It's making TCP stacks that do evil stuff on purpose.

Also, making proprietary TCP stacks that override the kernel one.

And finally, here's the best part, all of the above can be done as a
new, huge, potential attack vector for hackers.

All they need is to get this evil TCP stack working once, then it's
in every root kit out there.  If you can take over the TCP stack of
a several hundred thousand machine strong botnet, imagine what you
could do.

And the TOU facility... facilitates this.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-22  4:06             ` David Ahern
@ 2016-06-22 19:24               ` David Miller
  0 siblings, 0 replies; 42+ messages in thread
From: David Miller @ 2016-06-22 19:24 UTC (permalink / raw)
  To: dsa; +Cc: hkchu, tom, netdev, kernel-team

From: David Ahern <dsa@cumulusnetworks.com>
Date: Tue, 21 Jun 2016 22:06:01 -0600

> On 6/21/16 9:42 PM, Jerry Chu wrote:
>> Yes TOU may lower the bar for random hacks by Joe Random. But I'd
>> argue
>> no large organization would seriously consider or dare to deploy a TCP stack
>> with random hacks.
> 
> There are userspace network stacks that have been around for years and
> widely deployed on devices that basically use Linux as the boot OS.

I'm not talking about TCP stacks that are trying to do things right.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
                   ` (10 preceding siblings ...)
  2016-06-18  3:09 ` David Miller
@ 2016-06-22 22:15 ` Richard Weinberger
  2016-06-22 22:56   ` Tom Herbert
  2016-06-23  7:40   ` David Miller
  11 siblings, 2 replies; 42+ messages in thread
From: Richard Weinberger @ 2016-06-22 22:15 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David S. Miller, netdev, kernel-team

On Thu, Jun 16, 2016 at 7:51 PM, Tom Herbert <tom@herbertland.com> wrote:
> Transports over UDP is intended to encapsulate TCP and other transport
> protocols directly and securely in UDP.
>
> The goal of this work is twofold:
>
> 1) Allow applications to run their own transport layer stack (i.e.from
>    userspace). This eliminates dependencies on the OS (e.g. solves a
>    major dependency issue for Facebook on clients).

Facebook on clients would be a Facebook app on mobile devices?
Does that mean that the Facebook app is so advanced and complicated
that it needs a special TCP stack?!

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-22 22:15 ` Richard Weinberger
@ 2016-06-22 22:56   ` Tom Herbert
  2016-06-23  7:40   ` David Miller
  1 sibling, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-22 22:56 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: David S. Miller, netdev, Kernel Team

On Wed, Jun 22, 2016 at 3:15 PM, Richard Weinberger
<richard.weinberger@gmail.com> wrote:
> On Thu, Jun 16, 2016 at 7:51 PM, Tom Herbert <tom@herbertland.com> wrote:
>> Transports over UDP is intended to encapsulate TCP and other transport
>> protocols directly and securely in UDP.
>>
>> The goal of this work is twofold:
>>
>> 1) Allow applications to run their own transport layer stack (i.e.from
>>    userspace). This eliminates dependencies on the OS (e.g. solves a
>>    major dependency issue for Facebook on clients).
>
> Facebook on clients would be a Facebook app on mobile devices?
> Does that mean that the Facebook app is so advanced and complicated
> that it needs a special TCP stack?!
>
Yes, in the sense that the Facebook app is probably the biggest single
app in mobile and probably has about the most users. Advancing the
transport layer, especially with regard to security and privacy, is
critical to maintaining long-term viability. But that being said,
security, protocol ossification, middlebox intrusion, and the demise of
the E2E model are everyone's problem. One major issue here, probably
the biggest issue on the whole Internet, is that the upgrade story for
core software (FW, OS, etc.) in devices attached to the Internet is
miserable-- to the point that some people think this undermines the
future of the Internet (e.g.
http://www.darkreading.com/vulnerabilities---threats/internet-of-things-devices-are-doomed/d/d-id/1315735).
TOU is a means to eliminate the dependencies we have on devices or
OSes being secure or being updated in a timely fashion to provide
security improvements.

Tom

> --
> Thanks,
> //richard

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-22 22:15 ` Richard Weinberger
  2016-06-22 22:56   ` Tom Herbert
@ 2016-06-23  7:40   ` David Miller
  2016-06-23  7:50     ` Richard Weinberger
  1 sibling, 1 reply; 42+ messages in thread
From: David Miller @ 2016-06-23  7:40 UTC (permalink / raw)
  To: richard.weinberger; +Cc: tom, netdev, kernel-team

From: Richard Weinberger <richard.weinberger@gmail.com>
Date: Thu, 23 Jun 2016 00:15:04 +0200

> On Thu, Jun 16, 2016 at 7:51 PM, Tom Herbert <tom@herbertland.com> wrote:
>> Transports over UDP is intended to encapsulate TCP and other transport
>> protocols directly and securely in UDP.
>>
>> The goal of this work is twofold:
>>
>> 1) Allow applications to run their own transport layer stack (i.e.from
>>    userspace). This eliminates dependencies on the OS (e.g. solves a
>>    major dependency issue for Facebook on clients).
> 
> Facebook on clients would be a Facebook app on mobile devices?
> Does that mean that the Facebook app is so advanced and complicated
> that it needs a special TCP stack?!

No, the TCP stack in the android/iOS/Windows kernel is so out of date
that in order to get even moderately recent TCP features it is
necessary to do this.

That's the point.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-23  7:40   ` David Miller
@ 2016-06-23  7:50     ` Richard Weinberger
  2016-06-24 21:12       ` Tom Herbert
  0 siblings, 1 reply; 42+ messages in thread
From: Richard Weinberger @ 2016-06-23  7:50 UTC (permalink / raw)
  To: David Miller; +Cc: tom, netdev, kernel-team

Am 23.06.2016 um 09:40 schrieb David Miller:
> From: Richard Weinberger <richard.weinberger@gmail.com>
> Date: Thu, 23 Jun 2016 00:15:04 +0200
> 
>> On Thu, Jun 16, 2016 at 7:51 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> Transports over UDP is intended to encapsulate TCP and other transport
>>> protocols directly and securely in UDP.
>>>
>>> The goal of this work is twofold:
>>>
>>> 1) Allow applications to run their own transport layer stack (i.e.from
>>>    userspace). This eliminates dependencies on the OS (e.g. solves a
>>>    major dependency issue for Facebook on clients).
>>
>> Facebook on clients would be a Facebook app on mobile devices?
>> Does that mean that the Facebook app is so advanced and complicated
>> that it needs a special TCP stack?!
> 
> No, the TCP stack in the android/iOS/Windows kernel is so out of date
> that in order to get even moderately recent TCP features it is
> necessary to do this.

I see.
So the plan is to bring TOU into almost every kernel out there
and then ship apps with their own TCP stacks, since vendors are unable
to deliver decent updates.

I didn't realize that the situation is *that* bad. :(

Thanks,
//richard

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-23  7:50     ` Richard Weinberger
@ 2016-06-24 21:12       ` Tom Herbert
  2016-06-24 21:36         ` Rick Jones
  0 siblings, 1 reply; 42+ messages in thread
From: Tom Herbert @ 2016-06-24 21:12 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: David Miller, Linux Kernel Network Developers, Kernel Team

On Thu, Jun 23, 2016 at 12:50 AM, Richard Weinberger <richard@nod.at> wrote:
> Am 23.06.2016 um 09:40 schrieb David Miller:
>> From: Richard Weinberger <richard.weinberger@gmail.com>
>> Date: Thu, 23 Jun 2016 00:15:04 +0200
>>
>>> On Thu, Jun 16, 2016 at 7:51 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>> Transports over UDP is intended to encapsulate TCP and other transport
>>>> protocols directly and securely in UDP.
>>>>
>>>> The goal of this work is twofold:
>>>>
>>>> 1) Allow applications to run their own transport layer stack (i.e.from
>>>>    userspace). This eliminates dependencies on the OS (e.g. solves a
>>>>    major dependency issue for Facebook on clients).
>>>
>>> Facebook on clients would be a Facebook app on mobile devices?
>>> Does that mean that the Facebook app is so advanced and complicated
>>> that it needs a special TCP stack?!
>>
>> No, the TCP stack in the android/iOS/Windows kernel is so out of date
>> that in order to get even moderately recent TCP features it is
>> necessary to do this.
>
> I see.
> So the plan is bringing TOU into almost every kernel out there
> and then ship Apps with their own TCP stacks since vendors are unable
> to deliver decent updates.
>
> I didn't realize that the situation is *that* worse. :(
>
The client OS side is only part of the story. Middlebox intrusion at
L4 is also a major issue we need to address. The "failure" of TFO is a
good case study. Both the upgrade issues on clients and the tendency
for some middleboxes to drop SYN packets with data have together
severely hindered what otherwise should have been a straightforward and
useful feature to deploy.

Tom

> //richard

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-24 21:12       ` Tom Herbert
@ 2016-06-24 21:36         ` Rick Jones
  2016-06-24 21:46           ` Tom Herbert
  0 siblings, 1 reply; 42+ messages in thread
From: Rick Jones @ 2016-06-24 21:36 UTC (permalink / raw)
  To: Tom Herbert, Richard Weinberger
  Cc: David Miller, Linux Kernel Network Developers, Kernel Team

On 06/24/2016 02:12 PM, Tom Herbert wrote:
> The client OS side is only part of the story. Middlebox intrusion at
> L4 is also a major issue we need to address. The "failure" of TFO is a
> good case study. Both the upgrade issues on clients and the tendency
> for some middleboxes to drop SYN packets with data have together
> severely hindered what otherwise should have been a straightforward and
> useful feature to deploy.

How would you define "severely?"  Has it actually been more severe than 
for say ECN?  Or than it was for say SACK or PAWS?

rick jones

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-24 21:36         ` Rick Jones
@ 2016-06-24 21:46           ` Tom Herbert
  2016-06-24 22:06             ` Rick Jones
  0 siblings, 1 reply; 42+ messages in thread
From: Tom Herbert @ 2016-06-24 21:46 UTC (permalink / raw)
  To: Rick Jones
  Cc: Richard Weinberger, David Miller,
	Linux Kernel Network Developers, Kernel Team

On Fri, Jun 24, 2016 at 2:36 PM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 06/24/2016 02:12 PM, Tom Herbert wrote:
>>
>> The client OS side is only part of the story. Middlebox intrusion at
>> L4 is also a major issue we need to address. The "failure" of TFO is a
>> good case study. Both the upgrade issues on clients and the tendency
>> for some middleboxes to drop SYN packets with data have together
>> severely hindered what otherwise should have been a straightforward and
>> useful feature to deploy.
>
>
> How would you define "severely?"  Has it actually been more severe than for
> say ECN?  Or than it was for say SACK or PAWS?
>
ECN is probably even a bigger disappointment in terms of seeing
deployment :-( From http://ecn.ethz.ch/ecn-pam15.pdf:

"Even though ECN was standardized in 2001, and it is widely
implemented in end-systems, it is barely deployed. This is due to a
history of problems with severely broken middleboxes shortly after
standardization, which led to connectivity failure and guidance to
leave ECN disabled."

SACK and PAWS seem to have fared a little better, I believe.

Tom

> rick jones
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-24 21:46           ` Tom Herbert
@ 2016-06-24 22:06             ` Rick Jones
  2016-06-24 23:43               ` Tom Herbert
  0 siblings, 1 reply; 42+ messages in thread
From: Rick Jones @ 2016-06-24 22:06 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Richard Weinberger, David Miller,
	Linux Kernel Network Developers, Kernel Team

On 06/24/2016 02:46 PM, Tom Herbert wrote:
> On Fri, Jun 24, 2016 at 2:36 PM, Rick Jones <rick.jones2@hpe.com> wrote:
>> How would you define "severely?"  Has it actually been more severe than for
>> say ECN?  Or than it was for say SACK or PAWS?
>>
> ECN is probably even a bigger disappointment in terms of seeing
> deployment :-( From http://ecn.ethz.ch/ecn-pam15.pdf:
>
> "Even though ECN was standardized in 2001, and it is widely
> implemented in end-systems, it is barely deployed. This is due to a
> history of problems with severely broken middleboxes shortly after
> standardization, which led to connectivity failure and guidance to
> leave ECN disabled."
>
> SACK and PAWS seem to have fared a little better, I believe.

The conclusion of that (rather interesting) paper reads:

"Our analysis therefore indicates that enabling ECN by default would
lead to connections to about five websites per thousand to suffer
additional setup latency with RFC 3168 fallback. This represents an
order of magnitude fewer than the about forty per thousand which
experience transient or permanent connection failure due to other
operational issues"

Doesn't that then suggest that not enabling ECN is basically a matter of 
FUD more than of middleboxes still assumed to be broken?

My main point is that in the past at least, trouble with broken 
middleboxes didn't lead us to start wrapping all our TCP/transport 
traffic in UDP to try to hide it from them.  We've managed to get SACK 
and PAWS universal without having to resort to that, and it would seem 
we could get ECN universal if we could overcome our FUD.  Why would TFO 
for instance be any different?
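
(For reference, the knobs involved on Linux are presumably just the usual
sysctls; this is a sketch of how one would turn it on, not something the
paper or this patch set prescribes:

    # request ECN on outgoing connections, keep accepting it inbound
    sysctl -w net.ipv4.tcp_ecn=1
    # retry without ECN if the ECN-setup SYN appears to be lost
    sysctl -w net.ipv4.tcp_ecn_fallback=1

i.e. the fallback behaviour the paper measures is already there to catch
the residual breakage.)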

There was an equally interesting second paragraph in the conclusion:

"As not all websites are equally popular, failures on five per thousand
websites does not by any means imply that five per thousand connection 
attempts will fail. While estimation of connection attempt rate by rank 
is out of scope of this work, we note that the highest ranked website 
exhibiting stable connection failure has rank 596, and only 13 such 
sites appear in the top 5000"

rick jones

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-24 22:06             ` Rick Jones
@ 2016-06-24 23:43               ` Tom Herbert
  2016-06-25  0:01                 ` Rick Jones
  0 siblings, 1 reply; 42+ messages in thread
From: Tom Herbert @ 2016-06-24 23:43 UTC (permalink / raw)
  To: Rick Jones
  Cc: Richard Weinberger, David Miller,
	Linux Kernel Network Developers, Kernel Team

On Fri, Jun 24, 2016 at 3:06 PM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 06/24/2016 02:46 PM, Tom Herbert wrote:
>>
>> On Fri, Jun 24, 2016 at 2:36 PM, Rick Jones <rick.jones2@hpe.com> wrote:
>>>
>>> How would you define "severely"?  Has it actually been more severe than
>>> for, say, ECN?  Or than it was for, say, SACK or PAWS?
>>>
>> ECN is probably even a bigger disappointment in terms of seeing
>> deployment :-( From http://ecn.ethz.ch/ecn-pam15.pdf:
>>
>> "Even though ECN was standardized in 2001, and it is widely
>> implemented in end-systems, it is barely deployed. This is due to a
>> history of problems with severely broken middleboxes shortly after
>> standardization, which led to connectivity failure and guidance to
>> leave ECN disabled."
>>
>> SACK and PAWS seem to have fared a little better, I believe.
>
>
> The conclusion of that (rather interesting) paper reads:
>
> "Our analysis therefore indicates that enabling ECN by default would
> lead to connections to about five websites per thousand to suffer
> additional setup latency with RFC 3168 fallback. This represents an
> order of magnitude fewer than the about forty per thousand which
> experience transient or permanent connection failure due to other
> operational issues"
>
> Doesn't that then suggest that not enabling ECN is basically a matter of FUD
> more than of middleboxes that are still assumed to be broken?
>
> My main point is that in the past at least, trouble with broken middleboxes
> didn't lead us to start wrapping all our TCP/transport traffic in UDP to try
> to hide it from them.  We've managed to get SACK and PAWS universal without
> having to resort to that, and it would seem we could get ECN universal if we
> could overcome our FUD.  Why would TFO for instance be any different?
>
Here are Christoph's slides on TFO in the wild, which present a good
summary of the middlebox problem. There is one significant difference
in that ECN needs network support whereas TFO doesn't. Given that
experience, I'm doubtful other new features at L4 could ever be
productively used (like EDO or maybe TCP-ENO).

https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf

Tom

> There was an equally interesting second paragraph in the conclusion:
>
> "As not all websites are equally popular, failures on five per thousand
> websites does not by any means imply that five per thousand connection
> attempts will fail. While estimation of connection attempt rate by rank is
> out of scope of this work, we note that the highest ranked website
> exhibiting stable connection failure has rank 596, and only 13 such sites
> appear in the top 5000"
>
> rick jones

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-24 23:43               ` Tom Herbert
@ 2016-06-25  0:01                 ` Rick Jones
  2016-06-25 16:22                   ` Tom Herbert
  0 siblings, 1 reply; 42+ messages in thread
From: Rick Jones @ 2016-06-25  0:01 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Richard Weinberger, David Miller,
	Linux Kernel Network Developers, Kernel Team

On 06/24/2016 04:43 PM, Tom Herbert wrote:
> Here are Christoph's slides on TFO in the wild, which present a good
> summary of the middlebox problem. There is one significant difference
> in that ECN needs network support whereas TFO doesn't. Given that
> experience, I'm doubtful other new features at L4 could ever be
> productively used (like EDO or maybe TCP-ENO).
>
> https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf

Perhaps I am being overly optimistic, but my takeaway from those slides 
is that Apple was able to come up with ways to deal with the middleboxes 
and so could indeed productively use TCP Fast Open.

"Overall, very good success-rate"
though tempered by
"But... middleboxes were a big issue in some ISPs..."

Though it doesn't get into how big (some connections, many, most, all?) 
and how many ISPs.

rick jones

Just an anecdote...  I am not a "power user" of my iPhone running 
9.3.2 (13F69), nor do I know whether anything I use is among the Apple 
services stated as using TFO (mostly Safari, Mail and Messages), but if 
it is, I cannot say I have noticed any trouble under the covers.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-22 19:23             ` David Miller
@ 2016-06-25 15:56               ` Tom Herbert
  0 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-25 15:56 UTC (permalink / raw)
  To: David Miller; +Cc: Jerry Chu, Linux Kernel Network Developers, Kernel Team

On Wed, Jun 22, 2016 at 12:23 PM, David Miller <davem@davemloft.net> wrote:
> From: Jerry Chu <hkchu@google.com>
> Date: Tue, 21 Jun 2016 20:42:19 -0700
>
>> I don't believe TOU will lead to a proliferation of TCP
>> implementations in the userland - getting a solid TCP implementation
>> is hard.
>
> The fear isn't doing legitimate things.
>
> It's making TCP stacks that do evil stuff on purpose.
>
> Also, making proprietary TCP stacks that override the kernel one.
>
There is no "kernel one"; there are many client kernels, many
different stacks on the Internet, many implementations of TCP. Some of
these are poorly engineered, years behind in technology, and otherwise
horribly insecure. There are even still people running Windows 95, for
heaven's sake! There is simply no way we can just implicitly trust
kernels to be doing the right thing. Neither is there any requirement
for us to do so; you will not find a requirement in any Internet
standard that TCP MUST be implemented in the kernel. The same
characteristics hold for middleboxes and firewalls: there are many
implementations, many don't follow standards, and the SW/FW upgrade
issue is potentially catastrophic to the Internet.  We have no
requirement for, and can never assume, a sufficiently robust firewall
in the path of our packets.

Bottom line: if you're developing a business-critical application on
the Internet, you cannot assume that the OSes or the network provide
adequate security; you need to take ownership of security for your
application. TOU is a step in that direction.

> And finally, here's the best part, all of the above can be done as a
> new, huge, potential attack vector for hackers.
>
I disagree that there is a new attack vector here. Yes, a malicious
program can open up an unconnected UDP socket and send to any
destination, although it should not be able to spoof the source
address easily. This is well known, and attacks against DNS and other
current uses of UDP already exist. But a major part of TOU is that the
transport layer header is encrypted, which makes it impossible to
inject packets into a TOU connection even if the program is able to
snoop packets on the wire. TOU is therefore protected against
injection attacks, unlike TCP, whose headers are in cleartext. So the
worst case attack should be a form of SYN attack, which we already
have to deal with for existing transport protocols.
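
To make that concrete, here is a minimal sketch of what an unprivileged
program can already do today with plain UDP; the destination address,
port and payload are placeholders and nothing here is TOU-specific:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port   = htons(7777),	/* placeholder port */
	};
	char payload[64] = { 0 };		/* arbitrary bytes */

	inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);

	/* unconnected, unprivileged; the source address is not spoofed */
	sendto(fd, payload, sizeof(payload), 0,
	       (struct sockaddr *)&dst, sizeof(dst));
	close(fd);
	return 0;
}

So the capability to spray UDP at arbitrary destinations exists with or
without TOU; what TOU adds is that its own connections can't have packets
injected into them.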

Thanks,
Tom

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
  2016-06-25  0:01                 ` Rick Jones
@ 2016-06-25 16:22                   ` Tom Herbert
  0 siblings, 0 replies; 42+ messages in thread
From: Tom Herbert @ 2016-06-25 16:22 UTC (permalink / raw)
  To: Rick Jones
  Cc: Richard Weinberger, David Miller,
	Linux Kernel Network Developers, Kernel Team

On Fri, Jun 24, 2016 at 5:01 PM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 06/24/2016 04:43 PM, Tom Herbert wrote:
>>
>> Here are Christoph's slides on TFO in the wild, which present a good
>> summary of the middlebox problem. There is one significant difference
>> in that ECN needs network support whereas TFO doesn't. Given that
>> experience, I'm doubtful other new features at L4 could ever be
>> productively used (like EDO or maybe TCP-ENO).
>>
>> https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
>
>
> Perhaps I am being overly optimistic, but my takeaway from those slides is
> that Apple was able to come up with ways to deal with the middleboxes and
> so could indeed productively use TCP Fast Open.
>
They do it by detecting that TFO packets are being dropped and then
falling back to not using TFO. Clients behind such a middlebox can't
use the feature.
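
Roughly, the client-side pattern amounts to something like the sketch
below. MSG_FASTOPEN is the existing Linux API for carrying data in the
SYN; the helper name, addresses and the 2-second deadline are made up
for illustration, and this is not Apple's actual implementation:

#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN	0x20000000
#endif

/* Hypothetical helper: send "req" to "daddr", trying to carry it in the
 * SYN via TFO first; if the handshake doesn't complete quickly (e.g. the
 * SYN+data is eaten by a middlebox), retry over a plain connect().
 * Most error handling is elided; this is only a sketch. */
static int send_req_tfo_fallback(const struct sockaddr_in *daddr,
				 const void *req, size_t req_len)
{
	int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
	ssize_t n = sendto(fd, req, req_len, MSG_FASTOPEN,
			   (const struct sockaddr *)daddr, sizeof(*daddr));
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };

	if ((n < 0 && errno != EINPROGRESS) || poll(&pfd, 1, 2000) != 1) {
		/* TFO unusable or SYN+data apparently blackholed within
		 * the (arbitrary) 2s deadline: fall back to ordinary TCP. */
		close(fd);
		fd = socket(AF_INET, SOCK_STREAM, 0);
		if (connect(fd, (const struct sockaddr *)daddr,
			    sizeof(*daddr)) < 0)
			return -1;
		n = -1;		/* request still needs to be sent below */
	}
	if (n < 0 && write(fd, req, req_len) < 0)
		return -1;
	return fd;
}

The point being: the feature degrades gracefully, but clients behind the
broken box never actually get the latency win.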

> "Overall, very good success-rate"
> though tempered by
> "But... middleboxes were a big issue in some ISPs..."
>
> Though it doesn't get into how big (some connections, many, most, all?) and
> how many ISPs.
>
Note that this is not just about TCP. TCP was never ordained to be the
one and only transport protocol on the Internet; in fact the intent of
the Internet architecture and a layered protocol stack was to
encourage innovation in protocol design, not to ossify to just one in
perpetuity. SCTP for instance has some very interesting features, like
sub-streams to avoid head-of-line blocking and a reliable message
model. But we can't even consider the possibility of deploying it for
our applications: firewalls and NAT boxes will likely drop it, and not
all major client OSes (e.g. Windows) even provide support for it.
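
(As an aside, asking for the sub-stream feature is about as simple as the
sketch below, using the lksctp headers; the stream counts are placeholders.
The problem is not the API, it's whether the packets survive the path:

#include <netinet/in.h>
#include <netinet/sctp.h>	/* from lksctp-tools */
#include <sys/socket.h>

/* Sketch: a one-to-one style SCTP socket negotiating several independent
 * streams, so one lost chunk doesn't head-of-line block the rest. */
static int sctp_multistream_socket(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
	struct sctp_initmsg init = {
		.sinit_num_ostreams  = 8,	/* placeholder counts */
		.sinit_max_instreams = 8,
	};

	setsockopt(fd, IPPROTO_SCTP, SCTP_INITMSG, &init, sizeof(init));
	/* connect() as usual, then pick a stream per request, e.g.
	 * sctp_sendmsg(fd, buf, len, NULL, 0, 0, 0, stream_no, 0, 0); */
	return fd;
}

None of that helps if the first middlebox drops protocol 132 on the floor.)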

The ECN, TFO, and SCTP issues are just anecdotes of problems that have
the same root cause: middleboxes intrude on transport protocols in
non-standard ways, making extensions or changes to transport protocols
difficult or infeasible to deploy. This is protocol ossification, and
IMO the solution is to eliminate middleboxes' visibility into L4.

Tom

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2016-06-25 16:22 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-16 17:51 [PATCH net-next 0/8] tou: Transports over UDP - part I Tom Herbert
2016-06-16 17:51 ` [PATCH net-next 1/8] net: Change SKB_GSO_DODGY to be a tx_flag Tom Herbert
2016-06-16 18:58   ` Alexander Duyck
2016-06-16 20:18     ` Tom Herbert
2016-06-16 20:33       ` Alexander Duyck
2016-06-17 22:33     ` Tom Herbert
2016-06-16 17:51 ` [PATCH net-next 2/8] fou: Change ip_tunnel_encap to take net argument Tom Herbert
2016-06-16 17:51 ` [PATCH net-next 3/8] tou: Base infrastructure for Transport over UDP Tom Herbert
2016-06-16 17:51 ` [PATCH net-next 4/8] ipv4: Support TOU Tom Herbert
2016-06-16 17:51 ` [PATCH net-next 5/8] tcp: Support for TOU Tom Herbert
2016-06-16 17:52 ` [PATCH net-next 6/8] ipv6: Support TOU Tom Herbert
2016-06-16 17:52 ` [PATCH net-next 7/8] tcp6: Support for TOU Tom Herbert
2016-06-16 17:52 ` [PATCH net-next 8/8] tou: Support for GSO Tom Herbert
2016-06-16 18:10 ` [PATCH net-next 0/8] tou: Transports over UDP - part I Rick Jones
2016-06-16 23:15 ` Hannes Frederic Sowa
2016-06-17 16:51   ` Tom Herbert
2016-06-21 16:56     ` Hannes Frederic Sowa
2016-06-18  3:09 ` David Miller
2016-06-18  3:52   ` Tom Herbert
2016-06-19 20:15     ` Hajime Tazaki
2016-06-20  3:07     ` David Miller
2016-06-20 15:13       ` Tom Herbert
2016-06-21  8:29         ` David Miller
2016-06-22  3:42           ` Jerry Chu
2016-06-22  4:06             ` David Ahern
2016-06-22 19:24               ` David Miller
2016-06-22 17:40             ` Tom Herbert
2016-06-22 19:23             ` David Miller
2016-06-25 15:56               ` Tom Herbert
2016-06-21 17:11     ` Hannes Frederic Sowa
2016-06-21 17:23       ` Tom Herbert
2016-06-22 22:15 ` Richard Weinberger
2016-06-22 22:56   ` Tom Herbert
2016-06-23  7:40   ` David Miller
2016-06-23  7:50     ` Richard Weinberger
2016-06-24 21:12       ` Tom Herbert
2016-06-24 21:36         ` Rick Jones
2016-06-24 21:46           ` Tom Herbert
2016-06-24 22:06             ` Rick Jones
2016-06-24 23:43               ` Tom Herbert
2016-06-25  0:01                 ` Rick Jones
2016-06-25 16:22                   ` Tom Herbert
