* [PATCH RFC net-next 00/11] udp gso
@ 2018-04-17 20:00 Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 01/11] udp: expose inet cork to udp Willem de Bruijn
                   ` (11 more replies)
  0 siblings, 12 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Segmentation offload reduces cycles/byte for large packets by
amortizing the cost of protocol stack traversal.

This patchset implements GSO for UDP. A process can concatenate and
submit multiple datagrams to the same destination in one send call
by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
or passing an analogous cmsg at send time.

The stack will send the entire large (up to network layer max size)
datagram through the protocol layer. At the GSO layer, it is broken
up into individual segments. All segments carry the same network layer
header and UDP source and destination ports. All but the last segment
have the same UDP header; the last may differ in length and checksum.

This initial patchset is an RFC. A few open items:

* MSG_MORE
  The feature requires UDP checksum offload, as without it the
  checksum + copy operation at send() time is likely cheaper than
  checksumming each segment in the GSO layer.

  UDP checksum offload is disabled with MSG_MORE. As a result, GSO
  only works in the lockless fast path.

  The patchset can be simplified by explicitly excluding MSG_MORE.
  For one, patch 1 can be dropped by passing ipcm to udp_send_skb
  instead of inet_cork.

* MSG_ZEROCOPY
  UDP zerocopy has been sent for review before. Completion
  notification cost exceeds the savings from copy avoidance for
  datagrams of regular MSS (< 1500B).

  UDP GSO enables building larger packets, at which point
  zerocopy becomes effective. Results with the current benchmark
  are not as great as from GSO itself, though that may say more
  about the benchmark. Either way, I do not intend to submit
  this separate feature as part of a final UDP GSO patchset.

* GSO_BY_FRAGS
  An alternative implementation that would allow non-uniform
  segment length is to use GSO_BY_FRAGS like SCTP. This would
  likely require MSG_MORE to build the list using multiple
  send calls (or one sendmmsg). The two approaches are not
  mutually-exclusive, so that could be a follow-up.

Initial results show a significant reduction in UDP cycles/byte.
See the main patch for more details and benchmark results.

        udp
          876 MB/s 14873 msg/s 624666 calls/s
            11,205,777,429      cycles
    
        udp gso
         2139 MB/s 36282 msg/s 36282 calls/s
            11,204,374,561      cycles


The patch set is broken down as follows:
- patch 1 is a prerequisite: code rearrangement, noop otherwise
- patch 2 is the core feature
- patch 3,4,6 are refinements
- patch 5 adds the cmsg interface
- patch 7 adds udp zerocopy
- patch 8..11 are tests

This idea was presented previously at netconf 2017-2
http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Known limitations:
  - The feature requires pacing and possibly a lower threshold on
    segment size to limit the number of segments that may be passed
    to the NIC at once.

  - Even when only accepting datagrams with CHECKSUM_PARTIAL, the
    segmentation layer must drop or fall back to software checksumming
    if the device cannot checksum the packet.
	    
    This can happen if a device advertises checksum offload in
    general, but removes it for this skb in .ndo_features_check.

Willem de Bruijn (11):
  udp: expose inet cork to udp
  udp: add gso
  udp: better wmem accounting on gso
  udp: paged allocation with gso
  udp: add gso segment cmsg
  udp: add gso support to virtual devices
  udp: zerocopy
  selftests: udp gso
  selftests: udp gso with connected sockets
  selftests: udp gso with corking
  selftests: udp gso benchmark

 include/linux/netdev_features.h               |   3 +
 include/linux/skbuff.h                        |  10 +
 include/linux/udp.h                           |   1 +
 include/net/inet_sock.h                       |   1 +
 include/net/ip.h                              |   3 +-
 include/net/ipv6.h                            |   2 +
 include/net/udp.h                             |   5 +
 include/uapi/linux/udp.h                      |   1 +
 net/core/skbuff.c                             |  14 +-
 net/core/sock.c                               |   5 +-
 net/ipv4/af_inet.c                            |   2 +-
 net/ipv4/ip_output.c                          |  63 +-
 net/ipv4/udp.c                                |  78 ++-
 net/ipv4/udp_offload.c                        |  63 ++
 net/ipv6/ip6_offload.c                        |   5 +-
 net/ipv6/ip6_output.c                         |  66 +-
 net/ipv6/udp.c                                |  29 +-
 net/ipv6/udp_offload.c                        |  14 +
 tools/testing/selftests/net/.gitignore        |   3 +
 tools/testing/selftests/net/Makefile          |   3 +-
 tools/testing/selftests/net/udpgso.c          | 621 ++++++++++++++++++
 tools/testing/selftests/net/udpgso.sh         |  31 +
 tools/testing/selftests/net/udpgso_bench.sh   |  74 +++
 tools/testing/selftests/net/udpgso_bench_rx.c | 265 ++++++++
 tools/testing/selftests/net/udpgso_bench_tx.c | 379 +++++++++++
 25 files changed, 1689 insertions(+), 52 deletions(-)
 create mode 100644 tools/testing/selftests/net/udpgso.c
 create mode 100755 tools/testing/selftests/net/udpgso.sh
 create mode 100755 tools/testing/selftests/net/udpgso_bench.sh
 create mode 100644 tools/testing/selftests/net/udpgso_bench_rx.c
 create mode 100644 tools/testing/selftests/net/udpgso_bench_tx.c

-- 
2.17.0.484.g0c8726318c-goog


* [PATCH RFC net-next 01/11] udp: expose inet cork to udp
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
@ 2018-04-17 20:00 ` Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 02/11] udp: add gso Willem de Bruijn
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

UDP segmentation offload needs access to inet_cork in the udp layer.
Pass the struct to ip(6)_make_skb instead of allocating it on the
stack in that function itself.

This patch is a noop otherwise.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/ip.h      |  2 +-
 include/net/ipv6.h    |  1 +
 net/ipv4/ip_output.c  | 17 ++++++++---------
 net/ipv4/udp.c        |  4 +++-
 net/ipv6/ip6_output.c | 20 ++++++++++----------
 net/ipv6/udp.c        |  3 ++-
 6 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index ecffd843e7b8..5a0d2b660506 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -171,7 +171,7 @@ struct sk_buff *ip_make_skb(struct sock *sk, struct flowi4 *fl4,
 					int len, int odd, struct sk_buff *skb),
 			    void *from, int length, int transhdrlen,
 			    struct ipcm_cookie *ipc, struct rtable **rtp,
-			    unsigned int flags);
+			    struct inet_cork *cork, unsigned int flags);
 
 static inline struct sk_buff *ip_finish_skb(struct sock *sk, struct flowi4 *fl4)
 {
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 68b167d98879..0dd722cab037 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -950,6 +950,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 			     void *from, int length, int transhdrlen,
 			     struct ipcm6_cookie *ipc6, struct flowi6 *fl6,
 			     struct rt6_info *rt, unsigned int flags,
+			     struct inet_cork_full *cork,
 			     const struct sockcm_cookie *sockc);
 
 static inline struct sk_buff *ip6_finish_skb(struct sock *sk)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4c11b810a447..83b14ea16654 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1468,9 +1468,8 @@ struct sk_buff *ip_make_skb(struct sock *sk,
 					int len, int odd, struct sk_buff *skb),
 			    void *from, int length, int transhdrlen,
 			    struct ipcm_cookie *ipc, struct rtable **rtp,
-			    unsigned int flags)
+			    struct inet_cork *cork, unsigned int flags)
 {
-	struct inet_cork cork;
 	struct sk_buff_head queue;
 	int err;
 
@@ -1479,22 +1478,22 @@ struct sk_buff *ip_make_skb(struct sock *sk,
 
 	__skb_queue_head_init(&queue);
 
-	cork.flags = 0;
-	cork.addr = 0;
-	cork.opt = NULL;
-	err = ip_setup_cork(sk, &cork, ipc, rtp);
+	cork->flags = 0;
+	cork->addr = 0;
+	cork->opt = NULL;
+	err = ip_setup_cork(sk, cork, ipc, rtp);
 	if (err)
 		return ERR_PTR(err);
 
-	err = __ip_append_data(sk, fl4, &queue, &cork,
+	err = __ip_append_data(sk, fl4, &queue, cork,
 			       &current->task_frag, getfrag,
 			       from, length, transhdrlen, flags);
 	if (err) {
-		__ip_flush_pending_frames(sk, &queue, &cork);
+		__ip_flush_pending_frames(sk, &queue, cork);
 		return ERR_PTR(err);
 	}
 
-	return __ip_make_skb(sk, fl4, &queue, &cork);
+	return __ip_make_skb(sk, fl4, &queue, cork);
 }
 
 /*
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 24b5c59b1c53..6b9d8017b319 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1030,9 +1030,11 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	/* Lockless fast path for the non-corking case. */
 	if (!corkreq) {
+		struct inet_cork cork;
+
 		skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
 				  sizeof(struct udphdr), &ipc, &rt,
-				  msg->msg_flags);
+				  &cork, msg->msg_flags);
 		err = PTR_ERR(skb);
 		if (!IS_ERR_OR_NULL(skb))
 			err = udp_send_skb(skb, fl4);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a39b04f9fa6e..6751f4c375b9 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1749,9 +1749,9 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 			     void *from, int length, int transhdrlen,
 			     struct ipcm6_cookie *ipc6, struct flowi6 *fl6,
 			     struct rt6_info *rt, unsigned int flags,
+			     struct inet_cork_full *cork,
 			     const struct sockcm_cookie *sockc)
 {
-	struct inet_cork_full cork;
 	struct inet6_cork v6_cork;
 	struct sk_buff_head queue;
 	int exthdrlen = (ipc6->opt ? ipc6->opt->opt_flen : 0);
@@ -1762,27 +1762,27 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 
 	__skb_queue_head_init(&queue);
 
-	cork.base.flags = 0;
-	cork.base.addr = 0;
-	cork.base.opt = NULL;
-	cork.base.dst = NULL;
+	cork->base.flags = 0;
+	cork->base.addr = 0;
+	cork->base.opt = NULL;
+	cork->base.dst = NULL;
 	v6_cork.opt = NULL;
-	err = ip6_setup_cork(sk, &cork, &v6_cork, ipc6, rt, fl6);
+	err = ip6_setup_cork(sk, cork, &v6_cork, ipc6, rt, fl6);
 	if (err) {
-		ip6_cork_release(&cork, &v6_cork);
+		ip6_cork_release(cork, &v6_cork);
 		return ERR_PTR(err);
 	}
 	if (ipc6->dontfrag < 0)
 		ipc6->dontfrag = inet6_sk(sk)->dontfrag;
 
-	err = __ip6_append_data(sk, fl6, &queue, &cork.base, &v6_cork,
+	err = __ip6_append_data(sk, fl6, &queue, &cork->base, &v6_cork,
 				&current->task_frag, getfrag, from,
 				length + exthdrlen, transhdrlen + exthdrlen,
 				flags, ipc6, sockc);
 	if (err) {
-		__ip6_flush_pending_frames(sk, &queue, &cork, &v6_cork);
+		__ip6_flush_pending_frames(sk, &queue, cork, &v6_cork);
 		return ERR_PTR(err);
 	}
 
-	return __ip6_make_skb(sk, &queue, &cork, &v6_cork);
+	return __ip6_make_skb(sk, &queue, cork, &v6_cork);
 }
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4ec76a87aeb8..824797f8d1ab 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1324,12 +1324,13 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	/* Lockless fast path for the non-corking case */
 	if (!corkreq) {
+		struct inet_cork_full cork;
 		struct sk_buff *skb;
 
 		skb = ip6_make_skb(sk, getfrag, msg, ulen,
 				   sizeof(struct udphdr), &ipc6,
 				   &fl6, (struct rt6_info *)dst,
-				   msg->msg_flags, &sockc);
+				   msg->msg_flags, &cork, &sockc);
 		err = PTR_ERR(skb);
 		if (!IS_ERR_OR_NULL(skb))
 			err = udp_v6_send_skb(skb, &fl6);
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH RFC net-next 02/11] udp: add gso
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 01/11] udp: expose inet cork to udp Willem de Bruijn
@ 2018-04-17 20:00 ` Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 03/11] udp: better wmem accounting on gso Willem de Bruijn
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Implement generic segmentation offload for udp datagrams. Callers can
concatenate the payload of multiple datagrams with the same destination
and submit it in a single send call.

To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.

A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.

The total byte length may then exceed the MTU. If it is not an exact
multiple of the segment size, the last segment will be shorter.

UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
a large payload into a number of discrete UDP datagrams.

The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
from UFO (SKB_UDP_GSO). It adds a gso_size field to the udp socket,
ip(v6) cmsg cookie and inet_cork structure to be able to set the value
at setsockopt or cmsg time and to work with both lockless and corked
paths. A lockless-only cmsg-only patch would be significantly shorter.

The feature requires udp checksum offload to avoid checksumming in the
GSO layer. This is disabled with MSG_MORE as of commit d749c9cbffd6
("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets") and commit
682b1a9d3f96 ("ipv6: no CHECKSUM_PARTIAL on MSG_MORE corked sockets"),
so only the lockless fast path works. IPPROTO_UDPLITE is excluded as
well, as that protocol has no gso handler registered.

Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.

    tcp tso
     3197 MB/s 54232 msg/s 54232 calls/s
         6,457,754,262      cycles

    tcp gso
     1765 MB/s 29939 msg/s 29939 calls/s
        11,203,021,806      cycles

    tcp without tso/gso *
      739 MB/s 12548 msg/s 12548 calls/s
        11,205,483,630      cycles

    udp
      876 MB/s 14873 msg/s 624666 calls/s
        11,205,777,429      cycles

    udp gso
     2139 MB/s 36282 msg/s 36282 calls/s
        11,204,374,561      cycles

   [*] after reverting commit 0a6b2a1dc2a2
       ("tcp: switch to GSO being always on")

Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:

  perf stat -a -C 12 -e cycles \
    ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4

Note the reduction in calls/s with GSO. Bytes per syscall increases
from 1470 to 61818.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h   |  9 ++++++++
 include/linux/udp.h      |  1 +
 include/net/inet_sock.h  |  1 +
 include/net/ip.h         |  1 +
 include/net/ipv6.h       |  1 +
 include/net/udp.h        |  4 ++++
 include/uapi/linux/udp.h |  1 +
 net/core/skbuff.c        |  2 ++
 net/ipv4/af_inet.c       |  2 +-
 net/ipv4/ip_output.c     |  7 ++++--
 net/ipv4/udp.c           | 31 +++++++++++++++++++++++---
 net/ipv4/udp_offload.c   | 47 ++++++++++++++++++++++++++++++++++++++++
 net/ipv6/ip6_offload.c   |  5 +++--
 net/ipv6/ip6_output.c    |  4 +++-
 net/ipv6/udp.c           | 21 +++++++++++++++---
 net/ipv6/udp_offload.c   | 14 ++++++++++++
 16 files changed, 139 insertions(+), 12 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9065477ed255..6850643508c1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -573,6 +573,8 @@ enum {
 	SKB_GSO_ESP = 1 << 15,
 
 	SKB_GSO_UDP = 1 << 16,
+
+	SKB_GSO_UDP_L4 = 1 << 17,
 };
 
 #if BITS_PER_LONG > 32
@@ -4047,6 +4049,13 @@ static inline bool skb_is_gso_sctp(const struct sk_buff *skb)
 	return skb_shinfo(skb)->gso_type & SKB_GSO_SCTP;
 }
 
+static inline bool skb_is_ufo(const struct sk_buff *skb)
+{
+	const unsigned int gso_type = skb_shinfo(skb)->gso_type;
+
+	return (gso_type & (SKB_GSO_UDP | SKB_GSO_UDP_L4)) == SKB_GSO_UDP;
+}
+
 static inline void skb_gso_reset(struct sk_buff *skb)
 {
 	skb_shinfo(skb)->gso_size = 0;
diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..f7184052ed32 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -55,6 +55,7 @@ struct udp_sock {
 	 * when the socket is uncorked.
 	 */
 	__u16		 len;		/* total length of pending frames */
+	__u16		 gso_size;
 	/*
 	 * Fields specific to UDP-Lite.
 	 */
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 0a671c32d6b9..83d5b3c2ac42 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -147,6 +147,7 @@ struct inet_cork {
 	__u8			ttl;
 	__s16			tos;
 	char			priority;
+	__u16			gso_size;
 };
 
 struct inet_cork_full {
diff --git a/include/net/ip.h b/include/net/ip.h
index 5a0d2b660506..1190d10fb9b2 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -76,6 +76,7 @@ struct ipcm_cookie {
 	__u8			ttl;
 	__s16			tos;
 	char			priority;
+	__u16			gso_size;
 };
 
 #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 0dd722cab037..0a872a7c33c8 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -298,6 +298,7 @@ struct ipcm6_cookie {
 	__s16 tclass;
 	__s8  dontfrag;
 	struct ipv6_txoptions *opt;
+	__u16 gso_size;
 };
 
 static inline struct ipv6_txoptions *txopt_get(const struct ipv6_pinfo *np)
diff --git a/include/net/udp.h b/include/net/udp.h
index 0676b272f6ac..741d888d0fdb 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -174,6 +174,10 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
 				 struct udphdr *uh, udp_lookup_t lookup);
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
 
+struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
+				  netdev_features_t features,
+				  unsigned int mss, __sum16 check);
+
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
 	struct udphdr *uh;
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index efb7b5991c2f..09d00f8c442b 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -32,6 +32,7 @@ struct udphdr {
 #define UDP_ENCAP	100	/* Set the socket to accept encapsulated packets */
 #define UDP_NO_CHECK6_TX 101	/* Disable sending checksum for UDP6X */
 #define UDP_NO_CHECK6_RX 102	/* Disable accpeting checksum for UDP6 */
+#define UDP_SEGMENT	103	/* Set GSO segmentation size */
 
 /* UDP encapsulation types */
 #define UDP_ENCAP_ESPINUDP_NON_IKE	1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b51837ca8..3eed21f64e0b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4926,6 +4926,8 @@ static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
 		thlen = tcp_hdrlen(skb);
 	} else if (unlikely(skb_is_gso_sctp(skb))) {
 		thlen = sizeof(struct sctphdr);
+	} else if (shinfo->gso_type & SKB_GSO_UDP_L4) {
+		thlen = sizeof(struct udphdr);
 	}
 	/* UFO sets gso_size to the size of the fragmentation
 	 * payload, i.e. the size of the L4 (UDP) header is already
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 3ebf599cebae..d6c3cd02ca2a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1327,7 +1327,7 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 	segs = ERR_PTR(-EPROTONOSUPPORT);
 
 	if (!skb->encapsulation || encap) {
-		udpfrag = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP);
+		udpfrag = skb_is_ufo(skb);
 		fixedid = !!(skb_shinfo(skb)->gso_type & SKB_GSO_TCP_FIXEDID);
 
 		/* fixed ID is invalid if DF bit is not set */
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 83b14ea16654..7abfb24ec5e5 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -882,7 +882,8 @@ static int __ip_append_data(struct sock *sk,
 	skb = skb_peek_tail(queue);
 
 	exthdrlen = !skb ? rt->dst.header_len : 0;
-	mtu = cork->fragsize;
+	mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
+
 	if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP &&
 	    sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)
 		tskey = sk->sk_tskey++;
@@ -1133,6 +1134,8 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 	*rtp = NULL;
 	cork->fragsize = ip_sk_use_pmtu(sk) ?
 			 dst_mtu(&rt->dst) : rt->dst.dev->mtu;
+
+	cork->gso_size = sk->sk_type == SOCK_DGRAM ? ipc->gso_size : 0;
 	cork->dst = &rt->dst;
 	cork->length = 0;
 	cork->ttl = ipc->ttl;
@@ -1212,7 +1215,7 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		return -EOPNOTSUPP;
 
 	hh_len = LL_RESERVED_SPACE(rt->dst.dev);
-	mtu = cork->fragsize;
+	mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
 
 	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
 	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6b9d8017b319..731772a69043 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -757,7 +757,8 @@ void udp_set_csum(bool nocheck, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(udp_set_csum);
 
-static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4)
+static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
+			struct inet_cork *cork)
 {
 	struct sock *sk = skb->sk;
 	struct inet_sock *inet = inet_sk(sk);
@@ -777,6 +778,19 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4)
 	uh->len = htons(len);
 	uh->check = 0;
 
+	if (cork->gso_size) {
+		const int hlen = skb_network_header_len(skb) +
+				 sizeof(struct udphdr);
+
+		if (hlen + cork->gso_size > cork->fragsize)
+			return -EINVAL;
+		if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite)
+			return -EIO;
+
+		skb_shinfo(skb)->gso_size = cork->gso_size;
+		skb_shinfo(skb)->gso_type = SKB_GSO_UDP | SKB_GSO_UDP_L4;
+	}
+
 	if (is_udplite)  				 /*     UDP-Lite      */
 		csum = udplite_csum(skb);
 
@@ -828,7 +842,7 @@ int udp_push_pending_frames(struct sock *sk)
 	if (!skb)
 		goto out;
 
-	err = udp_send_skb(skb, fl4);
+	err = udp_send_skb(skb, fl4, &inet->cork.base);
 
 out:
 	up->len = 0;
@@ -922,6 +936,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipc.sockc.tsflags = sk->sk_tsflags;
 	ipc.addr = inet->inet_saddr;
 	ipc.oif = sk->sk_bound_dev_if;
+	ipc.gso_size = up->gso_size;
 
 	if (msg->msg_controllen) {
 		err = ip_cmsg_send(sk, msg, &ipc, sk->sk_family == AF_INET6);
@@ -1037,7 +1052,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 				  &cork, msg->msg_flags);
 		err = PTR_ERR(skb);
 		if (!IS_ERR_OR_NULL(skb))
-			err = udp_send_skb(skb, fl4);
+			err = udp_send_skb(skb, fl4, &cork);
 		goto out;
 	}
 
@@ -2367,6 +2382,12 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		up->no_check6_rx = valbool;
 		break;
 
+	case UDP_SEGMENT:
+		if (val < 0 || val > USHRT_MAX)
+			return -EINVAL;
+		up->gso_size = val;
+		break;
+
 	/*
 	 * 	UDP-Lite's partial checksum coverage (RFC 3828).
 	 */
@@ -2457,6 +2478,10 @@ int udp_lib_getsockopt(struct sock *sk, int level, int optname,
 		val = up->no_check6_rx;
 		break;
 
+	case UDP_SEGMENT:
+		val = up->gso_size;
+		break;
+
 	/* The following two cannot be changed on UDP sockets, the return is
 	 * always 0 (which corresponds to the full checksum coverage of UDP). */
 	case UDPLITE_SEND_CSCOV:
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index ea6e6e7df0ee..3378f8dec9c7 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -187,6 +187,50 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(skb_udp_tunnel_segment);
 
+struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
+				  netdev_features_t features,
+				  unsigned int mss, __sum16 check)
+{
+	struct udphdr *uh = udp_hdr(gso_skb);
+	struct sk_buff *segs;
+	unsigned int hdrlen;
+
+	if (gso_skb->len <= sizeof(*uh) + mss)
+		return ERR_PTR(-EINVAL);
+
+	uh->len = htons(sizeof(*uh) + mss);
+	uh->check = check;
+	skb_pull(gso_skb, sizeof(*uh));
+	hdrlen = gso_skb->data - skb_mac_header(gso_skb);
+
+	segs = skb_segment(gso_skb, features);
+	if (unlikely(IS_ERR_OR_NULL(segs)))
+		return segs;
+
+	/* If last packet is not full, fix up its header */
+	if (segs->prev->len != hdrlen + mss) {
+		unsigned int mss_last = segs->prev->len - hdrlen;
+
+		uh = udp_hdr(segs->prev);
+		uh->len = htons(sizeof(*uh) + mss_last);
+		csum_replace2(&uh->check, htons(mss), htons(mss_last));
+	}
+
+	return segs;
+}
+EXPORT_SYMBOL_GPL(__udp_gso_segment);
+
+static struct sk_buff *__udp4_gso_segment(struct sk_buff *gso_skb,
+					  netdev_features_t features)
+{
+	const struct iphdr *iph = ip_hdr(gso_skb);
+	unsigned int mss = skb_shinfo(gso_skb)->gso_size;
+
+	return __udp_gso_segment(gso_skb, features, mss,
+				 udp_v4_check(sizeof(struct udphdr) + mss,
+					      iph->saddr, iph->daddr, 0));
+}
+
 static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 					 netdev_features_t features)
 {
@@ -209,6 +253,9 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
 		goto out;
 
+	if (!skb_is_ufo(skb))
+		return __udp4_gso_segment(skb, features);
+
 	mss = skb_shinfo(skb)->gso_size;
 	if (unlikely(skb->len <= mss))
 		goto out;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 4a87f9428ca5..e59f67c0bf89 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -88,9 +88,10 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 
 	if (skb->encapsulation &&
 	    skb_shinfo(skb)->gso_type & (SKB_GSO_IPXIP4 | SKB_GSO_IPXIP6))
-		udpfrag = proto == IPPROTO_UDP && encap;
+		udpfrag = proto == IPPROTO_UDP && encap && skb_is_ufo(skb);
 	else
-		udpfrag = proto == IPPROTO_UDP && !skb->encapsulation;
+		udpfrag = proto == IPPROTO_UDP && !skb->encapsulation &&
+			  skb_is_ufo(skb);
 
 	ops = rcu_dereference(inet6_offloads[proto]);
 	if (likely(ops && ops->callbacks.gso_segment)) {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6751f4c375b9..3ce947c1d173 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1234,6 +1234,8 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
 	if (mtu < IPV6_MIN_MTU)
 		return -EINVAL;
 	cork->base.fragsize = mtu;
+	cork->base.gso_size = sk->sk_type == SOCK_DGRAM ? ipc6->gso_size : 0;
+
 	if (dst_allfrag(xfrm_dst_path(&rt->dst)))
 		cork->base.flags |= IPCORK_ALLFRAG;
 	cork->base.length = 0;
@@ -1275,7 +1277,7 @@ static int __ip6_append_data(struct sock *sk,
 		dst_exthdrlen = rt->dst.header_len - rt->rt6i_nfheader_len;
 	}
 
-	mtu = cork->fragsize;
+	mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize;
 	orig_mtu = mtu;
 
 	hh_len = LL_RESERVED_SPACE(rt->dst.dev);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 824797f8d1ab..16aee23b1e5f 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1023,7 +1023,8 @@ static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
  *	Sending
  */
 
-static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6)
+static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6,
+			   struct inet_cork *cork)
 {
 	struct sock *sk = skb->sk;
 	struct udphdr *uh;
@@ -1042,6 +1043,19 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6)
 	uh->len = htons(len);
 	uh->check = 0;
 
+	if (cork->gso_size) {
+		const int hlen = skb_network_header_len(skb) +
+				 sizeof(struct udphdr);
+
+		if (hlen + cork->gso_size > cork->fragsize)
+			return -EINVAL;
+		if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite)
+			return -EIO;
+
+		skb_shinfo(skb)->gso_size = cork->gso_size;
+		skb_shinfo(skb)->gso_type = SKB_GSO_UDP | SKB_GSO_UDP_L4;
+	}
+
 	if (is_udplite)
 		csum = udplite_csum(skb);
 	else if (udp_sk(sk)->no_check6_tx) {   /* UDP csum disabled */
@@ -1093,7 +1107,7 @@ static int udp_v6_push_pending_frames(struct sock *sk)
 	if (!skb)
 		goto out;
 
-	err = udp_v6_send_skb(skb, &fl6);
+	err = udp_v6_send_skb(skb, &fl6, &inet_sk(sk)->cork.base);
 
 out:
 	up->len = 0;
@@ -1127,6 +1141,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipc6.hlimit = -1;
 	ipc6.tclass = -1;
 	ipc6.dontfrag = -1;
+	ipc6.gso_size = up->gso_size;
 	sockc.tsflags = sk->sk_tsflags;
 
 	/* destination address check */
@@ -1333,7 +1348,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 				   msg->msg_flags, &cork, &sockc);
 		err = PTR_ERR(skb);
 		if (!IS_ERR_OR_NULL(skb))
-			err = udp_v6_send_skb(skb, &fl6);
+			err = udp_v6_send_skb(skb, &fl6, &cork.base);
 		goto out;
 	}
 
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 2a04dc9c781b..7410120fc114 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -17,6 +17,17 @@
 #include <net/ip6_checksum.h>
 #include "ip6_offload.h"
 
+static struct sk_buff *__udp6_gso_segment(struct sk_buff *gso_skb,
+					  netdev_features_t features)
+{
+	const struct ipv6hdr *ip6h = ipv6_hdr(gso_skb);
+	unsigned int mss = skb_shinfo(gso_skb)->gso_size;
+
+	return __udp_gso_segment(gso_skb, features, mss,
+				 udp_v6_check(sizeof(struct udphdr) + mss,
+					      &ip6h->saddr, &ip6h->daddr, 0));
+}
+
 static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 					 netdev_features_t features)
 {
@@ -48,6 +59,9 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 		if (!pskb_may_pull(skb, sizeof(struct udphdr)))
 			goto out;
 
+		if (!skb_is_ufo(skb))
+			return __udp6_gso_segment(skb, features);
+
 		/* Do software UFO. Complete and fill in the UDP checksum as HW cannot
 		 * do checksum of UDP packets sent as multiple IP fragments.
 		 */
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH RFC net-next 03/11] udp: better wmem accounting on gso
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 01/11] udp: expose inet cork to udp Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 02/11] udp: add gso Willem de Bruijn
@ 2018-04-17 20:00 ` Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 04/11] udp: paged allocation with gso Willem de Bruijn
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

skb_segment by default transfers allocated wmem from the gso skb
to the tail of the segment list. This underreports real truesize
of the list, especially if the tail might be dropped.

Similar to tcp_gso_segment, update wmem_alloc with the aggregate
list truesize and make each segment responsible for its own
share by setting skb->destructor.

Clear gso_skb->destructor prior to calling skb_segment to skip
the default assignment to tail.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/udp_offload.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 3378f8dec9c7..a69117e25e78 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -192,7 +192,9 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 				  unsigned int mss, __sum16 check)
 {
 	struct udphdr *uh = udp_hdr(gso_skb);
-	struct sk_buff *segs;
+	struct sock *sk = gso_skb->sk;
+	struct sk_buff *segs, *seg;
+	unsigned int sum_truesize = 0;
 	unsigned int hdrlen;
 
 	if (gso_skb->len <= sizeof(*uh) + mss)
@@ -203,9 +205,23 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	skb_pull(gso_skb, sizeof(*uh));
 	hdrlen = gso_skb->data - skb_mac_header(gso_skb);
 
+	/* clear destructor to avoid skb_segment assigning it to tail */
+	WARN_ON_ONCE(gso_skb->destructor != sock_wfree);
+	gso_skb->destructor = NULL;
+
 	segs = skb_segment(gso_skb, features);
-	if (unlikely(IS_ERR_OR_NULL(segs)))
+	if (unlikely(IS_ERR_OR_NULL(segs))) {
+		gso_skb->destructor = sock_wfree;
 		return segs;
+	}
+
+	for (seg = segs; seg; seg = seg->next) {
+		seg->destructor = sock_wfree;
+		seg->sk = sk;
+		sum_truesize += seg->truesize;
+	}
+
+	refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
 
 	/* If last packet is not full, fix up its header */
 	if (segs->prev->len != hdrlen + mss) {
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH RFC net-next 04/11] udp: paged allocation with gso
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
                   ` (2 preceding siblings ...)
  2018-04-17 20:00 ` [PATCH RFC net-next 03/11] udp: better wmem accounting on gso Willem de Bruijn
@ 2018-04-17 20:00 ` Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 05/11] udp: add gso segment cmsg Willem de Bruijn
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

When sending large datagrams that are later segmented, store the data
in page frags to avoid copying out of the linear area in skb_segment.

This logic will also be used by zerocopy.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/ip_output.c  | 15 +++++++++++----
 net/ipv6/ip6_output.c | 19 ++++++++++++++-----
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7abfb24ec5e5..9ccd6c28e420 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -878,11 +878,13 @@ static int __ip_append_data(struct sock *sk,
 	struct rtable *rt = (struct rtable *)cork->dst;
 	unsigned int wmem_alloc_delta = 0;
 	u32 tskey = 0;
+	bool paged;
 
 	skb = skb_peek_tail(queue);
 
 	exthdrlen = !skb ? rt->dst.header_len : 0;
 	mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
+	paged = !!cork->gso_size;
 
 	if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP &&
 	    sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)
@@ -934,6 +936,7 @@ static int __ip_append_data(struct sock *sk,
 			unsigned int fraglen;
 			unsigned int fraggap;
 			unsigned int alloclen;
+			unsigned int pagedlen = 0;
 			struct sk_buff *skb_prev;
 alloc_new_skb:
 			skb_prev = skb;
@@ -954,8 +957,12 @@ static int __ip_append_data(struct sock *sk,
 			if ((flags & MSG_MORE) &&
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
-			else
+			else if (!paged)
 				alloclen = fraglen;
+			else {
+				alloclen = min_t(int, fraglen, MAX_HEADER);
+				pagedlen = fraglen - alloclen;
+			}
 
 			alloclen += exthdrlen;
 
@@ -999,7 +1006,7 @@ static int __ip_append_data(struct sock *sk,
 			/*
 			 *	Find where to start putting bytes.
 			 */
-			data = skb_put(skb, fraglen + exthdrlen);
+			data = skb_put(skb, fraglen + exthdrlen - pagedlen);
 			skb_set_network_header(skb, exthdrlen);
 			skb->transport_header = (skb->network_header +
 						 fragheaderlen);
@@ -1015,7 +1022,7 @@ static int __ip_append_data(struct sock *sk,
 				pskb_trim_unique(skb_prev, maxfraglen);
 			}
 
-			copy = datalen - transhdrlen - fraggap;
+			copy = datalen - transhdrlen - fraggap - pagedlen;
 			if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
 				err = -EFAULT;
 				kfree_skb(skb);
@@ -1023,7 +1030,7 @@ static int __ip_append_data(struct sock *sk,
 			}
 
 			offset += copy;
-			length -= datalen - fraggap;
+			length -= copy + transhdrlen;
 			transhdrlen = 0;
 			exthdrlen = 0;
 			csummode = CHECKSUM_NONE;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3ce947c1d173..9fbcec4fb946 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1270,6 +1270,7 @@ static int __ip6_append_data(struct sock *sk,
 	int csummode = CHECKSUM_NONE;
 	unsigned int maxnonfragsize, headersize;
 	unsigned int wmem_alloc_delta = 0;
+	bool paged;
 
 	skb = skb_peek_tail(queue);
 	if (!skb) {
@@ -1277,6 +1278,7 @@ static int __ip6_append_data(struct sock *sk,
 		dst_exthdrlen = rt->dst.header_len - rt->rt6i_nfheader_len;
 	}
 
+	paged = !!cork->gso_size;
 	mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize;
 	orig_mtu = mtu;
 
@@ -1368,6 +1370,7 @@ static int __ip6_append_data(struct sock *sk,
 			unsigned int fraglen;
 			unsigned int fraggap;
 			unsigned int alloclen;
+			unsigned int pagedlen = 0;
 alloc_new_skb:
 			/* There's no room in the current skb */
 			if (skb)
@@ -1390,11 +1393,17 @@ static int __ip6_append_data(struct sock *sk,
 
 			if (datalen > (cork->length <= mtu && !(cork->flags & IPCORK_ALLFRAG) ? mtu : maxfraglen) - fragheaderlen)
 				datalen = maxfraglen - fragheaderlen - rt->dst.trailer_len;
+			fraglen = datalen + fragheaderlen;
+
 			if ((flags & MSG_MORE) &&
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
-			else
-				alloclen = datalen + fragheaderlen;
+			else if (!paged)
+				alloclen = fraglen;
+			else {
+				alloclen = min_t(int, fraglen, MAX_HEADER);
+				pagedlen = fraglen - alloclen;
+			}
 
 			alloclen += dst_exthdrlen;
 
@@ -1416,7 +1425,7 @@ static int __ip6_append_data(struct sock *sk,
 			 */
 			alloclen += sizeof(struct frag_hdr);
 
-			copy = datalen - transhdrlen - fraggap;
+			copy = datalen - transhdrlen - fraggap - pagedlen;
 			if (copy < 0) {
 				err = -EINVAL;
 				goto error;
@@ -1455,7 +1464,7 @@ static int __ip6_append_data(struct sock *sk,
 			/*
 			 *	Find where to start putting bytes
 			 */
-			data = skb_put(skb, fraglen);
+			data = skb_put(skb, fraglen - pagedlen);
 			skb_set_network_header(skb, exthdrlen);
 			data += fragheaderlen;
 			skb->transport_header = (skb->network_header +
@@ -1478,7 +1487,7 @@ static int __ip6_append_data(struct sock *sk,
 			}
 
 			offset += copy;
-			length -= datalen - fraggap;
+			length -= copy + transhdrlen;
 			transhdrlen = 0;
 			exthdrlen = 0;
 			dst_exthdrlen = 0;
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH RFC net-next 05/11] udp: add gso segment cmsg
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
                   ` (3 preceding siblings ...)
  2018-04-17 20:00 ` [PATCH RFC net-next 04/11] udp: paged allocation with gso Willem de Bruijn
@ 2018-04-17 20:00 ` Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 06/11] udp: add gso support to virtual devices Willem de Bruijn
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Allow specifying the segment size in the send call.

The new control message performs the same function as socket option
UDP_SEGMENT while avoiding the extra system call.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/udp.h |  1 +
 net/ipv4/udp.c    | 43 +++++++++++++++++++++++++++++++++++++++++--
 net/ipv6/udp.c    |  5 ++++-
 3 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 741d888d0fdb..05990746810e 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -273,6 +273,7 @@ int udp_abort(struct sock *sk, int err);
 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
 int udp_push_pending_frames(struct sock *sk);
 void udp_flush_pending_frames(struct sock *sk);
+int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
 void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 731772a69043..2bad61a05c5e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -851,6 +851,42 @@ int udp_push_pending_frames(struct sock *sk)
 }
 EXPORT_SYMBOL(udp_push_pending_frames);
 
+static int __udp_cmsg_send(struct cmsghdr *cmsg, u16 *gso_size)
+{
+	switch (cmsg->cmsg_type) {
+	case UDP_SEGMENT:
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(__u16)))
+			return -EINVAL;
+		*gso_size = *(__u16 *)CMSG_DATA(cmsg);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
+int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size)
+{
+	struct cmsghdr *cmsg;
+	bool need_ip = false;
+	int err;
+
+	for_each_cmsghdr(cmsg, msg) {
+		if (!CMSG_OK(msg, cmsg))
+			return -EINVAL;
+
+		if (cmsg->cmsg_level != SOL_UDP) {
+			need_ip = true;
+			continue;
+		}
+
+		err = __udp_cmsg_send(cmsg, gso_size);
+		if (err)
+			return err;
+	}
+
+	return need_ip;
+}
+
 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -939,8 +975,11 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipc.gso_size = up->gso_size;
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(sk, msg, &ipc, sk->sk_family == AF_INET6);
-		if (unlikely(err)) {
+		err = udp_cmsg_send(sk, msg, &ipc.gso_size);
+		if (err > 0)
+			err = ip_cmsg_send(sk, msg, &ipc,
+					   sk->sk_family == AF_INET6);
+		if (unlikely(err < 0)) {
 			kfree(ipc.opt);
 			return err;
 		}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 16aee23b1e5f..c8b3289bdbd0 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1274,7 +1274,10 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		opt->tot_len = sizeof(*opt);
 		ipc6.opt = opt;
 
-		err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, &ipc6, &sockc);
+		err = udp_cmsg_send(sk, msg, &ipc6.gso_size);
+		if (err > 0)
+			err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6,
+						    &ipc6, &sockc);
 		if (err < 0) {
 			fl6_sock_release(flowlabel);
 			return err;
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH RFC net-next 06/11] udp: add gso support to virtual devices
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
                   ` (4 preceding siblings ...)
  2018-04-17 20:00 ` [PATCH RFC net-next 05/11] udp: add gso segment cmsg Willem de Bruijn
@ 2018-04-17 20:00 ` Willem de Bruijn
  2018-04-18  0:43   ` Dimitris Michailidis
  2018-04-17 20:00 ` [PATCH RFC net-next 07/11] udp: zerocopy Willem de Bruijn
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Virtual devices such as tunnels and bonding can handle large packets.
Only segment packets when reaching a physical or loopback device.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/netdev_features.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 35b79f47a13d..1e4883bb02a7 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -80,6 +80,7 @@ enum {
 
 	NETIF_F_GRO_HW_BIT,		/* Hardware Generic receive offload */
 	NETIF_F_HW_TLS_RECORD_BIT,	/* Offload TLS record */
+	NETIF_F_GSO_UDP_L4_BIT,		/* UDP payload GSO (not UFO) */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -147,6 +148,7 @@ enum {
 #define NETIF_F_HW_ESP_TX_CSUM	__NETIF_F(HW_ESP_TX_CSUM)
 #define	NETIF_F_RX_UDP_TUNNEL_PORT  __NETIF_F(RX_UDP_TUNNEL_PORT)
 #define NETIF_F_HW_TLS_RECORD	__NETIF_F(HW_TLS_RECORD)
+#define NETIF_F_GSO_UDP_L4	__NETIF_F(GSO_UDP_L4)
 
 #define for_each_netdev_feature(mask_addr, bit)	\
 	for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
@@ -216,6 +218,7 @@ enum {
 				 NETIF_F_GSO_GRE_CSUM |			\
 				 NETIF_F_GSO_IPXIP4 |			\
 				 NETIF_F_GSO_IPXIP6 |			\
+				 NETIF_F_GSO_UDP_L4 |			\
 				 NETIF_F_GSO_UDP_TUNNEL |		\
 				 NETIF_F_GSO_UDP_TUNNEL_CSUM)
 
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH RFC net-next 07/11] udp: zerocopy
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
                   ` (5 preceding siblings ...)
  2018-04-17 20:00 ` [PATCH RFC net-next 06/11] udp: add gso support to virtual devices Willem de Bruijn
@ 2018-04-17 20:00 ` Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 08/11] selftests: udp gso Willem de Bruijn
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweighs the cost of
generating a completion notification.

The datagram implementation uses an initial refcnt of 0, instead of 1
as for TCP. The tcp_sendmsg_locked function has to hold a reference on
the uarg independent of those held by the skbs it generates, because
those skbs can be sent and freed in the function's main loop. This is
not needed for other sockets.

The benefit depends on the fraction of cycles spent copying. For an
initial udp gso benchmark that spends 13% of (systemwide) cycles in
copy_user_fast_string, the cycle savings of zerocopy are proportionate:

    udp gso
     2139 MB/s 36282 msg/s 36282 calls/s
        11,204,374,561      cycles

    udp gso zerocopy
     2394 MB/s 40608 msg/s 40608 calls/s
        11,205,017,927      cycles

This is likely not the best demonstrator benchmark.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 12 +++++++++++-
 net/core/sock.c        |  5 ++++-
 net/ipv4/ip_output.c   | 24 ++++++++++++++++++++++--
 net/ipv6/ip6_output.c  | 23 ++++++++++++++++++++++-
 5 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6850643508c1..756206bc8eca 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -483,6 +483,7 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len);
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
 			     struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3eed21f64e0b..92df6db2c851 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -959,7 +959,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg->len = 1;
 	uarg->bytelen = size;
 	uarg->zerocopy = 1;
-	refcount_set(&uarg->refcnt, 1);
+	refcount_set(&uarg->refcnt, sk->sk_type == SOCK_STREAM ? 1 : 0);
 	sock_hold(sk);
 
 	return uarg;
@@ -1099,6 +1099,10 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
 		atomic_dec(&sk->sk_zckey);
 		uarg->len--;
 
+		/* datagram does not hold an extra ref for the syscall itself */
+		if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt))
+			refcount_set(&uarg->refcnt, 1);
+
 		sock_zerocopy_put(uarg);
 	}
 }
@@ -1107,6 +1111,12 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
 extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 				   struct iov_iter *from, size_t length);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len)
+{
+	return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter_dgram);
+
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
 			     struct ubuf_info *uarg)
diff --git a/net/core/sock.c b/net/core/sock.c
index b2c3db169ca1..1480d7d92294 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1053,7 +1053,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 
 	case SO_ZEROCOPY:
 		if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
-			if (sk->sk_protocol != IPPROTO_TCP)
+			if (sk->sk_type == SOCK_RAW)
+				ret = -ENOTSUPP;
+			if (sk->sk_protocol != IPPROTO_TCP &&
+			    sk->sk_protocol != IPPROTO_UDP)
 				ret = -ENOTSUPP;
 		} else if (sk->sk_family != PF_RDS) {
 			ret = -ENOTSUPP;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9ccd6c28e420..b5986ff1250e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -864,8 +864,8 @@ static int __ip_append_data(struct sock *sk,
 			    unsigned int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
+	struct ubuf_info *uarg = NULL;
 	struct sk_buff *skb;
-
 	struct ip_options *opt = cork->opt;
 	int hh_len;
 	int exthdrlen;
@@ -913,6 +913,20 @@ static int __ip_append_data(struct sock *sk,
 	    !exthdrlen)
 		csummode = CHECKSUM_PARTIAL;
 
+	if (flags & MSG_ZEROCOPY && length) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+
+		if (rt->dst.dev->features & NETIF_F_SG &&
+		    csummode == CHECKSUM_PARTIAL) {
+			paged = true;
+		} else {
+			uarg->zerocopy = 0;
+			skb_zcopy_set(skb, uarg);
+		}
+	}
+
 	cork->length += length;
 
 	/* So, what's going on in the loop below?
@@ -1002,6 +1016,7 @@ static int __ip_append_data(struct sock *sk,
 			cork->tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes.
@@ -1063,7 +1078,7 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
@@ -1093,6 +1108,10 @@ static int __ip_append_data(struct sock *sk,
 			skb->data_len += copy;
 			skb->truesize += copy;
 			wmem_alloc_delta += copy;
+		} else {
+			err = skb_zerocopy_iter_dgram(skb, from, copy);
+			if (err)
+				goto error;
 		}
 		offset += copy;
 		length -= copy;
@@ -1105,6 +1124,7 @@ static int __ip_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 9fbcec4fb946..9725d283091c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1270,6 +1270,7 @@ static int __ip6_append_data(struct sock *sk,
 	int csummode = CHECKSUM_NONE;
 	unsigned int maxnonfragsize, headersize;
 	unsigned int wmem_alloc_delta = 0;
+	struct ubuf_info *uarg = NULL;
 	bool paged;
 
 	skb = skb_peek_tail(queue);
@@ -1338,6 +1339,20 @@ static int __ip6_append_data(struct sock *sk,
 			tskey = sk->sk_tskey++;
 	}
 
+	if (flags & MSG_ZEROCOPY && length) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+
+		if ((rt->dst.dev->features & NETIF_F_SG) &&
+		    csummode == CHECKSUM_PARTIAL) {
+			paged = true;
+		} else {
+			uarg->zerocopy = 0;
+			skb_zcopy_set(skb, uarg);
+		}
+	}
+
 	/*
 	 * Let's try using as much space as possible.
 	 * Use MTU if total length of the message fits into the MTU.
@@ -1460,6 +1475,7 @@ static int __ip6_append_data(struct sock *sk,
 			tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes
@@ -1520,7 +1536,7 @@ static int __ip6_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
@@ -1550,6 +1566,10 @@ static int __ip6_append_data(struct sock *sk,
 			skb->data_len += copy;
 			skb->truesize += copy;
 			wmem_alloc_delta += copy;
+		} else {
+			err = skb_zerocopy_iter_dgram(skb, from, copy);
+			if (err)
+				goto error;
 		}
 		offset += copy;
 		length -= copy;
@@ -1562,6 +1582,7 @@ static int __ip6_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH RFC net-next 08/11] selftests: udp gso
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
                   ` (6 preceding siblings ...)
  2018-04-17 20:00 ` [PATCH RFC net-next 07/11] udp: zerocopy Willem de Bruijn
@ 2018-04-17 20:00 ` Willem de Bruijn
  2018-04-17 20:00 ` [PATCH RFC net-next 09/11] selftests: udp gso with connected sockets Willem de Bruijn
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Validate udp gso, including edge cases (such as min/max gso sizes).

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/net/.gitignore |   1 +
 tools/testing/selftests/net/Makefile   |   3 +-
 tools/testing/selftests/net/udpgso.c   | 486 +++++++++++++++++++++++++
 tools/testing/selftests/net/udpgso.sh  |  16 +
 4 files changed, 505 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/udpgso.c
 create mode 100755 tools/testing/selftests/net/udpgso.sh

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index c612d6e38c62..6333b789c8fa 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -7,3 +7,4 @@ reuseport_bpf_cpu
 reuseport_bpf_numa
 reuseport_dualstack
 reuseaddr_conflict
+udpgso
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 23e725f4305e..a05a657489b9 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,12 +5,13 @@ CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetlink.sh
-TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh
+TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
 TEST_GEN_FILES += tcp_mmap
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
+TEST_GEN_PROGS += udpgso
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/net/udpgso.c b/tools/testing/selftests/net/udpgso.c
new file mode 100644
index 000000000000..68a230dfee73
--- /dev/null
+++ b/tools/testing/selftests/net/udpgso.c
@@ -0,0 +1,486 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <stddef.h>
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <net/if.h>
+#include <linux/in.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <netinet/if_ether.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/udp.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#ifndef ETH_MAX_MTU
+#define ETH_MAX_MTU	0xFFFFU
+#endif
+
+#ifndef UDP_SEGMENT
+#define UDP_SEGMENT		103
+#endif
+
+#define CONST_MTU_TEST	1500
+
+#define CONST_HDRLEN_V4		(sizeof(struct iphdr) + sizeof(struct udphdr))
+#define CONST_HDRLEN_V6		(sizeof(struct ip6_hdr) + sizeof(struct udphdr))
+
+#define CONST_MSS_V4		(CONST_MTU_TEST - CONST_HDRLEN_V4)
+#define CONST_MSS_V6		(CONST_MTU_TEST - CONST_HDRLEN_V6)
+
+#define CONST_MAX_SEGS_V4	(ETH_MAX_MTU / CONST_MSS_V4)
+#define CONST_MAX_SEGS_V6	(ETH_MAX_MTU / CONST_MSS_V6)
+
+static bool		cfg_do_ipv4;
+static bool		cfg_do_ipv6;
+static bool		cfg_do_connectionless;
+static bool		cfg_do_setsockopt;
+static int		cfg_specific_test_id = -1;
+
+static const char	cfg_ifname[] = "lo";
+static unsigned short	cfg_port = 9000;
+
+static char buf[ETH_MAX_MTU];
+
+struct testcase {
+	int tlen;		/* send() buffer size, may exceed mss */
+	bool tfail;		/* send() call is expected to fail */
+	int gso_len;		/* mss after applying gso */
+	int r_num_mss;		/* recv(): number of calls of full mss */
+	int r_len_last;		/* recv(): size of last non-mss dgram, if any */
+};
+
+const struct in6_addr addr6 = IN6ADDR_LOOPBACK_INIT;
+const struct in_addr addr4 = { .s_addr = __constant_htonl(INADDR_LOOPBACK + 2) };
+
+struct testcase testcases_v4[] = {
+	{
+		/* no GSO: send a single byte */
+		.tlen = 1,
+		.r_len_last = 1,
+	},
+	{
+		/* no GSO: send a single MSS */
+		.tlen = CONST_MSS_V4,
+		.r_num_mss = 1,
+	},
+	{
+		/* no GSO: send a single MSS + 1B: fail */
+		.tlen = CONST_MSS_V4 + 1,
+		.tfail = true,
+	},
+	{
+		/* send a single MSS: will fail with GSO, because the segment
+		 * logic in udp4_ufo_fragment demands a gso skb to be > MTU
+		 */
+		.tlen = CONST_MSS_V4,
+		.gso_len = CONST_MSS_V4,
+		.tfail = true,
+		.r_num_mss = 1,
+	},
+	{
+		/* send a single MSS + 1B */
+		.tlen = CONST_MSS_V4 + 1,
+		.gso_len = CONST_MSS_V4,
+		.r_num_mss = 1,
+		.r_len_last = 1,
+	},
+	{
+		/* send exactly 2 MSS */
+		.tlen = CONST_MSS_V4 * 2,
+		.gso_len = CONST_MSS_V4,
+		.r_num_mss = 2,
+	},
+	{
+		/* send 2 MSS + 1B */
+		.tlen = (CONST_MSS_V4 * 2) + 1,
+		.gso_len = CONST_MSS_V4,
+		.r_num_mss = 2,
+		.r_len_last = 1,
+	},
+	{
+		/* send MAX segs */
+		.tlen = (ETH_MAX_MTU / CONST_MSS_V4) * CONST_MSS_V4,
+		.gso_len = CONST_MSS_V4,
+		.r_num_mss = (ETH_MAX_MTU / CONST_MSS_V4),
+	},
+
+	{
+		/* send MAX bytes */
+		.tlen = ETH_MAX_MTU - CONST_HDRLEN_V4,
+		.gso_len = CONST_MSS_V4,
+		.r_num_mss = CONST_MAX_SEGS_V4,
+		.r_len_last = ETH_MAX_MTU - CONST_HDRLEN_V4 -
+			      (CONST_MAX_SEGS_V4 * CONST_MSS_V4),
+	},
+	{
+		/* send MAX + 1: fail */
+		.tlen = ETH_MAX_MTU - CONST_HDRLEN_V4 + 1,
+		.gso_len = CONST_MSS_V4,
+		.tfail = true,
+	},
+	{
+		/* EOL */
+	}
+};
+
+#ifndef IP6_MAX_MTU
+#define IP6_MAX_MTU	(ETH_MAX_MTU + sizeof(struct ip6_hdr))
+#endif
+
+struct testcase testcases_v6[] = {
+	{
+		/* no GSO: send a single byte */
+		.tlen = 1,
+		.r_len_last = 1,
+	},
+	{
+		/* no GSO: send a single MSS */
+		.tlen = CONST_MSS_V6,
+		.r_num_mss = 1,
+	},
+	{
+		/* no GSO: send a single MSS + 1B: fail */
+		.tlen = CONST_MSS_V6 + 1,
+		.tfail = true,
+	},
+	{
+		/* send a single MSS: will fail with GSO, because the segment
+		 * logic in udp4_ufo_fragment demands a gso skb to be > MTU
+		 */
+		.tlen = CONST_MSS_V6,
+		.gso_len = CONST_MSS_V6,
+		.tfail = true,
+		.r_num_mss = 1,
+	},
+	{
+		/* send a single MSS + 1B */
+		.tlen = CONST_MSS_V6 + 1,
+		.gso_len = CONST_MSS_V6,
+		.r_num_mss = 1,
+		.r_len_last = 1,
+	},
+	{
+		/* send exactly 2 MSS */
+		.tlen = CONST_MSS_V6 * 2,
+		.gso_len = CONST_MSS_V6,
+		.r_num_mss = 2,
+	},
+	{
+		/* send 2 MSS + 1B */
+		.tlen = (CONST_MSS_V6 * 2) + 1,
+		.gso_len = CONST_MSS_V6,
+		.r_num_mss = 2,
+		.r_len_last = 1,
+	},
+	{
+		/* send MAX segs */
+		.tlen = (IP6_MAX_MTU / CONST_MSS_V6) * CONST_MSS_V6,
+		.gso_len = CONST_MSS_V6,
+		.r_num_mss = (IP6_MAX_MTU / CONST_MSS_V6),
+	},
+
+	{
+		/* send MAX bytes */
+		.tlen = IP6_MAX_MTU - CONST_HDRLEN_V6,
+		.gso_len = CONST_MSS_V6,
+		.r_num_mss = CONST_MAX_SEGS_V6,
+		.r_len_last = IP6_MAX_MTU - CONST_HDRLEN_V6 -
+			      (CONST_MAX_SEGS_V6 * CONST_MSS_V6),
+	},
+	{
+		/* send MAX + 1: fail */
+		.tlen = IP6_MAX_MTU - CONST_HDRLEN_V6 + 1,
+		.gso_len = CONST_MSS_V6,
+		.tfail = true,
+	},
+	{
+		/* EOL */
+	}
+};
+
+static unsigned int get_device_mtu(int fd, const char *ifname)
+{
+	struct ifreq ifr;
+
+	memset(&ifr, 0, sizeof(ifr));
+
+	strcpy(ifr.ifr_name, ifname);
+
+	if (ioctl(fd, SIOCGIFMTU, &ifr))
+		error(1, errno, "ioctl get mtu");
+
+	return ifr.ifr_mtu;
+}
+
+static void __set_device_mtu(int fd, const char *ifname, unsigned int mtu)
+{
+	struct ifreq ifr;
+
+	memset(&ifr, 0, sizeof(ifr));
+
+	ifr.ifr_mtu = mtu;
+	strcpy(ifr.ifr_name, ifname);
+
+	if (ioctl(fd, SIOCSIFMTU, &ifr))
+		error(1, errno, "ioctl set mtu");
+}
+
+static void set_device_mtu(int fd, int mtu)
+{
+	int val;
+
+	val = get_device_mtu(fd, cfg_ifname);
+	fprintf(stderr, "device mtu (orig): %u\n", val);
+
+	__set_device_mtu(fd, cfg_ifname, mtu);
+	val = get_device_mtu(fd, cfg_ifname);
+	if (val != mtu)
+		error(1, 0, "unable to set device mtu to %u\n", val);
+
+	fprintf(stderr, "device mtu (test): %u\n", val);
+}
+
+static void set_pmtu_discover(int fd, bool is_ipv4)
+{
+	int level, name, val;
+
+	if (is_ipv4) {
+		level	= SOL_IP;
+		name	= IP_MTU_DISCOVER;
+		val	= IP_PMTUDISC_DO;
+	} else {
+		level	= SOL_IPV6;
+		name	= IPV6_MTU_DISCOVER;
+		val	= IPV6_PMTUDISC_DO;
+	}
+
+	if (setsockopt(fd, level, name, &val, sizeof(val)))
+		error(1, errno, "setsockopt path mtu");
+}
+
+static bool send_one(int fd, int len, int gso_len,
+		     struct sockaddr *addr, socklen_t alen)
+{
+	char control[CMSG_SPACE(sizeof(uint16_t))] = {0};
+	struct msghdr msg = {0};
+	struct iovec iov = {0};
+	struct cmsghdr *cm;
+	int ret;
+
+	iov.iov_base = buf;
+	iov.iov_len = len;
+
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
+	msg.msg_name = addr;
+	msg.msg_namelen = alen;
+
+	if (gso_len && !cfg_do_setsockopt) {
+		msg.msg_control = control;
+		msg.msg_controllen = sizeof(control);
+
+		cm = CMSG_FIRSTHDR(&msg);
+		cm->cmsg_level = SOL_UDP;
+		cm->cmsg_type = UDP_SEGMENT;
+		cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
+		*((uint16_t *) CMSG_DATA(cm)) = gso_len;
+	}
+
+	ret = sendmsg(fd, &msg, 0);
+	if (ret == -1 && (errno == EMSGSIZE || errno == ENOMEM))
+		return false;
+	if (ret == -1)
+		error(1, errno, "sendmsg");
+	if (ret != len)
+		error(1, 0, "sendto: %d != %u", ret, len);
+
+	return true;
+}
+
+static int recv_one(int fd, int flags)
+{
+	int ret;
+
+	ret = recv(fd, buf, sizeof(buf), flags);
+	if (ret == -1 && errno == EAGAIN && (flags & MSG_DONTWAIT))
+		return 0;
+	if (ret == -1)
+		error(1, errno, "recv");
+
+	return ret;
+}
+
+static void run_one(struct testcase *test, int fdt, int fdr,
+		    struct sockaddr *addr, socklen_t alen)
+{
+	int i, ret, val, mss;
+	bool sent;
+
+	fprintf(stderr, "ipv%d tx:%d gso:%d %s\n",
+			addr->sa_family == AF_INET ? 4 : 6,
+			test->tlen, test->gso_len,
+			test->tfail ? "(fail)" : "");
+
+	val = test->gso_len;
+	if (cfg_do_setsockopt) {
+		if (setsockopt(fdt, SOL_UDP, UDP_SEGMENT, &val, sizeof(val)))
+			error(1, errno, "setsockopt udp segment");
+	}
+
+	sent = send_one(fdt, test->tlen, test->gso_len, addr, alen);
+	if (sent && test->tfail)
+		error(1, 0, "send succeeded while expecting failure");
+	if (!sent && !test->tfail)
+		error(1, 0, "send failed while expecting success");
+	if (!sent)
+		return;
+
+	mss = addr->sa_family == AF_INET ? CONST_MSS_V4 : CONST_MSS_V6;
+
+	/* Recv all full MSS datagrams */
+	for (i = 0; i < test->r_num_mss; i++) {
+		ret = recv_one(fdr, 0);
+		if (ret != mss)
+			error(1, 0, "recv.%d: %d != %d", i, ret, mss);
+	}
+
+	/* Recv the non-full last datagram, if tlen was not a multiple of mss */
+	if (test->r_len_last) {
+		ret = recv_one(fdr, 0);
+		if (ret != test->r_len_last)
+			error(1, 0, "recv.%d: %d != %d (last)",
+			      i, ret, test->r_len_last);
+	}
+
+	/* Verify received all data */
+	ret = recv_one(fdr, MSG_DONTWAIT);
+	if (ret)
+		error(1, 0, "recv: unexpected datagram");
+}
+
+static void run_all(int fdt, int fdr, struct sockaddr *addr, socklen_t alen)
+{
+	struct testcase *tests, *test;
+
+	tests = addr->sa_family == AF_INET ? testcases_v4 : testcases_v6;
+
+	for (test = tests; test->tlen; test++) {
+		/* if a specific test is given, then skip all others */
+		if (cfg_specific_test_id == -1 ||
+		    cfg_specific_test_id == test - tests)
+			run_one(test, fdt, fdr, addr, alen);
+	}
+}
+
+static void run_test(struct sockaddr *addr, socklen_t alen)
+{
+	struct timeval tv = { .tv_usec = 100 * 1000 };
+	int fdr, fdt;
+
+	fdr = socket(addr->sa_family, SOCK_DGRAM, 0);
+	if (fdr == -1)
+		error(1, errno, "socket r");
+
+	if (bind(fdr, addr, alen))
+		error(1, errno, "bind");
+
+	/* Have tests fail quickly instead of hang */
+	if (setsockopt(fdr, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)))
+		error(1, errno, "setsockopt rcv timeout");
+
+	fdt = socket(addr->sa_family, SOCK_DGRAM, 0);
+	if (fdt == -1)
+		error(1, errno, "socket t");
+
+	/* Do not fragment these datagrams: only succeed if GSO works */
+	set_pmtu_discover(fdt, addr->sa_family == AF_INET);
+
+	if (cfg_do_connectionless) {
+		set_device_mtu(fdt, CONST_MTU_TEST);
+		run_all(fdt, fdr, addr, alen);
+	}
+
+	if (close(fdt))
+		error(1, errno, "close t");
+	if (close(fdr))
+		error(1, errno, "close r");
+}
+
+static void run_test_v4(void)
+{
+	struct sockaddr_in addr = {0};
+
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(cfg_port);
+	addr.sin_addr = addr4;
+
+	run_test((void *)&addr, sizeof(addr));
+}
+
+static void run_test_v6(void)
+{
+	struct sockaddr_in6 addr = {0};
+
+	addr.sin6_family = AF_INET6;
+	addr.sin6_port = htons(cfg_port);
+	addr.sin6_addr = addr6;
+
+	run_test((void *)&addr, sizeof(addr));
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "46Cst:")) != -1) {
+		switch (c) {
+		case '4':
+			cfg_do_ipv4 = true;
+			break;
+		case '6':
+			cfg_do_ipv6 = true;
+			break;
+		case 'C':
+			cfg_do_connectionless = true;
+			break;
+		case 's':
+			cfg_do_setsockopt = true;
+			break;
+		case 't':
+			cfg_specific_test_id = strtoul(optarg, NULL, 0);
+			break;
+		default:
+			error(1, 0, "%s: parse error", argv[0]);
+		}
+	}
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+
+	if (cfg_do_ipv4)
+		run_test_v4();
+	if (cfg_do_ipv6)
+		run_test_v6();
+
+	fprintf(stderr, "OK\n");
+	return 0;
+}
+
diff --git a/tools/testing/selftests/net/udpgso.sh b/tools/testing/selftests/net/udpgso.sh
new file mode 100755
index 000000000000..7977b97e060c
--- /dev/null
+++ b/tools/testing/selftests/net/udpgso.sh
@@ -0,0 +1,16 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a series of udpgso regression tests
+
+echo "ipv4 cmsg"
+./in_netns.sh ./udpgso -4 -C
+
+echo "ipv4 setsockopt"
+./in_netns.sh ./udpgso -4 -C -s
+
+echo "ipv6 cmsg"
+./in_netns.sh ./udpgso -6 -C
+
+echo "ipv6 setsockopt"
+./in_netns.sh ./udpgso -6 -C -s
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH RFC net-next 09/11] selftests: udp gso with connected sockets
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
                   ` (7 preceding siblings ...)
  2018-04-17 20:00 ` [PATCH RFC net-next 08/11] selftests: udp gso Willem de Bruijn
@ 2018-04-17 20:00 ` Willem de Bruijn
  2018-04-17 20:15 ` [PATCH RFC net-next 00/11] udp gso Sowmini Varadhan
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:00 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Connected sockets use path mtu instead of device mtu.

Test this path by inserting a route mtu that is lower than the device
mtu. Verify that the path mtu for the connection matches this lower
number, then run the same test as in the connectionless case.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/net/udpgso.c  | 117 +++++++++++++++++++++++++-
 tools/testing/selftests/net/udpgso.sh |   7 ++
 2 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/udpgso.c b/tools/testing/selftests/net/udpgso.c
index 68a230dfee73..a47dca025346 100644
--- a/tools/testing/selftests/net/udpgso.c
+++ b/tools/testing/selftests/net/udpgso.c
@@ -47,6 +47,7 @@
 
 static bool		cfg_do_ipv4;
 static bool		cfg_do_ipv6;
+static bool		cfg_do_connected;
 static bool		cfg_do_connectionless;
 static bool		cfg_do_setsockopt;
 static int		cfg_specific_test_id = -1;
@@ -273,6 +274,101 @@ static void set_pmtu_discover(int fd, bool is_ipv4)
 		error(1, errno, "setsockopt path mtu");
 }
 
+static unsigned int get_path_mtu(int fd, bool is_ipv4)
+{
+	socklen_t vallen;
+	unsigned int mtu;
+	int ret;
+
+	vallen = sizeof(mtu);
+	if (is_ipv4)
+		ret = getsockopt(fd, SOL_IP, IP_MTU, &mtu, &vallen);
+	else
+		ret = getsockopt(fd, SOL_IPV6, IPV6_MTU, &mtu, &vallen);
+
+	if (ret)
+		error(1, errno, "getsockopt mtu");
+
+
+	fprintf(stderr, "path mtu (read):  %u\n", mtu);
+	return mtu;
+}
+
+/* very wordy version of system("ip route add dev lo mtu 1500 127.0.0.3/32") */
+static void set_route_mtu(int mtu, bool is_ipv4)
+{
+	struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
+	struct nlmsghdr *nh;
+	struct rtattr *rta;
+	struct rtmsg *rt;
+	char data[NLMSG_ALIGN(sizeof(*nh)) +
+		  NLMSG_ALIGN(sizeof(*rt)) +
+		  NLMSG_ALIGN(RTA_LENGTH(sizeof(addr6))) +
+		  NLMSG_ALIGN(RTA_LENGTH(sizeof(int))) +
+		  NLMSG_ALIGN(RTA_LENGTH(0) + RTA_LENGTH(sizeof(int)))];
+	int fd, ret, alen, off = 0;
+
+	alen = is_ipv4 ? sizeof(addr4) : sizeof(addr6);
+
+	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (fd == -1)
+		error(1, errno, "socket netlink");
+
+	memset(data, 0, sizeof(data));
+
+	nh = (void *)data;
+	nh->nlmsg_type = RTM_NEWROUTE;
+	nh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE;
+	off += NLMSG_ALIGN(sizeof(*nh));
+
+	rt = (void *)(data + off);
+	rt->rtm_family = is_ipv4 ? AF_INET : AF_INET6;
+	rt->rtm_table = RT_TABLE_MAIN;
+	rt->rtm_dst_len = alen << 3;
+	rt->rtm_protocol = RTPROT_BOOT;
+	rt->rtm_scope = RT_SCOPE_UNIVERSE;
+	rt->rtm_type = RTN_UNICAST;
+	off += NLMSG_ALIGN(sizeof(*rt));
+
+	rta = (void *)(data + off);
+	rta->rta_type = RTA_DST;
+	rta->rta_len = RTA_LENGTH(alen);
+	if (is_ipv4)
+		memcpy(RTA_DATA(rta), &addr4, alen);
+	else
+		memcpy(RTA_DATA(rta), &addr6, alen);
+	off += NLMSG_ALIGN(rta->rta_len);
+
+	rta = (void *)(data + off);
+	rta->rta_type = RTA_OIF;
+	rta->rta_len = RTA_LENGTH(sizeof(int));
+	*((int *)(RTA_DATA(rta))) = 1; //if_nametoindex("lo");
+	off += NLMSG_ALIGN(rta->rta_len);
+
+	/* MTU is a subtype in a metrics type */
+	rta = (void *)(data + off);
+	rta->rta_type = RTA_METRICS;
+	rta->rta_len = RTA_LENGTH(0) + RTA_LENGTH(sizeof(int));
+	off += NLMSG_ALIGN(rta->rta_len);
+
+	/* now fill MTU subtype. Note that it fits within above rta_len */
+	rta = (void *)(((char *) rta) + RTA_LENGTH(0));
+	rta->rta_type = RTAX_MTU;
+	rta->rta_len = RTA_LENGTH(sizeof(int));
+	*((int *)(RTA_DATA(rta))) = mtu;
+
+	nh->nlmsg_len = off;
+
+	ret = sendto(fd, data, off, 0, (void *)&nladdr, sizeof(nladdr));
+	if (ret != off)
+		error(1, errno, "send netlink: %uB != %uB\n", ret, off);
+
+	if (close(fd))
+		error(1, errno, "close netlink");
+
+	fprintf(stderr, "route mtu (test): %u\n", mtu);
+}
+
 static bool send_one(int fd, int len, int gso_len,
 		     struct sockaddr *addr, socklen_t alen)
 {
@@ -391,7 +487,7 @@ static void run_all(int fdt, int fdr, struct sockaddr *addr, socklen_t alen)
 static void run_test(struct sockaddr *addr, socklen_t alen)
 {
 	struct timeval tv = { .tv_usec = 100 * 1000 };
-	int fdr, fdt;
+	int fdr, fdt, val;
 
 	fdr = socket(addr->sa_family, SOCK_DGRAM, 0);
 	if (fdr == -1)
@@ -416,6 +512,20 @@ static void run_test(struct sockaddr *addr, socklen_t alen)
 		run_all(fdt, fdr, addr, alen);
 	}
 
+	if (cfg_do_connected) {
+		set_device_mtu(fdt, CONST_MTU_TEST + 100);
+		set_route_mtu(CONST_MTU_TEST, addr->sa_family == AF_INET);
+
+		if (connect(fdt, addr, alen))
+			error(1, errno, "connect");
+
+		val = get_path_mtu(fdt, addr->sa_family == AF_INET);
+		if (val != CONST_MTU_TEST)
+			error(1, 0, "bad path mtu %u\n", val);
+
+		run_all(fdt, fdr, addr, 0 /* use connected addr */);
+	}
+
 	if (close(fdt))
 		error(1, errno, "close t");
 	if (close(fdr))
@@ -448,7 +558,7 @@ static void parse_opts(int argc, char **argv)
 {
 	int c;
 
-	while ((c = getopt(argc, argv, "46Cst:")) != -1) {
+	while ((c = getopt(argc, argv, "46cCst:")) != -1) {
 		switch (c) {
 		case '4':
 			cfg_do_ipv4 = true;
@@ -456,6 +566,9 @@ static void parse_opts(int argc, char **argv)
 		case '6':
 			cfg_do_ipv6 = true;
 			break;
+		case 'c':
+			cfg_do_connected = true;
+			break;
 		case 'C':
 			cfg_do_connectionless = true;
 			break;
diff --git a/tools/testing/selftests/net/udpgso.sh b/tools/testing/selftests/net/udpgso.sh
index 7977b97e060c..7cdf0e7c1dde 100755
--- a/tools/testing/selftests/net/udpgso.sh
+++ b/tools/testing/selftests/net/udpgso.sh
@@ -14,3 +14,10 @@ echo "ipv6 cmsg"
 
 echo "ipv6 setsockopt"
 ./in_netns.sh ./udpgso -6 -C -s
+
+echo "ipv4 connected"
+./in_netns.sh ./udpgso -4 -c
+
+# blocked on 2nd loopback address
+# echo "ipv6 connected"
+# ./in_netns.sh ./udpgso -6 -c
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
                   ` (8 preceding siblings ...)
  2018-04-17 20:00 ` [PATCH RFC net-next 09/11] selftests: udp gso with connected sockets Willem de Bruijn
@ 2018-04-17 20:15 ` Sowmini Varadhan
  2018-04-17 20:23   ` Willem de Bruijn
  2018-04-18 11:17 ` Paolo Abeni
  2018-04-18 17:50 ` David Miller
  11 siblings, 1 reply; 52+ messages in thread
From: Sowmini Varadhan @ 2018-04-17 20:15 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn

On (04/17/18 16:00), Willem de Bruijn wrote:
> 
> This patchset implements GSO for UDP. A process can concatenate and
> submit multiple datagrams to the same destination in one send call
> by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
> or passing an analogous cmsg at send time.
> 
> The stack will send the entire large (up to network layer max size)
> datagram through the protocol layer. At the GSO layer, it is broken
> up in individual segments. All receive the same network layer header
> and UDP src and dst port. All but the last segment have the same UDP
> header, but the last may differ in length and checksum.

I'll go through the patch-set later today/tomorrow (interesting!), 
but a question: what are message boundary semantics in this model? E.g.,
if I do a udp_sendmsg of 2000 bytes, and then a udp_sendmsg of
512 bytes, will the receiver first recvmsg 2000 bytes and then the
512 bytes?

My understanding of the comment above is that no, it will not- 
because (assuming an mtu of 1500) there will be 2 UDP messages on 
the wire, the first one with 1500 bytes, and the second with 1012
bytes (with some discounts for the various L2/L3 etc headers)

--Sowmini

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-17 20:15 ` [PATCH RFC net-next 00/11] udp gso Sowmini Varadhan
@ 2018-04-17 20:23   ` Willem de Bruijn
  2018-04-17 20:48     ` Sowmini Varadhan
  0 siblings, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 20:23 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: Network Development, Willem de Bruijn

On Tue, Apr 17, 2018 at 4:15 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (04/17/18 16:00), Willem de Bruijn wrote:
>>
>> This patchset implements GSO for UDP. A process can concatenate and
>> submit multiple datagrams to the same destination in one send call
>> by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
>> or passing an analogous cmsg at send time.
>>
>> The stack will send the entire large (up to network layer max size)
>> datagram through the protocol layer. At the GSO layer, it is broken
>> up in individual segments. All receive the same network layer header
>> and UDP src and dst port. All but the last segment have the same UDP
>> header, but the last may differ in length and checksum.
>
> I'll go through the patch-set later today/tomorrow (interesting!),

Thanks!

> but a question: what are message boundary semantics in this model? E.g.,
> if I do a udp_sendmsg of 2000 bytes, and then then a udp_sendmsg of
> 512 bytes, with the receiver first recvmsg 2000 bytes and then the
> 512 bytes?
>
> My understanding of the comment above is that no, it will not-
> because (assuming an mtu of 1500) there will be 2 UDP messages on
> the wire, the first one with 1500 bytes, and the second with 1012
> bytes (with some discounts for the various L2/L3 etc headers)

Assuming IPv4 with an MTU of 1500 and the maximum segment
size of 1472, the receiver will see three datagrams with MSS of
1472B, 528B and 512B.

The feature does not combine payloads of subsequent datagrams.
It only combines payloads of subsequent send calls if they are
building a single datagram using corking. But due to lack of checksum
offload with corking, this is actually disabled.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-17 20:23   ` Willem de Bruijn
@ 2018-04-17 20:48     ` Sowmini Varadhan
  2018-04-17 21:07       ` Willem de Bruijn
  0 siblings, 1 reply; 52+ messages in thread
From: Sowmini Varadhan @ 2018-04-17 20:48 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Network Development, Willem de Bruijn

On (04/17/18 16:23), Willem de Bruijn wrote:
> 
> Assuming IPv4 with an MTU of 1500 and the maximum segment
> size of 1472, the receiver will see three datagrams with MSS of
> 1472B, 528B and 512B.

so the recvmsg will also pass up 1472, 528, 512, right?
If yes, how will the recvmsg differentiate between the case
(2000 byte message followed by 512 byte message) and
(1472 byte message, 528 byte message, then 512 byte message),
in other words, how are UDP message boundary semantics preserved?

--Sowmini

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-17 20:48     ` Sowmini Varadhan
@ 2018-04-17 21:07       ` Willem de Bruijn
  2018-04-18  2:25         ` Samudrala, Sridhar
  2018-08-31  9:09         ` Paolo Abeni
  0 siblings, 2 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-17 21:07 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: Network Development, Willem de Bruijn

On Tue, Apr 17, 2018 at 4:48 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (04/17/18 16:23), Willem de Bruijn wrote:
>>
>> Assuming IPv4 with an MTU of 1500 and the maximum segment
>> size of 1472, the receiver will see three datagrams with MSS of
>> 1472B, 528B and 512B.
>
> so the recvmsg will also pass up 1472, 528, 512, right?

That's right.

> If yes, how will the recvmsg differentiate between the case
> (2000 byte message followed by 512 byte message) and
> (1472 byte message, 528 byte message, then 512 byte message),
> in other words, how are UDP message boundary semantics preserved?

They aren't. This is purely an optimization to amortize the cost of
repeated tx stack traversal. Unlike UFO, which would preserve the
boundaries of the original larger than MTU datagram.

A prime use case is bulk transfer of data. Think video streaming
with QUIC. It must send MTU sized or smaller packets, but has
no application-layer requirement to reconstruct large packets on
the peer.

That said, for negotiated flows an inverse GRO feature could
conceivably be implemented to reduce rx stack traversal, too.
Though due to interleaving of packets on the wire, the aggregation
would be best effort, similar to TCP TSO and GRO using the
PSH bit as packetization signal.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 06/11] udp: add gso support to virtual devices
  2018-04-17 20:00 ` [PATCH RFC net-next 06/11] udp: add gso support to virtual devices Willem de Bruijn
@ 2018-04-18  0:43   ` Dimitris Michailidis
  2018-04-18  3:27     ` Willem de Bruijn
  0 siblings, 1 reply; 52+ messages in thread
From: Dimitris Michailidis @ 2018-04-18  0:43 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn

On Tue, Apr 17, 2018 at 1:00 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> Virtual devices such as tunnels and bonding can handle large packets.
> Only segment packets when reaching a physical or loopback device.
>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  include/linux/netdev_features.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index 35b79f47a13d..1e4883bb02a7 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -80,6 +80,7 @@ enum {
>
>         NETIF_F_GRO_HW_BIT,             /* Hardware Generic receive offload */
>         NETIF_F_HW_TLS_RECORD_BIT,      /* Offload TLS record */
> +       NETIF_F_GSO_UDP_L4_BIT,         /* UDP payload GSO (not UFO) */

Please add an entry for the new flag to
net/core/ethtool.c:netdev_features_strings
and a description to Documentation/networking/netdev-features.txt.

>
>         /*
>          * Add your fresh new feature above and remember to update
> @@ -147,6 +148,7 @@ enum {
>  #define NETIF_F_HW_ESP_TX_CSUM __NETIF_F(HW_ESP_TX_CSUM)
>  #define        NETIF_F_RX_UDP_TUNNEL_PORT  __NETIF_F(RX_UDP_TUNNEL_PORT)
>  #define NETIF_F_HW_TLS_RECORD  __NETIF_F(HW_TLS_RECORD)
> +#define NETIF_F_GSO_UDP_L4     __NETIF_F(GSO_UDP_L4)
>
>  #define for_each_netdev_feature(mask_addr, bit)        \
>         for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
> @@ -216,6 +218,7 @@ enum {
>                                  NETIF_F_GSO_GRE_CSUM |                 \
>                                  NETIF_F_GSO_IPXIP4 |                   \
>                                  NETIF_F_GSO_IPXIP6 |                   \
> +                                NETIF_F_GSO_UDP_L4 |                   \
>                                  NETIF_F_GSO_UDP_TUNNEL |               \
>                                  NETIF_F_GSO_UDP_TUNNEL_CSUM)
>
> --
> 2.17.0.484.g0c8726318c-goog
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-17 21:07       ` Willem de Bruijn
@ 2018-04-18  2:25         ` Samudrala, Sridhar
  2018-04-18  3:33           ` Willem de Bruijn
  2018-08-31  9:09         ` Paolo Abeni
  1 sibling, 1 reply; 52+ messages in thread
From: Samudrala, Sridhar @ 2018-04-18  2:25 UTC (permalink / raw)
  To: Willem de Bruijn, Sowmini Varadhan; +Cc: Network Development, Willem de Bruijn


On 4/17/2018 2:07 PM, Willem de Bruijn wrote:
> On Tue, Apr 17, 2018 at 4:48 PM, Sowmini Varadhan
> <sowmini.varadhan@oracle.com> wrote:
>> On (04/17/18 16:23), Willem de Bruijn wrote:
>>> Assuming IPv4 with an MTU of 1500 and the maximum segment
>>> size of 1472, the receiver will see three datagrams with MSS of
>>> 1472B, 528B and 512B.
>> so the recvmsg will also pass up 1472, 528, 512, right?
> That's right.
>
>> If yes, how will the recvmsg differentiate between the case
>> (2000 byte message followed by 512 byte message) and
>> (1472 byte message, 528 byte message, then 512 byte message),
>> in other words, how are UDP message boundary semantics preserved?
> They aren't. This is purely an optimization to amortize the cost of
> repeated tx stack traversal. Unlike UFO, which would preserve the
> boundaries of the original larger than MTU datagram.

Doesn't this break UDP applications that expect message boundary
preservation semantics? Is it possible to negotiate this feature?

>
> A prime use case is bulk transfer of data. Think video streaming
> with QUIC. It must send MTU sized or smaller packets, but has
> no application-layer requirement to reconstruct large packets on
> the peer.
>
> That said, for negotiated flows an inverse GRO feature could
> conceivably be implemented to reduce rx stack traversal, too.
>> Though due to interleaving of packets on the wire, the aggregation
> would be best effort, similar to TCP TSO and GRO using the
> PSH bit as packetization signal.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 06/11] udp: add gso support to virtual devices
  2018-04-18  0:43   ` Dimitris Michailidis
@ 2018-04-18  3:27     ` Willem de Bruijn
  0 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-18  3:27 UTC (permalink / raw)
  To: Dimitris Michailidis; +Cc: netdev, Willem de Bruijn

On Tue, Apr 17, 2018 at 8:43 PM, Dimitris Michailidis
<dmichail@google.com> wrote:
> On Tue, Apr 17, 2018 at 1:00 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> From: Willem de Bruijn <willemb@google.com>
>>
>> Virtual devices such as tunnels and bonding can handle large packets.
>> Only segment packets when reaching a physical or loopback device.
>>
>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>> ---
>>  include/linux/netdev_features.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
>> index 35b79f47a13d..1e4883bb02a7 100644
>> --- a/include/linux/netdev_features.h
>> +++ b/include/linux/netdev_features.h
>> @@ -80,6 +80,7 @@ enum {
>>
>>         NETIF_F_GRO_HW_BIT,             /* Hardware Generic receive offload */
>>         NETIF_F_HW_TLS_RECORD_BIT,      /* Offload TLS record */
>> +       NETIF_F_GSO_UDP_L4_BIT,         /* UDP payload GSO (not UFO) */
>
> Please add an entry for the new flag to
> net/core/ethtool.c:netdev_features_strings
> and a description to Documentation/networking/netdev-features.txt.

Will do. I initially wrote this as a transparent kernel-internal feature,
but indeed it should be observable and configurable.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18  2:25         ` Samudrala, Sridhar
@ 2018-04-18  3:33           ` Willem de Bruijn
  2018-04-18 12:31             ` Sowmini Varadhan
  0 siblings, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-18  3:33 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Sowmini Varadhan, Network Development, Willem de Bruijn

On Tue, Apr 17, 2018 at 10:25 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> On 4/17/2018 2:07 PM, Willem de Bruijn wrote:
>>
>> On Tue, Apr 17, 2018 at 4:48 PM, Sowmini Varadhan
>> <sowmini.varadhan@oracle.com> wrote:
>>>
>>> On (04/17/18 16:23), Willem de Bruijn wrote:
>>>>
>>>> Assuming IPv4 with an MTU of 1500 and the maximum segment
>>>> size of 1472, the receiver will see three datagrams with MSS of
>>>> 1472B, 528B and 512B.
>>>
>>> so the recvmsg will also pass up 1472, 528, 512, right?
>>
>> That's right.
>>
>>> If yes, how will the recvmsg differentiate between the case
>>> (2000 byte message followed by 512 byte message) and
>>> (1472 byte message, 528 byte message, then 512 byte message),
>>> in other words, how are UDP message boundary semantics preserved?
>>
>> They aren't. This is purely an optimization to amortize the cost of
>> repeated tx stack traversal. Unlike UFO, which would preserve the
>> boundaries of the original larger than MTU datagram.
>
>
> Doesn't this break UDP applications that expect message boundary
> preservation semantics? Is it possible to negotiate this feature?

A process has to explicitly request the feature with socket option
or cmsg UDP_SEGMENT. By setting that to the gso size, it signals
its intent to send multiple datagrams in one call.

Or were you responding to the hypothetical GRO example below?
Yes, that clearly would have to be limited to negotiated flows, not
unlike how foo-over-udp tunneling is detected. It is also not a serious
suggestion at this point.

>> A prime use case is bulk transfer of data. Think video streaming
>> with QUIC. It must send MTU sized or smaller packets, but has
>> no application-layer requirement to reconstruct large packets on
>> the peer.
>>
>> That said, for negotiated flows an inverse GRO feature could
>> conceivably be implemented to reduce rx stack traversal, too.
>> Though due to interleaving of packets on the wire, the aggregation
>> would be best effort, similar to TCP TSO and GRO using the
>> PSH bit as packetization signal.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
                   ` (9 preceding siblings ...)
  2018-04-17 20:15 ` [PATCH RFC net-next 00/11] udp gso Sowmini Varadhan
@ 2018-04-18 11:17 ` Paolo Abeni
  2018-04-18 13:49   ` Willem de Bruijn
  2018-04-18 17:24   ` David Miller
  2018-04-18 17:50 ` David Miller
  11 siblings, 2 replies; 52+ messages in thread
From: Paolo Abeni @ 2018-04-18 11:17 UTC (permalink / raw)
  To: Willem de Bruijn, netdev; +Cc: Willem de Bruijn

On Tue, 2018-04-17 at 16:00 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Segmentation offload reduces cycles/byte for large packets by
> amortizing the cost of protocol stack traversal.
> 
> This patchset implements GSO for UDP. A process can concatenate and
> submit multiple datagrams to the same destination in one send call
> by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
> or passing an analogous cmsg at send time.
> 
> The stack will send the entire large (up to network layer max size)
> datagram through the protocol layer. At the GSO layer, it is broken
> up in individual segments. All receive the same network layer header
> and UDP src and dst port. All but the last segment have the same UDP
> header, but the last may differ in length and checksum.

This is interesting, thanks for sharing!

I have some local patches somewhere implementing UDP GRO, but I never
tried to upstream them, since I lacked the associated GSO and I thought
that the use-case was not too relevant.

Given that your use-case is a connected socket - no per packet route
lookup - how does GSO performs compared to plain sendmmsg()? Have you
considered using and/or improving the latter?

When testing with Spectre/Meltdown mitigations in place, I expect that
the most relevant part of the gain is due to the single syscall per
burst.

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18  3:33           ` Willem de Bruijn
@ 2018-04-18 12:31             ` Sowmini Varadhan
  2018-04-18 13:35               ` Eric Dumazet
                                 ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Sowmini Varadhan @ 2018-04-18 12:31 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Samudrala, Sridhar, Network Development, Willem de Bruijn


I went through the patch set and the code looks fine- it extends existing
infra for TCP/GSO to UDP.

One thing that was not clear to me about the API: shouldn't UDP_SEGMENT
just be automatically determined in the stack from the pmtu? What's
the motivation for the socket option for this? also AIUI this can be
either a per-socket or a per-packet option?

However, I share Sridhar's concerns about the very fundamental change
to UDP message boundary semantics here.  There is actually no such thing
as a "segment" in udp, so in general this feature makes me a little
uneasy.  Well behaved udp applications should already be sending mtu
sized datagrams. And the not-so-well-behaved ones are probably relying
on IP fragmentation/reassembly to take care of datagram boundary semantics
for them?

As Sridhar points out, the feature is not really "negotiated" - one side
unilaterally sets the option. If the receiver is a classic/POSIX UDP
implementation, it will have no way of knowing that message boundaries
have been re-adjusted at the sender.  

One thought to recover from this: use the infra being proposed in
  https://tools.ietf.org/html/draft-touch-tsvwg-udp-options-09
to include a new UDP TLV option that tracks datagram# (similar to IP ID)
to help the receiver reassemble the UDP datagram and pass it up with
the POSIX-conformant UDP message boundary. I realize that this is also
not a perfect solution: as you point out, there are risks from
packet re-ordering/drops- you may well end up just reinventing IP
frag/re-assembly when you are done (with just the slight improvement
that each "fragment" has a full UDP header, so it has a better shot
at ECMP and RSS).

--Sowmini

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 12:31             ` Sowmini Varadhan
@ 2018-04-18 13:35               ` Eric Dumazet
  2018-04-18 13:47                 ` Sowmini Varadhan
  2018-04-18 13:59               ` Willem de Bruijn
  2018-04-18 17:28               ` David Miller
  2 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2018-04-18 13:35 UTC (permalink / raw)
  To: Sowmini Varadhan, Willem de Bruijn
  Cc: Samudrala, Sridhar, Network Development, Willem de Bruijn



On 04/18/2018 05:31 AM, Sowmini Varadhan wrote:
> 
> I went through the patch set and the code looks fine- it extends existing
> infra for TCP/GSO to UDP.
> 
> One thing that was not clear to me about the API: shouldn't UDP_SEGMENT
> just be automatically determined in the stack from the pmtu? What's
> the motivation for the socket option for this? also AIUI this can be
> either a per-socket or a per-packet option?
> 
> However, I share Sridhar's concerns about the very fundamental change
> to UDP message boundary semantics here.  

There is no change at all.

This will only be used as a mechanism to send X packets of same size.

So instead of X system calls , one system call.

One traversal of some expensive part of the host stack.

The content on the wire should be the same.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 13:35               ` Eric Dumazet
@ 2018-04-18 13:47                 ` Sowmini Varadhan
  2018-04-18 13:51                   ` Willem de Bruijn
  2018-04-18 17:34                   ` David Miller
  0 siblings, 2 replies; 52+ messages in thread
From: Sowmini Varadhan @ 2018-04-18 13:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Samudrala, Sridhar, Network Development,
	Willem de Bruijn

On (04/18/18 06:35), Eric Dumazet wrote:
> 
> There is no change at all.
> 
> This will only be used as a mechanism to send X packets of same size.
> 
> So instead of X system calls , one system call.
> 
> One traversal of some expensive part of the host stack.
> 
> The content on the wire should be the same.

I'm sorry that's not how I interpret Willem's email below
(and maybe I misunderstood)

the following taken from https://www.spinics.net/lists/netdev/msg496150.html

Sowmini> If yes, how will the recvmsg differentiate between the case
Sowmini> (2000 byte message followed by 512 byte message) and
Sowmini> (1472 byte message, 528 byte message, then 512 byte message),
Sowmini> in other words, how are UDP message boundary semantics preserved?

Willem> They aren't. This is purely an optimization to amortize the cost of
Willem> repeated tx stack traversal. Unlike UFO, which would preserve the
Willem> boundaries of the original larger than MTU datagram.

As I understand Willem's explanation, if I do a sendmsg of 2000 bytes,
- classic UDP will send 2 IP fragments, the first one with a full UDP
  header, and the IP header indicating that this is the first frag for
  that ipid, with more frags to follow. The second frag will have the
  rest with the same ipid; it will not have a udp header,
  and it will indicate that it is the last frag (no more frags).

  The receiver can thus use the ipid, "more-frags" bit, frag offset etc
  to stitch the 2000 byte udp message together and pass it up on the udp
  socket.

- in the "GSO" proposal my 2000 bytes of data are sent as *two*
  udp packets, each of them with a unique udp header, and uh_len set
  to 1476 (for first) and 526 (for second). The receiver has no clue
  that they are both part of the same UDP datagram, so the wire format
  is not the same. Am I mistaken?

--Sowmini

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 11:17 ` Paolo Abeni
@ 2018-04-18 13:49   ` Willem de Bruijn
  2018-05-24  0:02     ` Marcelo Ricardo Leitner
  2018-04-18 17:24   ` David Miller
  1 sibling, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-18 13:49 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Network Development, Willem de Bruijn

On Wed, Apr 18, 2018 at 7:17 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> On Tue, 2018-04-17 at 16:00 -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn <willemb@google.com>
>>
>> Segmentation offload reduces cycles/byte for large packets by
>> amortizing the cost of protocol stack traversal.
>>
>> This patchset implements GSO for UDP. A process can concatenate and
>> submit multiple datagrams to the same destination in one send call
>> by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
>> or passing an analogous cmsg at send time.
>>
>> The stack will send the entire large (up to network layer max size)
>> datagram through the protocol layer. At the GSO layer, it is broken
>> up in individual segments. All receive the same network layer header
>> and UDP src and dst port. All but the last segment have the same UDP
>> header, but the last may differ in length and checksum.
>
> This is interesting, thanks for sharing!
>
> I have some local patches somewhere implementing UDP GRO, but I never
> tried to upstream them, since I lacked the associated GSO and I thought
> that the use-case was not too relevant.
>
> Given that your use-case is a connected socket - no per packet route
> lookup - how does GSO perform compared to plain sendmmsg()? Have you
> considered using and/or improving the latter?
>
> When testing with Spectre/Meltdown mitigation in places, I expect that
> the most relevant part of the gain is due to the single syscall per
> burst.

The main benefit is actually not route lookup avoidance. Somewhat to
my surprise. The benchmark can be run both in connected and
unconnected ('-u') mode. Both saturate the CPU, so only throughput
is shown:

[connected]   udp tx:    825 MB/s   588336 calls/s  14008 msg/s
[unconnected] udp tx:    711 MB/s   506646 calls/s  12063 msg/s

This roughly 15% difference corresponds to results previously seen
with other applications.

When looking at a perf report, there is no clear hot spot, which
indicates that the savings accrue across the protocol stack traversal.

I just hacked up a sendmmsg extension to the benchmark to verify.
Indeed that does not have nearly the same benefit as GSO:

udp tx:    976 MB/s   695394 calls/s  16557 msg/s

This matches the numbers seen from TCP without TSO and GSO.
That also makes few system calls, but still incurs per-MTU stack traversal.

I pushed the branch to my github at

  https://github.com/wdebruij/linux/tree/udpgso-20180418

and also the version I sent for RFC yesterday at

  https://github.com/wdebruij/linux/tree/udpgso-rfc-v1

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 13:47                 ` Sowmini Varadhan
@ 2018-04-18 13:51                   ` Willem de Bruijn
  2018-04-18 15:08                     ` Samudrala, Sridhar
  2018-04-18 17:40                     ` David Miller
  2018-04-18 17:34                   ` David Miller
  1 sibling, 2 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-18 13:51 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Eric Dumazet, Samudrala, Sridhar, Network Development, Willem de Bruijn

On Wed, Apr 18, 2018 at 9:47 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (04/18/18 06:35), Eric Dumazet wrote:
>>
>> There is no change at all.
>>
>> This will only be used as a mechanism to send X packets of same size.
>>
>> So instead of X system calls , one system call.
>>
>> One traversal of some expensive part of the host stack.
>>
>> The content on the wire should be the same.
>
> I'm sorry that's not how I interpret Willem's email below
> (and maybe I misunderstood)
>
> the following taken from https://www.spinics.net/lists/netdev/msg496150.html
>
> Sowmini> If yes, how will the recvmsg differentiate between the case
> Sowmini> (2000 byte message followed by 512 byte message) and
> Sowmini> (1472 byte message, 526 byte message, then 512 byte message),
> Sowmini> in other words, how are UDP message boundary semantics preserved?
>
> Willem> They aren't. This is purely an optimization to amortize the cost of
> Willem> repeated tx stack traversal. Unlike UFO, which would preserve the
> Willem> boundaries of the original larger than MTU datagram.
>
> As I understand Willem's explanation, if I do a sendmsg of 2000 bytes,
> - classic UDP will send 2 IP fragments, the first one with a full UDP
>   header, and the IP header indicating that this is the first frag for
>   that ipid, with more frags to follow. The second frag will have the
>   rest with the same ipid, it will not have a udp header,
>   and it will indicatet that it is the last frag (no more frags).
>
>   The receiver can thus use the ipid, "more-frags" bit, frag offset etc
>   to stitch the 2000 byte udp message together and pass it up on the udp
>   socket.
>
> - in the "GSO" proposal my 2000  bytes of data are sent as *two*
>   udp packets, each of them with a unique udp header, and uh_len set
>   to 1476 (for first) and 526 (for second). The receiver has no clue
>   that they are both part of the same UDP datagram, So wire format
>   is not the same, am I mistaken?

Eric is correct. If the application sets a segment size with UDP_SEGMENT
this is an instruction to the kernel to split the payload along that border into
separate discrete datagrams.

It does not matter what the behavior is without setting this option. If a
process wants to send a larger than MTU datagram and rely on the
kernel to fragment, then it should not set the option.
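
The split Willem describes can be illustrated with a short sketch, assuming
the merged interface (Linux >= 4.18) and UDP_SEGMENT == 103 from linux/udp.h;
this example is not part of the thread. With a 1472-byte segment size, a
2000-byte payload arrives as two discrete datagrams (1472 + 528):

```python
import socket

UDP_SEGMENT = getattr(socket, "UDP_SEGMENT", 103)  # assumed value, linux/udp.h

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(2)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())

# Instruct the kernel to split the payload at 1472-byte borders.
tx.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, 1472)
tx.send(b"x" * 2000)

# Two discrete datagrams: 1472 bytes, then the 528-byte remainder.
sizes = [len(rx.recv(65535)) for _ in range(2)]
assert sizes == [1472, 528]
```

Each datagram carries its own full UDP header, exactly as if the application
had issued two separate sends.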

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 12:31             ` Sowmini Varadhan
  2018-04-18 13:35               ` Eric Dumazet
@ 2018-04-18 13:59               ` Willem de Bruijn
  2018-04-18 14:28                 ` Willem de Bruijn
  2018-04-18 17:28               ` David Miller
  2 siblings, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-18 13:59 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Samudrala, Sridhar, Network Development, Willem de Bruijn

> One thing that was not clear to me about the API: shouldn't UDP_SEGMENT
> just be automatically determined in the stack from the pmtu? What's
> the motivation for the socket option for this? Also AIUI this can be
> either a per-socket or a per-packet option?

I decided to let the application explicitly set segment size, to avoid
bugs from the application assuming a different MTU from the one
used in the kernel for segmentation.

With path MTU, it is too easy for a process to incorrectly assume
link MTU or stale path MTU. With the current interface, if a process
tries to assemble segments larger than relevant path MTU, the
send call will fail.

A process may also explicitly want to send a chain of packets
smaller than MTU.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 13:59               ` Willem de Bruijn
@ 2018-04-18 14:28                 ` Willem de Bruijn
  0 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-18 14:28 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Samudrala, Sridhar, Network Development, Willem de Bruijn

On Wed, Apr 18, 2018 at 9:59 AM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>> One thing that was not clear to me about the API: shouldn't UDP_SEGMENT
>> just be automatically determined in the stack from the pmtu? Whats
>> the motivation for the socket option for this? also AIUI this can be
>> either a per-socket or a per-packet option?

I forgot to respond to the last point: yes, it is set either as a setsockopt or
passed as a cmsg for a given send call.

Especially when using unconnected sockets to communicate with many
clients, it is likely that this value will vary per call.
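
The per-call variant can be sketched as follows, again assuming the merged
interface (UDP_SEGMENT == 103 from linux/udp.h, Linux >= 4.18; this example
is not from the thread). The segment size is passed as a u16 ancillary
message on one send, rather than fixed on the socket:

```python
import socket
import struct

UDP_SEGMENT = 103  # assumed value from linux/udp.h

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(2)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())

# Per-call segment size: pass UDP_SEGMENT as a u16 cmsg on this send only,
# instead of fixing it on the socket with setsockopt.
cmsg = [(socket.IPPROTO_UDP, UDP_SEGMENT, struct.pack("H", 500))]
tx.sendmsg([b"z" * 1200], cmsg)

sizes = [len(rx.recv(65535)) for _ in range(3)]
assert sizes == [500, 500, 200]
```

A server on an unconnected socket could build a different cmsg per
destination, matching the per-client value Willem mentions.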

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 13:51                   ` Willem de Bruijn
@ 2018-04-18 15:08                     ` Samudrala, Sridhar
  2018-04-18 17:40                     ` David Miller
  1 sibling, 0 replies; 52+ messages in thread
From: Samudrala, Sridhar @ 2018-04-18 15:08 UTC (permalink / raw)
  To: Willem de Bruijn, Sowmini Varadhan
  Cc: Eric Dumazet, Network Development, Willem de Bruijn

On 4/18/2018 6:51 AM, Willem de Bruijn wrote:
> On Wed, Apr 18, 2018 at 9:47 AM, Sowmini Varadhan
> <sowmini.varadhan@oracle.com> wrote:
>> On (04/18/18 06:35), Eric Dumazet wrote:
>>> There is no change at all.
>>>
>>> This will only be used as a mechanism to send X packets of same size.
>>>
>>> So instead of X system calls , one system call.
>>>
>>> One traversal of some expensive part of the host stack.
>>>
>>> The content on the wire should be the same.
>> I'm sorry that's not how I interpret Willem's email below
>> (and maybe I misunderstood)
>>
>> the following taken from https://www.spinics.net/lists/netdev/msg496150.html
>>
>> Sowmini> If yes, how will the recvmsg differentiate between the case
>> Sowmini> (2000 byte message followed by 512 byte message) and
>> Sowmini> (1472 byte message, 526 byte message, then 512 byte message),
>> Sowmini> in other words, how are UDP message boundary semantics preserved?
>>
>> Willem> They aren't. This is purely an optimization to amortize the cost of
>> Willem> repeated tx stack traversal. Unlike UFO, which would preserve the
>> Willem> boundaries of the original larger than MTU datagram.
>>
>> As I understand Willem's explanation, if I do a sendmsg of 2000 bytes,
>> - classic UDP will send 2 IP fragments, the first one with a full UDP
>>    header, and the IP header indicating that this is the first frag for
>>    that ipid, with more frags to follow. The second frag will have the
>>    rest with the same ipid, it will not have a udp header,
>>    and it will indicatet that it is the last frag (no more frags).
>>
>>    The receiver can thus use the ipid, "more-frags" bit, frag offset etc
>>    to stitch the 2000 byte udp message together and pass it up on the udp
>>    socket.
>>
>> - in the "GSO" proposal my 2000  bytes of data are sent as *two*
>>    udp packets, each of them with a unique udp header, and uh_len set
>>    to 1476 (for first) and 526 (for second). The receiver has no clue
>>    that they are both part of the same UDP datagram, So wire format
>>    is not the same, am I mistaken?
> Eric is correct. If the application sets a segment size with UDP_SEGMENT
> this is an instruction to the kernel to split the payload along that border into
> separate discrete datagrams.

OK. So the sender app is passing the message boundary info to the kernel via the socket
option and letting the kernel split the large payload into multiple UDP segments.


>
> It does not matter what the behavior is without setting this option. If a
> process wants to send a larger than MTU datagram and rely on the
> kernel to fragment, then it should not set the option.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 11:17 ` Paolo Abeni
  2018-04-18 13:49   ` Willem de Bruijn
@ 2018-04-18 17:24   ` David Miller
  1 sibling, 0 replies; 52+ messages in thread
From: David Miller @ 2018-04-18 17:24 UTC (permalink / raw)
  To: pabeni; +Cc: willemdebruijn.kernel, netdev, willemb

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 18 Apr 2018 13:17:54 +0200

> When testing with Spectre/Meltdown mitigation in places, I expect
> that the most relevant part of the gain is due to the single syscall
> per burst.

I was going to say exactly this.

Batching schemes that were borderline beneficial before spectre/meltdown
are more worthwhile now.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 12:31             ` Sowmini Varadhan
  2018-04-18 13:35               ` Eric Dumazet
  2018-04-18 13:59               ` Willem de Bruijn
@ 2018-04-18 17:28               ` David Miller
  2018-04-18 18:12                 ` Alexander Duyck
  2 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2018-04-18 17:28 UTC (permalink / raw)
  To: sowmini.varadhan
  Cc: willemdebruijn.kernel, sridhar.samudrala, netdev, willemb

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 18 Apr 2018 08:31:03 -0400

> However, I share Sridhar's concerns about the very fundamental change
> to UDP message boundary semantics here.  There is actually no such thing
> as a "segment" in udp, so in general this feature makes me a little
> uneasy.  Well behaved udp applications should already be sending mtu
> sized datagrams. And the not-so-well-behaved ones are probably relying
> on IP fragmentation/reassembly to take care of datagram boundary semantics
> for them?
> 
> As Sridhar points out, the feature is not really "negotiated" - one side
> unilaterally sets the option. If the receiver is a classic/POSIX UDP
> implementation, it will have no way of knowing that message boundaries
> have been re-adjusted at the sender.  

There are no "semantics".

What ends up on the wire is the same before the kernel/app changes as
afterwards.

The only difference is that instead of the application doing N - 1
sendmsg() calls with mtu sized writes, it's giving everything all at
once and asking the kernel to segment.

It even gives the application control over the size of the packets,
which I think is completely prudent in this situation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 13:47                 ` Sowmini Varadhan
  2018-04-18 13:51                   ` Willem de Bruijn
@ 2018-04-18 17:34                   ` David Miller
  1 sibling, 0 replies; 52+ messages in thread
From: David Miller @ 2018-04-18 17:34 UTC (permalink / raw)
  To: sowmini.varadhan
  Cc: eric.dumazet, willemdebruijn.kernel, sridhar.samudrala, netdev, willemb

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 18 Apr 2018 09:47:06 -0400

> - in the "GSO" proposal my 2000  bytes of data are sent as *two*
>   udp packets, each of them with a unique udp header, and uh_len set
>   to 1476 (for first) and 526 (for second). The receiver has no clue
>   that they are both part of the same UDP datagram, So wire format
>   is not the same, am I mistaken?

The sender is asking for two packets and this exact packetization,
that's why they explicitly set the socket option or provide the
cmsg value.

As Eric said, this is just exchanging N sendmsg() calls with 1.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 13:51                   ` Willem de Bruijn
  2018-04-18 15:08                     ` Samudrala, Sridhar
@ 2018-04-18 17:40                     ` David Miller
  1 sibling, 0 replies; 52+ messages in thread
From: David Miller @ 2018-04-18 17:40 UTC (permalink / raw)
  To: willemdebruijn.kernel
  Cc: sowmini.varadhan, eric.dumazet, sridhar.samudrala, netdev, willemb

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 18 Apr 2018 09:51:50 -0400

> Eric is correct. If the application sets a segment size with UDP_SEGMENT
> this is an instruction to the kernel to split the payload along that border into
> separate discrete datagrams.
> 
> It does not matter what the behavior is without setting this option. If a
> process wants to send a larger than MTU datagram and rely on the
> kernel to fragment, then it should not set the option.

+1

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
                   ` (10 preceding siblings ...)
  2018-04-18 11:17 ` Paolo Abeni
@ 2018-04-18 17:50 ` David Miller
  2018-04-18 18:12   ` Willem de Bruijn
  11 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2018-04-18 17:50 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Tue, 17 Apr 2018 16:00:50 -0400

> Segmentation offload reduces cycles/byte for large packets by
> amortizing the cost of protocol stack traversal.
> 
> This patchset implements GSO for UDP.

This looks great.

And as mentioned in other emails, the interface looks good too by letting
the app choose the segmentation point.

It's a real shame we lose checksumming with MSG_MORE, perhaps we can find
a way to lift that restriction?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 17:28               ` David Miller
@ 2018-04-18 18:12                 ` Alexander Duyck
  2018-04-18 18:22                   ` Willem de Bruijn
                                     ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Alexander Duyck @ 2018-04-18 18:12 UTC (permalink / raw)
  To: David Miller
  Cc: Sowmini Varadhan, Willem de Bruijn, Samudrala, Sridhar, Netdev,
	Willem de Bruijn

On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net> wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Wed, 18 Apr 2018 08:31:03 -0400
>
>> However, I share Sridhar's concerns about the very fundamental change
>> to UDP message boundary semantics here.  There is actually no such thing
>> as a "segment" in udp, so in general this feature makes me a little
>> uneasy.  Well behaved udp applications should already be sending mtu
>> sized datagrams. And the not-so-well-behaved ones are probably relying
>> on IP fragmentation/reassembly to take care of datagram boundary semantics
>> for them?
>>
>> As Sridhar points out, the feature is not really "negotiated" - one side
>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>> implementation, it will have no way of knowing that message boundaries
>> have been re-adjusted at the sender.
>
> There are no "semantics".
>
> What ends up on the wire is the same before the kernel/app changes as
> afterwards.
>
> The only difference is that instead of the application doing N - 1
> sendmsg() calls with mtu sized writes, it's giving everything all at
> once and asking the kernel to segment.
>
> It even gives the application control over the size of the packets,
> which I think is completely prudent in this situation.

My only concern with the patch set is verifying what mitigations are
in place so that we aren't trying to set an MSS size that results in a
frame larger than the MTU. I'm still digging through the code and trying
to grok it, but I figured I might just put the question out there to
make my reviewing easier.

Also any plans for HW offload support for this? I vaguely recall that
the igb and ixgbe parts had support for something like this in
hardware. I would have to double check to see what exactly is
supported.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 17:50 ` David Miller
@ 2018-04-18 18:12   ` Willem de Bruijn
  2018-04-19 17:45     ` David Miller
  0 siblings, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-18 18:12 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Willem de Bruijn

On Wed, Apr 18, 2018 at 1:50 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Tue, 17 Apr 2018 16:00:50 -0400
>
>> Segmentation offload reduces cycles/byte for large packets by
>> amortizing the cost of protocol stack traversal.
>>
>> This patchset implements GSO for UDP.
>
> This looks great.
>
> And as mentioned in other emails, the interface looks good too by letting
> the app choose the segmentation point.
>
> It's a real shame we lose checksumming with MSG_MORE, perhaps we can find
> a way to lift that restriction?

Actually, yes, I should be able to relax that constraint with
segmentation.

It is there in case a corked packet may grow to the point of
having to be fragmented. But segmentation already ensures
that its datagrams always fit in MTU.

It should be a simple refinement to

            !(flags & MSG_MORE) &&

in __ip_append_data.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 18:12                 ` Alexander Duyck
@ 2018-04-18 18:22                   ` Willem de Bruijn
  2018-04-20 17:38                     ` Alexander Duyck
  2018-04-18 19:33                   ` David Miller
  2018-04-20 18:27                   ` Tushar Dave
  2 siblings, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-18 18:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Sowmini Varadhan, Samudrala, Sridhar, Netdev,
	Willem de Bruijn

On Wed, Apr 18, 2018 at 2:12 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net> wrote:
>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> Date: Wed, 18 Apr 2018 08:31:03 -0400
>>
>>> However, I share Sridhar's concerns about the very fundamental change
>>> to UDP message boundary semantics here.  There is actually no such thing
>>> as a "segment" in udp, so in general this feature makes me a little
>>> uneasy.  Well behaved udp applications should already be sending mtu
>>> sized datagrams. And the not-so-well-behaved ones are probably relying
>>> on IP fragmentation/reassembly to take care of datagram boundary semantics
>>> for them?
>>>
>>> As Sridhar points out, the feature is not really "negotiated" - one side
>>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>>> implementation, it will have no way of knowing that message boundaries
>>> have been re-adjusted at the sender.
>>
>> There are no "semantics".
>>
>> What ends up on the wire is the same before the kernel/app changes as
>> afterwards.
>>
>> The only difference is that instead of the application doing N - 1
>> sendmsg() calls with mtu sized writes, it's giving everything all at
>> once and asking the kernel to segment.
>>
>> It even gives the application control over the size of the packets,
>> which I think is completely prudent in this situation.
>
> My only concern with the patch set is verifying what mitigations are
> in case so that we aren't trying to set an MSS size that results in a
> frame larger than MTU. I'm still digging through the code and trying
> to grok it, but I figured I might just put the question out there to
> may my reviewing easier.

This is checked in udp_send_skb in

                const int hlen = skb_network_header_len(skb) +
                                 sizeof(struct udphdr);

                if (hlen + cork->gso_size > cork->fragsize)
                        return -EINVAL;

At this point cork->fragsize will have been set in ip_setup_cork
based on device or path mtu:

        cork->fragsize = ip_sk_use_pmtu(sk) ?
                         dst_mtu(&rt->dst) : rt->dst.dev->mtu;

The feature bypasses the MTU sanity checks in __ip_append_data
by setting local variable mtu to a network layer max

        mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;

but the above should perform the same check, only later. The
check in udp_send_skb is simpler as it does not have to deal
with the case of fragmentation.

> Also any plans for HW offload support for this? I vaguely recall that
> the igb and ixgbe parts had support for something like this in
> hardware. I would have to double check to see what exactly is
> supported.

I hadn't given that much thought until the request yesterday to
expose the NETIF_F_GSO_UDP_L4 flag through ethtool. By
virtue of having only a single fixed segmentation length, it
appears reasonably straightforward to offload.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 18:12                 ` Alexander Duyck
  2018-04-18 18:22                   ` Willem de Bruijn
@ 2018-04-18 19:33                   ` David Miller
  2018-04-20 18:27                   ` Tushar Dave
  2 siblings, 0 replies; 52+ messages in thread
From: David Miller @ 2018-04-18 19:33 UTC (permalink / raw)
  To: alexander.duyck
  Cc: sowmini.varadhan, willemdebruijn.kernel, sridhar.samudrala,
	netdev, willemb

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Wed, 18 Apr 2018 11:12:06 -0700

> My only concern with the patch set is verifying what mitigations are
> in case so that we aren't trying to set an MSS size that results in a
> frame larger than MTU. I'm still digging through the code and trying
> to grok it, but I figured I might just put the question out there to
> may my reviewing easier.

It signals an error if a too large segment size is requested.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 18:12   ` Willem de Bruijn
@ 2018-04-19 17:45     ` David Miller
  0 siblings, 0 replies; 52+ messages in thread
From: David Miller @ 2018-04-19 17:45 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 18 Apr 2018 14:12:40 -0400

> Actually, yes, I should be able to relax that constraint with
> segmentation.
> 
> It is there in case a corked packet may grow to the point of
> having to be fragmented. But segmentation already ensures
> that its datagrams always fit in MTU.
> 
> Should be simple refinement to
> 
>             !(flags & MSG_MORE) &&
> 
> in __ip_append_data.

Right, if segmentation never sends greater than mtu then we can
relax the restriction in this way.

Great idea.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 18:22                   ` Willem de Bruijn
@ 2018-04-20 17:38                     ` Alexander Duyck
  2018-04-20 21:58                       ` Willem de Bruijn
  0 siblings, 1 reply; 52+ messages in thread
From: Alexander Duyck @ 2018-04-20 17:38 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Sowmini Varadhan, Samudrala, Sridhar, Netdev,
	Willem de Bruijn

On Wed, Apr 18, 2018 at 11:22 AM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Wed, Apr 18, 2018 at 2:12 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net> wrote:
>>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>>> Date: Wed, 18 Apr 2018 08:31:03 -0400
>>>
>>>> However, I share Sridhar's concerns about the very fundamental change
>>>> to UDP message boundary semantics here.  There is actually no such thing
>>>> as a "segment" in udp, so in general this feature makes me a little
>>>> uneasy.  Well behaved udp applications should already be sending mtu
>>>> sized datagrams. And the not-so-well-behaved ones are probably relying
>>>> on IP fragmentation/reassembly to take care of datagram boundary semantics
>>>> for them?
>>>>
>>>> As Sridhar points out, the feature is not really "negotiated" - one side
>>>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>>>> implementation, it will have no way of knowing that message boundaries
>>>> have been re-adjusted at the sender.
>>>
>>> There are no "semantics".
>>>
>>> What ends up on the wire is the same before the kernel/app changes as
>>> afterwards.
>>>
>>> The only difference is that instead of the application doing N - 1
>>> sendmsg() calls with mtu sized writes, it's giving everything all at
>>> once and asking the kernel to segment.
>>>
>>> It even gives the application control over the size of the packets,
>>> which I think is completely prudent in this situation.
>>
>> My only concern with the patch set is verifying what mitigations are
>> in case so that we aren't trying to set an MSS size that results in a
>> frame larger than MTU. I'm still digging through the code and trying
>> to grok it, but I figured I might just put the question out there to
>> may my reviewing easier.
>
> This is checked in udp_send_skb in
>
>                 const int hlen = skb_network_header_len(skb) +
>                                  sizeof(struct udphdr);
>
>                 if (hlen + cork->gso_size > cork->fragsize)
>                         return -EINVAL;
>
> At this point cork->fragsize will have been set in ip_setup_cork
> based on device or path mtu:
>
>         cork->fragsize = ip_sk_use_pmtu(sk) ?
>                          dst_mtu(&rt->dst) : rt->dst.dev->mtu;
>
> The feature bypasses the MTU sanity checks in __ip_append_data
> by setting local variable mtu to a network layer max
>
>         mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
>
> but the above should perform the same check, only later. The
> check in udp_send_skb is simpler as it does not have to deal
> with the case of fragmentation.
>
>> Also any plans for HW offload support for this? I vaguely recall that
>> the igb and ixgbe parts had support for something like this in
>> hardware. I would have to double check to see what exactly is
>> supported.
>
> I hadn't given that much thought until the request yesterday to
> expose the NETIF_F_GSO_UDP_L4 flag through ethtool. By
> virtue of having only a single fixed segmentation length, it
> appears reasonably straightforward to offload.

Actually I just got a chance to start on a review of things. Do we
need to use both GSO_UDP and GSO_UDP_L4? It might be better if we
could split these up and specifically call out GSO_UDP as UFO and
GSO_UDP_L4 as UDP segmentation.

My only concern is that I don't think devices would want to have to
try and advertise both GSO_UDP and GSO_UDP_L4 if they want to support
UDP segmentation only. Ideally we want to keep UFO separate from UDP
segmentation since they would be two different things.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 18:12                 ` Alexander Duyck
  2018-04-18 18:22                   ` Willem de Bruijn
  2018-04-18 19:33                   ` David Miller
@ 2018-04-20 18:27                   ` Tushar Dave
  2018-04-20 20:08                     ` Alexander Duyck
  2 siblings, 1 reply; 52+ messages in thread
From: Tushar Dave @ 2018-04-20 18:27 UTC (permalink / raw)
  To: Alexander Duyck, David Miller
  Cc: Sowmini Varadhan, Willem de Bruijn, Samudrala, Sridhar, Netdev,
	Willem de Bruijn



On 04/18/2018 11:12 AM, Alexander Duyck wrote:
> On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net> wrote:
>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> Date: Wed, 18 Apr 2018 08:31:03 -0400
>>
>>> However, I share Sridhar's concerns about the very fundamental change
>>> to UDP message boundary semantics here.  There is actually no such thing
>>> as a "segment" in udp, so in general this feature makes me a little
>>> uneasy.  Well behaved udp applications should already be sending mtu
>>> sized datagrams. And the not-so-well-behaved ones are probably relying
>>> on IP fragmentation/reassembly to take care of datagram boundary semantics
>>> for them?
>>>
>>> As Sridhar points out, the feature is not really "negotiated" - one side
>>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>>> implementation, it will have no way of knowing that message boundaries
>>> have been re-adjusted at the sender.
>>
>> There are no "semantics".
>>
>> What ends up on the wire is the same before the kernel/app changes as
>> afterwards.
>>
>> The only difference is that instead of the application doing N - 1
>> sendmsg() calls with mtu sized writes, it's giving everything all at
>> once and asking the kernel to segment.
>>
>> It even gives the application control over the size of the packets,
>> which I think is completely prudent in this situation.
> 
> My only concern with the patch set is verifying what mitigations are
> in place so that we aren't trying to set an MSS size that results in a
> frame larger than the MTU. I'm still digging through the code and trying
> to grok it, but I figured I might just put the question out there to
> make my reviewing easier.
> 
> Also any plans for HW offload support for this? I vaguely recall that
> the igb and ixgbe parts had support for something like this in
> hardware. I would have to double check to see what exactly is
> supported.

Alex,

If by HW support you meant UFO (UDP Fragmentation Offload), then I
dug into that last year using ixgbe. I found that Intel 10G HW does
break large UDP packets into MTU-sized pieces, but it does not generate
*true* IP fragments. Instead, when a large (> MTU) UDP packet is
given to the NIC, the HW generates separate complete UDP packets
rather than IP fragments. This makes it impossible for the receiving
station to reassemble them into one UDP packet.

I am not sure about igb!

-Tushar


> 
> Thanks.
> 
> - Alex
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-20 18:27                   ` Tushar Dave
@ 2018-04-20 20:08                     ` Alexander Duyck
  2018-04-21  3:11                       ` Tushar Dave
  0 siblings, 1 reply; 52+ messages in thread
From: Alexander Duyck @ 2018-04-20 20:08 UTC (permalink / raw)
  To: Tushar Dave
  Cc: David Miller, Sowmini Varadhan, Willem de Bruijn, Samudrala,
	Sridhar, Netdev, Willem de Bruijn

On Fri, Apr 20, 2018 at 11:27 AM, Tushar Dave <tushar.n.dave@oracle.com> wrote:
>
>
> On 04/18/2018 11:12 AM, Alexander Duyck wrote:
>>
>> On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net>
>> wrote:
>>>
>>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>>> Date: Wed, 18 Apr 2018 08:31:03 -0400
>>>
>>>> However, I share Sridhar's concerns about the very fundamental change
>>>> to UDP message boundary semantics here.  There is actually no such thing
>>>> as a "segment" in udp, so in general this feature makes me a little
>>>> uneasy.  Well behaved udp applications should already be sending mtu
>>>> sized datagrams. And the not-so-well-behaved ones are probably relying
>>>> on IP fragmentation/reassembly to take care of datagram boundary
>>>> semantics
>>>> for them?
>>>>
>>>> As Sridhar points out, the feature is not really "negotiated" - one side
>>>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>>>> implementation, it will have no way of knowing that message boundaries
>>>> have been re-adjusted at the sender.
>>>
>>>
>>> There are no "semantics".
>>>
>>> What ends up on the wire is the same before the kernel/app changes as
>>> afterwards.
>>>
>>> The only difference is that instead of the application doing N - 1
>>> sendmsg() calls with mtu sized writes, it's giving everything all at
>>> once and asking the kernel to segment.
>>>
>>> It even gives the application control over the size of the packets,
>>> which I think is completely prudent in this situation.
>>
>>
>> My only concern with the patch set is verifying what mitigations are
>> in place so that we aren't trying to set an MSS size that results in a
>> frame larger than the MTU. I'm still digging through the code and trying
>> to grok it, but I figured I might just put the question out there to
>> make my reviewing easier.
>>
>> Also any plans for HW offload support for this? I vaguely recall that
>> the igb and ixgbe parts had support for something like this in
>> hardware. I would have to double check to see what exactly is
>> supported.
>
>
> Alex,
>
> If by HW support you meant UFO (UDP Fragmentation Offload), then I
> dug into that last year using ixgbe. I found that Intel 10G HW does
> break large UDP packets into MTU-sized pieces, but it does not generate
> *true* IP fragments. Instead, when a large (> MTU) UDP packet is
> given to the NIC, the HW generates separate complete UDP packets
> rather than IP fragments. This makes it impossible for the receiving
> station to reassemble them into one UDP packet.
>
> I am not sure about igb!
>
> -Tushar

Tushar,

I am not sure you have been following this thread, but this is about
adding UDP GSO support, not fragmentation offload. With GSO support
the UDP frames are not expected to be reassembled; they are meant to be
handled as individual frames.

What you have described is why I am interested. This patch set adds
support for GSO segmentation, not fragmentation.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-20 17:38                     ` Alexander Duyck
@ 2018-04-20 21:58                       ` Willem de Bruijn
  2018-04-21  2:08                         ` Alexander Duyck
  0 siblings, 1 reply; 52+ messages in thread
From: Willem de Bruijn @ 2018-04-20 21:58 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Sowmini Varadhan, Samudrala, Sridhar, Netdev,
	Willem de Bruijn

>>> Also any plans for HW offload support for this? I vaguely recall that
>>> the igb and ixgbe parts had support for something like this in
>>> hardware. I would have to double check to see what exactly is
>>> supported.
>>
>> I hadn't given that much thought until the request yesterday to
>> expose the NETIF_F_GSO_UDP_L4 flag through ethtool. By
>> virtue of having only a single fixed segmentation length, it
>> appears reasonably straightforward to offload.
>
> Actually I just got a chance to start reviewing things. Do we really
> need to use both GSO_UDP and GSO_UDP_L4? It might be
> better if we could split these up and specifically call out GSO_UDP as
> UFO and GSO_UDP_L4 as UDP segmentation.

Thanks for taking a look, Alex.

Agreed, I'll revise that. My initial thought was that both types of gso
skbs need to take the same udp gso special-case branches in places like
act_csum and ovs. But on rereading, that seems an unsafe approach, as
some branches are fragmentation specific. I'll review them all and
instead add separate SKB_GSO_UDP_L4 cases where needed.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-20 21:58                       ` Willem de Bruijn
@ 2018-04-21  2:08                         ` Alexander Duyck
  0 siblings, 0 replies; 52+ messages in thread
From: Alexander Duyck @ 2018-04-21  2:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Sowmini Varadhan, Samudrala, Sridhar, Netdev,
	Willem de Bruijn

On Fri, Apr 20, 2018 at 2:58 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>>>> Also any plans for HW offload support for this? I vaguely recall that
>>>> the igb and ixgbe parts had support for something like this in
>>>> hardware. I would have to double check to see what exactly is
>>>> supported.
>>>
>>> I hadn't given that much thought until the request yesterday to
>>> expose the NETIF_F_GSO_UDP_L4 flag through ethtool. By
>>> virtue of having only a single fixed segmentation length, it
>>> appears reasonably straightforward to offload.
>>
>> Actually I just got a chance to start reviewing things. Do we really
>> need to use both GSO_UDP and GSO_UDP_L4? It might be
>> better if we could split these up and specifically call out GSO_UDP as
>> UFO and GSO_UDP_L4 as UDP segmentation.
>
> Thanks for taking a look, Alex.
>
> Agreed, I'll revise that. My initial thought was that both gso skbs need
> to take the same udp gso special case branches in places like act_csum
> and ovs. But on rereading that seems an unsafe approach, as some
> branches are fragmentation specific. I'll review them all and add
> separate SKB_GSO_UDP_L4 cases where needed, instead.

Sounds good. Keep me in the loop on these patches. I will see if we
can get support for ixgbe, ixgbevf, and igb to at least offload this
since it shouldn't be much of a lift to get that. Once we have that I
can also take a look and see if there would be much work needed to add
gso_partial support so we could deal with tunnel encapsulated UDP
segmentation offload support on those devices.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-20 20:08                     ` Alexander Duyck
@ 2018-04-21  3:11                       ` Tushar Dave
  0 siblings, 0 replies; 52+ messages in thread
From: Tushar Dave @ 2018-04-21  3:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Sowmini Varadhan, Willem de Bruijn, Samudrala,
	Sridhar, Netdev, Willem de Bruijn



On 04/20/2018 01:08 PM, Alexander Duyck wrote:
> On Fri, Apr 20, 2018 at 11:27 AM, Tushar Dave <tushar.n.dave@oracle.com> wrote:
>>
>>
>> On 04/18/2018 11:12 AM, Alexander Duyck wrote:
>>>
>>> On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net>
>>> wrote:
>>>>
>>>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>>>> Date: Wed, 18 Apr 2018 08:31:03 -0400
>>>>
>>>>> However, I share Sridhar's concerns about the very fundamental change
>>>>> to UDP message boundary semantics here.  There is actually no such thing
>>>>> as a "segment" in udp, so in general this feature makes me a little
>>>>> uneasy.  Well behaved udp applications should already be sending mtu
>>>>> sized datagrams. And the not-so-well-behaved ones are probably relying
>>>>> on IP fragmentation/reassembly to take care of datagram boundary
>>>>> semantics
>>>>> for them?
>>>>>
>>>>> As Sridhar points out, the feature is not really "negotiated" - one side
>>>>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>>>>> implementation, it will have no way of knowing that message boundaries
>>>>> have been re-adjusted at the sender.
>>>>
>>>>
>>>> There are no "semantics".
>>>>
>>>> What ends up on the wire is the same before the kernel/app changes as
>>>> afterwards.
>>>>
>>>> The only difference is that instead of the application doing N - 1
>>>> sendmsg() calls with mtu sized writes, it's giving everything all at
>>>> once and asking the kernel to segment.
>>>>
>>>> It even gives the application control over the size of the packets,
>>>> which I think is completely prudent in this situation.
>>>
>>>
>>> My only concern with the patch set is verifying what mitigations are
>>> in place so that we aren't trying to set an MSS size that results in a
>>> frame larger than the MTU. I'm still digging through the code and trying
>>> to grok it, but I figured I might just put the question out there to
>>> make my reviewing easier.
>>>
>>> Also any plans for HW offload support for this? I vaguely recall that
>>> the igb and ixgbe parts had support for something like this in
>>> hardware. I would have to double check to see what exactly is
>>> supported.
>>
>>
>> Alex,
>>
>> If by HW support you meant UFO (UDP Fragmentation Offload), then I
>> dug into that last year using ixgbe. I found that Intel 10G HW does
>> break large UDP packets into MTU-sized pieces, but it does not generate
>> *true* IP fragments. Instead, when a large (> MTU) UDP packet is
>> given to the NIC, the HW generates separate complete UDP packets
>> rather than IP fragments. This makes it impossible for the receiving
>> station to reassemble them into one UDP packet.
>>
>> I am not sure about igb!
>>
>> -Tushar
> 
> Tushar,
> 
> I am not sure you have been following this thread, but this is about
> adding UDP GSO support, not fragmentation offload. With GSO support
> the UDP frames are not expected to be reassembled they are meant to be
> handled as individual frames.
> 
> What you have described is why I am interested. This patch set adds
> support for GSO segmentation, not fragmentation.
I see. Never mind.
Thanks.

-Tushar
> 
> Thanks.
> 
> - Alex
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-18 13:49   ` Willem de Bruijn
@ 2018-05-24  0:02     ` Marcelo Ricardo Leitner
  2018-05-24  1:15       ` Willem de Bruijn
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-05-24  0:02 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Paolo Abeni, Network Development, Willem de Bruijn

On Wed, Apr 18, 2018 at 09:49:18AM -0400, Willem de Bruijn wrote:
> I just hacked up a sendmmsg extension to the benchmark to verify.
> Indeed that does not have nearly the same benefit as GSO:
> 
> udp tx:    976 MB/s   695394 calls/s  16557 msg/s
> 
> This matches the numbers seen from TCP without TSO and GSO.
> That also has few system calls, but observes per MTU stack traversal.

Reviving this old thread because it's the only place I saw sendmmsg
being mentioned.

sendmmsg shouldn't be considered an alternative, but rather a
complement. Then, instead of the application building one large request
and asking the stack to segment it, it could simply build the
sendmmsg request and the stack would group the mmsgs into a gso skb. That
seems more natural to the application. But then, both (sendmmsg and
the option to segment) are Linux-specific...

For that we need sendmmsg to do something smarter than doing several
sendmsg calls, yes.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-05-24  0:02     ` Marcelo Ricardo Leitner
@ 2018-05-24  1:15       ` Willem de Bruijn
  0 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-05-24  1:15 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Paolo Abeni, Network Development, Willem de Bruijn

On Wed, May 23, 2018 at 8:02 PM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Wed, Apr 18, 2018 at 09:49:18AM -0400, Willem de Bruijn wrote:
>> I just hacked up a sendmmsg extension to the benchmark to verify.
>> Indeed that does not have nearly the same benefit as GSO:
>>
>> udp tx:    976 MB/s   695394 calls/s  16557 msg/s
>>
>> This matches the numbers seen from TCP without TSO and GSO.
>> That also has few system calls, but observes per MTU stack traversal.
>
> Reviving this old thread because it's the only place I saw sendmmsg
> being mentioned.
>
> sendmmsg shouldn't be considered as an alternative, but rather as a
> complement. Then instead of the application building one large request
> and request the stack to fragment it, it could simply build the
> sendmmsg request and the stack would group the mmsg into a gso skb. It
> seems more natural to the application. But well, both (sendmmsg and
> the option to fragment) are Linux-specific..
>
> For that we need sendmmsg to do something smarter than doing several
> sendmsg calls, yes.

I agree. See also my original point:

"An alternative implementation that would allow non-uniform
  segment length is to use GSO_BY_FRAGS like SCTP. This would
  likely require MSG_MORE to build the list using multiple
  send calls (or one sendmmsg). The two approaches are not
  mutually-exclusive, so that could be a follow-up."

Clear advantages of GSO_BY_FRAGS are that segments do
not have to be of equal length and that converting existing users
of sendmmsg is trivial.

On the other hand, this is less likely to be offloaded to hardware,
as it requires non-constant metadata in the descriptor.

Both cases also potentially apply to the GRO path to allow for
efficient forwarding. And to the udp socket rx layer to allow for
queuing batches of datagrams at a time, then carving off one
gso_size per recvmsg.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-04-17 21:07       ` Willem de Bruijn
  2018-04-18  2:25         ` Samudrala, Sridhar
@ 2018-08-31  9:09         ` Paolo Abeni
  2018-08-31 10:09           ` Eric Dumazet
  2018-08-31 13:08           ` Willem de Bruijn
  1 sibling, 2 replies; 52+ messages in thread
From: Paolo Abeni @ 2018-08-31  9:09 UTC (permalink / raw)
  To: Willem de Bruijn, Sowmini Varadhan; +Cc: Network Development, Willem de Bruijn

Hi,

On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> That said, for negotiated flows an inverse GRO feature could
> conceivably be implemented to reduce rx stack traversal, too.
> Though due to interleaving of packets on the wire, the aggregation
> would be best effort, similar to TCP TSO and GRO using the
> PSH bit as packetization signal.

Reviving this old thread before I forget again. I have some local
patches implementing UDP GRO as the dual of the current GSO_UDP_L4
implementation: several datagrams with the same length are aggregated
into a single one, and user space receives a single larger packet
instead of multiple ones. I hope QUIC can leverage such a scenario, but I
really know nothing about the protocol.

I measure roughly a 50% performance improvement with udpgso_bench with
respect to UDP GSO, ~100% using a pktgen sender, and reduced CPU
usage on the receiver[1]. Some additional hacking on the general GRO
bits is required to avoid useless socket lookups for ingress UDP
packets when UDP_GSO is not enabled.

If there is interest in this topic, I can share some RFC patches
(hopefully sometime next week).

Cheers,

Paolo

[1] With udpgso_bench_tx, the bottleneck is again the sender, even
with GSO enabled. With a pktgen sender, the bottleneck becomes the rx
softirqd, and I see a lot of time consumed due to retpolines in the GRO
code. In both scenarios skb_release_data() becomes the topmost perf
offender for the user space process.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-08-31  9:09         ` Paolo Abeni
@ 2018-08-31 10:09           ` Eric Dumazet
  2018-08-31 13:08           ` Willem de Bruijn
  1 sibling, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2018-08-31 10:09 UTC (permalink / raw)
  To: Paolo Abeni, Willem de Bruijn, Sowmini Varadhan
  Cc: Network Development, Willem de Bruijn



On 08/31/2018 02:09 AM, Paolo Abeni wrote:
> I hope quic can leverage such scenario, but I
> really know nothing about the protocol.
>

Most QUIC receivers are mobile phones, laptops, with wifi without GRO anyway...

Even if they had GRO, the inter-packet delay would be too high for GRO to be successful.

(vast majority of QUIC flows are < 100 Mbits because of the last mile hop)

GSO UDP is used on servers with clear gains, but there are not
really high speed receivers where GRO could be used.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-08-31  9:09         ` Paolo Abeni
  2018-08-31 10:09           ` Eric Dumazet
@ 2018-08-31 13:08           ` Willem de Bruijn
  2018-08-31 13:44             ` Paolo Abeni
  2018-09-03  8:02             ` Steffen Klassert
  1 sibling, 2 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-08-31 13:08 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Sowmini Varadhan, Network Development, Willem de Bruijn

On Fri, Aug 31, 2018 at 5:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> > That said, for negotiated flows an inverse GRO feature could
> > conceivably be implemented to reduce rx stack traversal, too.
> > Though due to interleaving of packets on the wire, the aggregation
> > would be best effort, similar to TCP TSO and GRO using the
> > PSH bit as packetization signal.
>
> Reviving this old thread before I forget again. I have some local
> patches implementing UDP GRO as the dual of the current GSO_UDP_L4
> implementation: several datagrams with the same length are aggregated
> into a single one, and user space receives a single larger packet
> instead of multiple ones. I hope QUIC can leverage such a scenario, but I
> really know nothing about the protocol.
>
> I measure roughly a 50% performance improvement with udpgso_bench with
> respect to UDP GSO, ~100% using a pktgen sender, and reduced CPU
> usage on the receiver[1]. Some additional hacking on the general GRO
> bits is required to avoid useless socket lookups for ingress UDP
> packets when UDP_GSO is not enabled.
>
> If there is interest in this topic, I can share some RFC patches
> (hopefully sometime next week).

As Eric pointed out, QUIC reception on mobile clients over the WAN
may not see much gain. But apparently there is a non-trivial amount
of traffic the other way, to servers. Again, WAN might limit whatever
gain we get, but I do want to look into that. And there are other UDP high
throughput workloads (with or without QUIC) between servers.

If you have patches, please do share them. I actually also have a rough
patch that I did not consider ready to share yet. It is based on Tom's
existing socket lookup in udp_gro_receive to detect whether a local
destination exists and whether it has set an option to support receiving
coalesced payloads (along with a cmsg to share the segment size).

Converting udp_recvmsg to split apart gso packets to make this
transparent seems to me to be too complex and not worth the effort.

If a local socket is not found in udp_gro_receive, this could also be
tentative interpreted as a non-local path (with false positives), enabling
transparent use of GRO + GSO batching on the forwarding path.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-08-31 13:08           ` Willem de Bruijn
@ 2018-08-31 13:44             ` Paolo Abeni
  2018-08-31 15:11               ` Willem de Bruijn
  2018-09-03  8:02             ` Steffen Klassert
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Abeni @ 2018-08-31 13:44 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Sowmini Varadhan, Network Development, Willem de Bruijn

On Fri, 2018-08-31 at 09:08 -0400, Willem de Bruijn wrote:
> On Fri, Aug 31, 2018 at 5:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > Hi,
> > 
> > On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> > > That said, for negotiated flows an inverse GRO feature could
> > > conceivably be implemented to reduce rx stack traversal, too.
> > > Though due to interleaving of packets on the wire, the aggregation
> > > would be best effort, similar to TCP TSO and GRO using the
> > > PSH bit as packetization signal.
> > 
> > Reviving this old thread before I forget again. I have some local
> > patches implementing UDP GRO as the dual of the current GSO_UDP_L4
> > implementation: several datagrams with the same length are aggregated
> > into a single one, and user space receives a single larger packet
> > instead of multiple ones. I hope QUIC can leverage such a scenario, but I
> > really know nothing about the protocol.
> >
> > I measure roughly a 50% performance improvement with udpgso_bench with
> > respect to UDP GSO, ~100% using a pktgen sender, and reduced CPU
> > usage on the receiver[1]. Some additional hacking on the general GRO
> > bits is required to avoid useless socket lookups for ingress UDP
> > packets when UDP_GSO is not enabled.
> >
> > If there is interest in this topic, I can share some RFC patches
> > (hopefully sometime next week).
> 
> As Eric pointed out, QUIC reception on mobile clients over the WAN
> may not see much gain. But apparently there is a non-trivial amount
> of traffic the other way, to servers. Again, WAN might limit whatever
> gain we get, but I do want to look into that. And there are other UDP high
> throughput workloads (with or without QUIC) between servers.
> 
> If you have patches, please do share them. 

I'll try to clean them up and send them next week (as RFC).

> I actually also have a rough
> patch that I did not consider ready to share yet. Based on Tom's existing
> socket lookup in udp_gro_receive to detect whether a local destination
> exists and whether it has set an option to support receiving coalesced
> payloads (along with a cmsg to share the segment size).

That is more or less what I'm doing here.
Side note: I had to test it on bare metal, as veth/lo do not trigger the
GRO path; selftesting this feature is not so straightforward.

> Converting udp_recvmsg to split apart gso packets to make this
> transparent seems to me to be too complex and not worth the effort.

Agreed. Moreover, doing many small recvmsg() calls instead of a single
large one will hit performance very badly due to PTI and
HARDENED_USERCOPY.

> If a local socket is not found in udp_gro_receive, this could also be
> tentative interpreted as a non-local path (with false positives), enabling
> transparent use of GRO + GSO batching on the forwarding path.

That sounds interesting, even if false positives look dangerous to me.
Just to be on the same page, which false positive examples are you
thinking of? UDP sockets bound to a local address behind NAT?

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-08-31 13:44             ` Paolo Abeni
@ 2018-08-31 15:11               ` Willem de Bruijn
  0 siblings, 0 replies; 52+ messages in thread
From: Willem de Bruijn @ 2018-08-31 15:11 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Sowmini Varadhan, Network Development, Willem de Bruijn

On Fri, Aug 31, 2018 at 9:44 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2018-08-31 at 09:08 -0400, Willem de Bruijn wrote:
> > On Fri, Aug 31, 2018 at 5:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> > > > That said, for negotiated flows an inverse GRO feature could
> > > > conceivably be implemented to reduce rx stack traversal, too.
> > > > Though due to interleaving of packets on the wire, the aggregation
> > > > would be best effort, similar to TCP TSO and GRO using the
> > > > PSH bit as packetization signal.
> > >
> > > Reviving this old thread before I forget again. I have some local
> > > patches implementing UDP GRO as the dual of the current GSO_UDP_L4
> > > implementation: several datagrams with the same length are aggregated
> > > into a single one, and user space receives a single larger packet
> > > instead of multiple ones. I hope QUIC can leverage such a scenario, but I
> > > really know nothing about the protocol.
> > >
> > > I measure roughly a 50% performance improvement with udpgso_bench with
> > > respect to UDP GSO, ~100% using a pktgen sender, and reduced CPU
> > > usage on the receiver[1]. Some additional hacking on the general GRO
> > > bits is required to avoid useless socket lookups for ingress UDP
> > > packets when UDP_GSO is not enabled.
> > >
> > > If there is interest in this topic, I can share some RFC patches
> > > (hopefully sometime next week).
> >
> > As Eric pointed out, QUIC reception on mobile clients over the WAN
> > may not see much gain. But apparently there is a non-trivial amount
> > of traffic the other way, to servers. Again, WAN might limit whatever
> > gain we get, but I do want to look into that. And there are other UDP high
> > throughput workloads (with or without QUIC) between servers.
> >
> > If you have patches, please do share them.
>
> I'll try to clean them up and send them next week (as RFC).
>
> > I actually also have a rough
> > patch that I did not consider ready to share yet. Based on Tom's existing
> > socket lookup in udp_gro_receive to detect whether a local destination
> > exists and whether it has set an option to support receiving coalesced
> > payloads (along with a cmsg to share the segment size).
>
> That is more or less what I'm doing here.
> Side note: I had to test it on bare metal, as veth/lo do not trigger the
> GRO path; selftesting this feature is not so straightforward.
>
> > Converting udp_recvmsg to split apart gso packets to make this
> > transparent seems to me to be too complex and not worth the effort.
>
> Agreed. Moreover, doing many small recvmsg() calls instead of a single
> large one will hit performance very badly due to PTI and
> HARDENED_USERCOPY.
>
> > If a local socket is not found in udp_gro_receive, this could also be
> > tentative interpreted as a non-local path (with false positives), enabling
> > transparent use of GRO + GSO batching on the forwarding path.
>
> That sounds interesting, even if false positives look dangerous to me.
> Just to be on the same page, which false positive examples are you
> thinking of? UDP sockets bound to a local address behind NAT?

Any packets that would otherwise get dropped, such as packets with a
local destination but no locally bound socket. This may increase the
number of cycles spent on such packets, potentially increasing a DoS
vector.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-08-31 13:08           ` Willem de Bruijn
  2018-08-31 13:44             ` Paolo Abeni
@ 2018-09-03  8:02             ` Steffen Klassert
  2018-09-03 11:45               ` Sowmini Varadhan
  1 sibling, 1 reply; 52+ messages in thread
From: Steffen Klassert @ 2018-09-03  8:02 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Paolo Abeni, Sowmini Varadhan, Network Development, Willem de Bruijn

On Fri, Aug 31, 2018 at 09:08:59AM -0400, Willem de Bruijn wrote:
> On Fri, Aug 31, 2018 at 5:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > Hi,
> >
> > On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> > > That said, for negotiated flows an inverse GRO feature could
> > > conceivably be implemented to reduce rx stack traversal, too.
> > > Though due to interleaving of packets on the wire, the aggregation
> > > would be best effort, similar to TCP TSO and GRO using the
> > > PSH bit as packetization signal.
> >
> > Reviving this old thread before I forget again. I have some local
> > patches implementing UDP GRO as the dual of the current GSO_UDP_L4
> > implementation: several datagrams with the same length are aggregated
> > into a single one, and user space receives a single larger packet
> > instead of multiple ones. I hope QUIC can leverage such a scenario, but I
> > really know nothing about the protocol.
> >
> > I measure roughly a 50% performance improvement with udpgso_bench with
> > respect to UDP GSO, ~100% using a pktgen sender, and reduced CPU
> > usage on the receiver[1]. Some additional hacking on the general GRO
> > bits is required to avoid useless socket lookups for ingress UDP
> > packets when UDP_GSO is not enabled.
> >
> > If there is interest in this topic, I can share some RFC patches
> > (hopefully sometime next week).
> 
> As Eric pointed out, QUIC reception on mobile clients over the WAN
> may not see much gain. But apparently there is a non-trivial amount
> of traffic the other way, to servers. Again, WAN might limit whatever
> gain we get, but I do want to look into that. And there are other UDP high
> throughput workloads (with or without QUIC) between servers.
> 
> If you have patches, please do share them. I actually also have a rough
> patch that I did not consider ready to share yet. It is based on Tom's
> existing socket lookup in udp_gro_receive to detect whether a local
> destination exists and whether it has set an option to support receiving
> coalesced payloads (along with a cmsg to share the segment size).
> 
> Converting udp_recvmsg to split apart gso packets to make this
> transparent seems to me to be too complex and not worth the effort.
> 
> If a local socket is not found in udp_gro_receive, this could also be
> tentatively interpreted as a non-local path (with false positives), enabling
> transparent use of GRO + GSO batching on the forwarding path.

On forwarding, it might be even better to build packet lists instead
of merging the payload. This would save the cost of the merge at GRO
and the split at GSO (at least when the split happens in software).

I'm working on patches that build such skb lists. The list is chained
off the frag_list pointer of the first skb; all subsequent skbs are linked
through their next pointers. It looks like this:

|---------|        |---------|        |---------|
|flow 1   |        |flow 1   |        |flow 1   |
|---------|        |---------|        |---------|
|frag_list|<-\     |frag_list|        |frag_list|
|---------|   \    |---------|        |---------|
|next     |    \---|next     |<-------|next     |
|---------|        |---------|        |---------|

Furthermore, I'm trying to integrate this with the new skb listification
work. Instead of just linking the packets into lists as they arrive,
I try to sort them into flows, so that packets of the same flow can
travel together and lookups etc. are done just once, for one representative
skb of a given flow.

Here I build the packet lists like this:


|---------|        |---------|        |---------|
|flow 1   |        |flow 1   |        |flow 1   |
|---------|        |---------|        |---------|
|frag_list|<-\     |frag_list|        |frag_list|
|---------|   \    |---------|        |---------|
|next     |<-\ \---|next     |<-------|next     |
|---------|   \    |---------|        |---------|
              |
              |
              |    |---------|        |---------|        |---------|
              |    |flow 2   |        |flow 2   |        |flow 2   |
              |    |---------|        |---------|        |---------|
              |    |frag_list|<-\     |frag_list|        |frag_list|
              |    |---------|   \    |---------|        |---------|
              |----|next     |<-\ \---|next     |<-------|next     |
                   |---------|   \    |---------|        |---------|
                                 |
                                 |
                                 |    |---------|        |---------|       |---------|
                                 |    |flow 3   |        |flow 3   |       |flow 3   |
                                 |    |---------|        |---------|       |---------|
                                 |    |frag_list|<-\     |frag_list|       |frag_list|
                                 |    |---------|   \    |---------|       |---------|
                                 |----|next     |    \---|next     |<------|next     |
                                      |---------|        |---------|       |---------|

I've tried to plumb it through to the forwarding/output path
and to the UDP receive queue. I have a basic working version,
but there are still unresolved things (UDP encapsulation etc.).
So for now it is just my playground, but I thought it might
make sense to share the idea here...

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC net-next 00/11] udp gso
  2018-09-03  8:02             ` Steffen Klassert
@ 2018-09-03 11:45               ` Sowmini Varadhan
  0 siblings, 0 replies; 52+ messages in thread
From: Sowmini Varadhan @ 2018-09-03 11:45 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Willem de Bruijn, Paolo Abeni, Network Development,
	Willem de Bruijn, alexander.h.duyck

On (09/03/18 10:02), Steffen Klassert wrote:
> I'm working on patches that builds such skb lists. The list is chained
> at the frag_list pointer of the first skb, all subsequent skbs are linked
> to the next pointer of the skb. It looks like this:

There are some risks to using the frag_list pointer; Alex Duyck
pointed these out to me in
  https://www.mail-archive.com/netdev@vger.kernel.org/msg131081.html
(see the last paragraph, or search for the string "gotcha" there).

I don't know the details of your playground patch, but you might want to
watch out for those issues to make sure it is immune to them.

--Sowmini


end of thread, other threads:[~2018-09-03 16:05 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-17 20:00 [PATCH RFC net-next 00/11] udp gso Willem de Bruijn
2018-04-17 20:00 ` [PATCH RFC net-next 01/11] udp: expose inet cork to udp Willem de Bruijn
2018-04-17 20:00 ` [PATCH RFC net-next 02/11] udp: add gso Willem de Bruijn
2018-04-17 20:00 ` [PATCH RFC net-next 03/11] udp: better wmem accounting on gso Willem de Bruijn
2018-04-17 20:00 ` [PATCH RFC net-next 04/11] udp: paged allocation with gso Willem de Bruijn
2018-04-17 20:00 ` [PATCH RFC net-next 05/11] udp: add gso segment cmsg Willem de Bruijn
2018-04-17 20:00 ` [PATCH RFC net-next 06/11] udp: add gso support to virtual devices Willem de Bruijn
2018-04-18  0:43   ` Dimitris Michailidis
2018-04-18  3:27     ` Willem de Bruijn
2018-04-17 20:00 ` [PATCH RFC net-next 07/11] udp: zerocopy Willem de Bruijn
2018-04-17 20:00 ` [PATCH RFC net-next 08/11] selftests: udp gso Willem de Bruijn
2018-04-17 20:00 ` [PATCH RFC net-next 09/11] selftests: udp gso with connected sockets Willem de Bruijn
2018-04-17 20:15 ` [PATCH RFC net-next 00/11] udp gso Sowmini Varadhan
2018-04-17 20:23   ` Willem de Bruijn
2018-04-17 20:48     ` Sowmini Varadhan
2018-04-17 21:07       ` Willem de Bruijn
2018-04-18  2:25         ` Samudrala, Sridhar
2018-04-18  3:33           ` Willem de Bruijn
2018-04-18 12:31             ` Sowmini Varadhan
2018-04-18 13:35               ` Eric Dumazet
2018-04-18 13:47                 ` Sowmini Varadhan
2018-04-18 13:51                   ` Willem de Bruijn
2018-04-18 15:08                     ` Samudrala, Sridhar
2018-04-18 17:40                     ` David Miller
2018-04-18 17:34                   ` David Miller
2018-04-18 13:59               ` Willem de Bruijn
2018-04-18 14:28                 ` Willem de Bruijn
2018-04-18 17:28               ` David Miller
2018-04-18 18:12                 ` Alexander Duyck
2018-04-18 18:22                   ` Willem de Bruijn
2018-04-20 17:38                     ` Alexander Duyck
2018-04-20 21:58                       ` Willem de Bruijn
2018-04-21  2:08                         ` Alexander Duyck
2018-04-18 19:33                   ` David Miller
2018-04-20 18:27                   ` Tushar Dave
2018-04-20 20:08                     ` Alexander Duyck
2018-04-21  3:11                       ` Tushar Dave
2018-08-31  9:09         ` Paolo Abeni
2018-08-31 10:09           ` Eric Dumazet
2018-08-31 13:08           ` Willem de Bruijn
2018-08-31 13:44             ` Paolo Abeni
2018-08-31 15:11               ` Willem de Bruijn
2018-09-03  8:02             ` Steffen Klassert
2018-09-03 11:45               ` Sowmini Varadhan
2018-04-18 11:17 ` Paolo Abeni
2018-04-18 13:49   ` Willem de Bruijn
2018-05-24  0:02     ` Marcelo Ricardo Leitner
2018-05-24  1:15       ` Willem de Bruijn
2018-04-18 17:24   ` David Miller
2018-04-18 17:50 ` David Miller
2018-04-18 18:12   ` Willem de Bruijn
2018-04-19 17:45     ` David Miller
