All of lore.kernel.org
* [RFC PATCH net-next 0/3] sctp: add GSO support
@ 2016-01-27 17:06 ` Marcelo Ricardo Leitner
  0 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-27 17:06 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

This patchset is merely an RFC for the moment. There are some
controversial points that I'd like to discuss before actually proposing
the patches.

The points are more detailed in the 3rd patch.

Another approach I can think of is using something related to what Dave
Miller mentioned on the thread 'Optimizing instruction-cache, more
packets at each stage' about skb bundling, so I Cc'ed people from that
thread too.
SCTP could generate a list of related skbs; after all, that's pretty
much what this patchset does, but using the GRO/GSO infrastructure that
is already there.

PS: I also have code for GRO on top of this patchset, but it still
needs more work.

Thanks!

Marcelo Ricardo Leitner (3):
  skbuff: export skb_gro_receive
  sctp: offloading support structure
  sctp: Add GSO support

 include/linux/netdev_features.h |   7 +-
 include/linux/netdevice.h       |   1 +
 include/linux/skbuff.h          |   2 +
 include/net/sctp/sctp.h         |   4 +
 net/core/dev.c                  |   6 +-
 net/core/skbuff.c               |  13 +-
 net/ipv4/af_inet.c              |   1 +
 net/sctp/Makefile               |   3 +-
 net/sctp/offload.c              | 100 ++++++++++++
 net/sctp/output.c               | 338 +++++++++++++++++++++++++---------------
 net/sctp/protocol.c             |   3 +
 net/sctp/socket.c               |   2 +
 12 files changed, 351 insertions(+), 129 deletions(-)
 create mode 100644 net/sctp/offload.c

-- 
2.5.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH net-next 1/3] skbuff: export skb_gro_receive
  2016-01-27 17:06 ` Marcelo Ricardo Leitner
@ 2016-01-27 17:06   ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-27 17:06 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

sctp GSO requires it and sctp can be compiled as a module, so export
this function.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/core/skbuff.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b2df375ec9c2173a8132b8efa1c3062f0510284b..704b69682085dec77f3d0f990aaf0024afd705b9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3312,6 +3312,7 @@ done:
 	NAPI_GRO_CB(skb)->same_flow = 1;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(skb_gro_receive);
 
 void __init skb_init(void)
 {
-- 
2.5.0


* [RFC PATCH net-next 2/3] sctp: offloading support structure
  2016-01-27 17:06 ` Marcelo Ricardo Leitner
@ 2016-01-27 17:06   ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-27 17:06 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

This patch just adds the initial offloading bits, to ease review of the
next one. It will be merged with the GSO patch itself in the actual
submission.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/linux/skbuff.h  |  2 ++
 include/net/sctp/sctp.h |  4 ++++
 net/sctp/Makefile       |  3 ++-
 net/sctp/offload.c      | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 net/sctp/protocol.c     |  3 +++
 5 files changed, 58 insertions(+), 1 deletion(-)
 create mode 100644 net/sctp/offload.c

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 11f935c1a090419d6cda938aa925bfa79de3616b..7d0b02ad241b5c5936aea6b66941e42867f2e9e0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -481,6 +481,8 @@ enum {
 	SKB_GSO_UDP_TUNNEL_CSUM = 1 << 11,
 
 	SKB_GSO_TUNNEL_REMCSUM = 1 << 12,
+
+	SKB_GSO_SCTP = 1 << 13,
 };
 
 #if BITS_PER_LONG > 32
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 835aa2ed987092634a4242314e9eabb51d1e4e35..4b1159188525f193a11af95700e91f88b10268b6 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -165,6 +165,10 @@ void sctp_assocs_proc_exit(struct net *net);
 int sctp_remaddr_proc_init(struct net *net);
 void sctp_remaddr_proc_exit(struct net *net);
 
+/*
+ * sctp/offload.c
+ */
+int sctp_offload_init(void);
 
 /*
  * Module global variables
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 3b4ffb021cf1728353b5311e519604759a03b617..a89d3f51604d1474b72ee10b9d9c6a2200c2ed9d 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -10,7 +10,8 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  transport.o chunk.o sm_make_chunk.o ulpevent.o \
 	  inqueue.o outqueue.o ulpqueue.o \
 	  tsnmap.o bind_addr.o socket.o primitive.o \
-	  output.o input.o debug.o ssnmap.o auth.o
+	  output.o input.o debug.o ssnmap.o auth.o \
+	  offload.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/offload.c b/net/sctp/offload.c
new file mode 100644
index 0000000000000000000000000000000000000000..7080a6318da7110c1688dd0c5bb240356dbd0cd3
--- /dev/null
+++ b/net/sctp/offload.c
@@ -0,0 +1,47 @@
+/*
+ * sctp_offload - GRO/GSO Offloading for SCTP
+ *
+ * Copyright (C) 2015, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/kprobes.h>
+#include <linux/socket.h>
+#include <linux/sctp.h>
+#include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
+#include <linux/module.h>
+#include <linux/kfifo.h>
+#include <linux/time.h>
+#include <net/net_namespace.h>
+
+#include <linux/skbuff.h>
+#include <net/sctp/sctp.h>
+#include <net/sctp/checksum.h>
+#include <net/protocol.h>
+
+static const struct net_offload sctp_offload = {
+	.callbacks = {
+	},
+};
+
+int __init sctp_offload_init(void)
+{
+	return inet_add_offload(&sctp_offload, IPPROTO_SCTP);
+}
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index ab0d538a74ed593571cfaef02cd1bb7ce872abe6..a63464e56e46f046cb73f589c844e0203b96a5cd 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1485,6 +1485,9 @@ static __init int sctp_init(void)
 	if (status)
 		goto err_v6_add_protocol;
 
+	if (sctp_offload_init() < 0)
+		pr_crit("%s: Cannot add SCTP protocol offload\n", __func__);
+
 out:
 	return status;
 err_v6_add_protocol:
-- 
2.5.0
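
The inet_add_offload() call registered above stores the callback set in
a table indexed by IP protocol number. A rough userspace model of that
registration, just to illustrate the mechanism (the kernel uses cmpxchg()
on inet_offloads[]; all names and types below are illustrative, not
kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of inet_add_offload(): per-IP-protocol slots of
 * offload callbacks. The kernel uses cmpxchg() on inet_offloads[];
 * a plain occupancy check is enough for this sketch. */
#define MY_IPPROTO_SCTP 132		/* IANA protocol number for SCTP */

struct net_offload_model {
	void *(*gso_segment)(void *skb, int features);
};

static const struct net_offload_model *offloads[256];

static int add_offload(const struct net_offload_model *off, int protocol)
{
	if (offloads[protocol])		/* slot already taken */
		return -1;
	offloads[protocol] = off;
	return 0;
}
```

sctp_offload_init() above does this for IPPROTO_SCTP, so a gso_segment
callback added later can be looked up by protocol number on the output
path.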


* [RFC PATCH net-next 3/3] sctp: Add GSO support
  2016-01-27 17:06 ` Marcelo Ricardo Leitner
@ 2016-01-27 17:06   ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-27 17:06 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

This patch enables SCTP to do GSO.

SCTP has the peculiarity that its packets cannot simply be segmented to
(P)MTU: its chunks must be contained in IP segments, padding respected.
So we can't just generate a big skb, set gso_size to the fragmentation
point and deliver it to the IP layer.

Instead, this patch proposes that SCTP build an skb as it would look if
it had been received using GRO. That is, there will be a cover skb with
the headers (including the SCTP one) and child skbs containing the
actual SCTP chunks, already segmented in a way that respects the SCTP
RFCs and the MTU.

This way SCTP can benefit from GSO and, instead of passing several
packets through the stack, it can pass a single large packet if there
is enough data queued and cwnd allows.

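The layout described above boils down to packing 4-byte-padded chunks
into PMTU-sized children. A minimal userspace sketch of that sizing rule
(not kernel code; the function name chunks_per_packet is illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* RFC 4960, 3.2: the total length of a chunk MUST be a multiple of
 * 4 bytes; WORD_ROUND mirrors the kernel macro used in sctp/output.c. */
#define WORD_ROUND(s) (((s) + 3) & ~(size_t)3)

/* How many chunks fit into one PMTU-sized packet, padding respected.
 * Models the sizing loop this patchset adds to sctp_packet_transmit(). */
static size_t chunks_per_packet(const size_t *chunk_len, size_t nchunks,
				size_t overhead, size_t pmtu)
{
	size_t pkt_size = overhead;
	size_t n = 0;

	for (size_t i = 0; i < nchunks; i++) {
		size_t padded = WORD_ROUND(chunk_len[i]);

		if (pkt_size + padded > pmtu)
			break;
		pkt_size += padded;
		n++;
	}
	return n;
}
```

For example, with 48 bytes of overhead and a 1500-byte PMTU, only one
1000-byte chunk fits per child packet; the remaining chunks start the
next one.
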
Main points that need help:
- Usage of skb_gro_receive()
  It fits nicely in there and properly handles offsets/lengths, though
  the name means another thing. If you agree with this usage, we can
  rename it to something like skb_coalesce().

- Checksum handling
  Why can only packets with checksum offload be GSOed? Most NICs don't
  support SCTP CRC offloading, and this will nearly defeat this feature.
  If the checksum is computed in software, it doesn't really matter
  whether that happens earlier or later, right?
  This patch hacks skb_needs_check() to allow using GSO with sw-computed
  checksums.
  Also, the meaning of UNNECESSARY and NONE is still quite foggy to me
  and their usage may be wrong.

- gso_size = 1
  There is skb_is_gso() all over the stack and it basically checks for
  a non-zero skb_shinfo(skb)->gso_size. Setting it to 1 is the hacky way
  I found to keep skb_is_gso() working while being able to signal to
  skb_segment() that it shouldn't use gso_size but instead the fragment
  sizes themselves. skb_segment() will then mainly just unpack the skb.

- socket / gso max values
  The usage of sk_setup_caps() still needs review.
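
The gso_size = 1 point above reduces to a single length decision inside
skb_segment(): with a real MSS the segment is cut at mss, while the
magic value 1 means each fragment keeps its own length. A userspace
model of just that branch (next_seg_len is an illustrative name):

```c
#include <assert.h>
#include <stddef.h>

/* Models the hunk this patch adds to skb_segment(): gso_size == 1 is
 * the (admittedly hacky) marker meaning "don't cut at a fixed MSS,
 * emit each child fragment at its own length". */
static size_t next_seg_len(size_t remaining, size_t mss, size_t frag_len)
{
	size_t len = remaining;

	if (len > mss)
		len = (mss == 1) ? frag_len : mss;
	return len;
}
```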

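On the checksum point: SCTP's checksum is CRC32c (RFC 3309/RFC 4960),
which sctp_gso_make_checksum() computes in software for each resulting
segment. For reference, a minimal bitwise CRC32c; this is much slower
than the kernel's sctp_compute_cksum() but yields the same raw value:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78.
 * Returns the raw CRC; SCTP stores it little-endian in the common
 * header, with the checksum field zeroed during computation. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *buf++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
	}
	return ~crc;
}
```

The standard check value is crc32c("123456789") == 0xE3069283, handy
for sanity-testing any implementation.
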
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/linux/netdev_features.h |   7 +-
 include/linux/netdevice.h       |   1 +
 net/core/dev.c                  |   6 +-
 net/core/skbuff.c               |  12 +-
 net/ipv4/af_inet.c              |   1 +
 net/sctp/offload.c              |  53 +++++++
 net/sctp/output.c               | 338 +++++++++++++++++++++++++---------------
 net/sctp/socket.c               |   2 +
 8 files changed, 292 insertions(+), 128 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index d9654f0eecb3519383441afa6b131ff9a5898485..f678998841f1800e0f2fe416a79935197d4ed305 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -48,8 +48,9 @@ enum {
 	NETIF_F_GSO_UDP_TUNNEL_BIT,	/* ... UDP TUNNEL with TSO */
 	NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
 	NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
+	NETIF_F_GSO_SCTP_BIT,		/* ... SCTP fragmentation */
 	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
-		NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
+		NETIF_F_GSO_SCTP_BIT,
 
 	NETIF_F_FCOE_CRC_BIT,		/* FCoE CRC32 */
 	NETIF_F_SCTP_CRC_BIT,		/* SCTP checksum offload */
@@ -119,6 +120,7 @@ enum {
 #define NETIF_F_GSO_UDP_TUNNEL	__NETIF_F(GSO_UDP_TUNNEL)
 #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
+#define NETIF_F_GSO_SCTP	__NETIF_F(GSO_SCTP)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX	__NETIF_F(HW_VLAN_STAG_RX)
 #define NETIF_F_HW_VLAN_STAG_TX	__NETIF_F(HW_VLAN_STAG_TX)
@@ -144,7 +146,8 @@ enum {
 
 /* List of features with software fallbacks. */
 #define NETIF_F_GSO_SOFTWARE	(NETIF_F_TSO | NETIF_F_TSO_ECN | \
-				 NETIF_F_TSO6 | NETIF_F_UFO)
+				 NETIF_F_TSO6 | NETIF_F_UFO | \
+				 NETIF_F_GSO_SCTP)
 
 /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
  * set in features when NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM are set--
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 289c2314d76668b8357728382bb33d6828617458..ce14fab858bf96dd0f85aca237350c8d8317756e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3928,6 +3928,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
+	BUILD_BUG_ON(SKB_GSO_SCTP    != (NETIF_F_GSO_SCTP >> NETIF_F_GSO_SHIFT));
 
 	return (features & feature) == feature;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
 static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
 {
 	if (tx_path)
-		return skb->ip_summed != CHECKSUM_PARTIAL;
+		/* FIXME: Why only packets with checksum offloading are
+		 * supported for GSO?
+		 */
+		return skb->ip_summed != CHECKSUM_PARTIAL &&
+		       skb->ip_summed != CHECKSUM_UNNECESSARY;
 	else
 		return skb->ip_summed == CHECKSUM_NONE;
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 704b69682085dec77f3d0f990aaf0024afd705b9..96f223f8d769d2765fd64348830c76cb222906c8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3017,8 +3017,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 		int size;
 
 		len = head_skb->len - offset;
-		if (len > mss)
-			len = mss;
+		if (len > mss) {
+			/* FIXME: A define is surely welcomed, but maybe
+			 * shinfo->txflags is better for this flag, but
+			 * we need to expand it then
+			 */
+			if (mss == 1)
+				len = list_skb->len;
+			else
+				len = mss;
+		}
 
 		hsize = skb_headlen(head_skb) - offset;
 		if (hsize < 0)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5c5db6636704daa0c49fc13e84b2c5b282a44ed3..ec1c779bb664d1399d74f2bd7016e30b648ce47d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1220,6 +1220,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_UDP_TUNNEL |
 		       SKB_GSO_UDP_TUNNEL_CSUM |
 		       SKB_GSO_TUNNEL_REMCSUM |
+		       SKB_GSO_SCTP |
 		       0)))
 		goto out;
 
diff --git a/net/sctp/offload.c b/net/sctp/offload.c
index 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26 100644
--- a/net/sctp/offload.c
+++ b/net/sctp/offload.c
@@ -36,8 +36,61 @@
 #include <net/sctp/checksum.h>
 #include <net/protocol.h>
 
+static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
+{
+	skb->ip_summed = CHECKSUM_NONE;
+	return sctp_compute_cksum(skb, skb_transport_offset(skb));
+}
+
+static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
+					netdev_features_t features)
+{
+	struct sk_buff *segs = ERR_PTR(-EINVAL);
+	struct sctphdr *sh;
+
+	sh = sctp_hdr(skb);
+	if (!pskb_may_pull(skb, sizeof(*sh)))
+		goto out;
+
+	__skb_pull(skb, sizeof(*sh));
+
+	if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
+		/* Packet is from an untrusted source, reset gso_segs. */
+		int type = skb_shinfo(skb)->gso_type;
+
+		if (unlikely(type &
+			     ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
+			       0) ||
+			     !(type & (SKB_GSO_SCTP))))
+			goto out;
+
+		/* This should not happen as no NIC has SCTP GSO
+		 * offloading, it's always via software and thus we
+		 * won't send a large packet down the stack.
+		 */
+		WARN_ONCE(1, "SCTP segmentation offloading to NICs is not supported.");
+		goto out;
+	}
+
+	segs = skb_segment(skb, features);
+	if (IS_ERR(segs))
+		goto out;
+
+	/* All that is left is update SCTP CRC if necessary */
+	for (skb = segs; skb; skb = skb->next) {
+		if (skb->ip_summed != CHECKSUM_PARTIAL) {
+			sh = sctp_hdr(skb);
+			sh->checksum = sctp_gso_make_checksum(skb);
+		}
+	}
+
+out:
+	return segs;
+}
+
 static const struct net_offload sctp_offload = {
 	.callbacks = {
+		.gso_segment = sctp_gso_segment,
 	},
 };
 
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 9d610eddd19ef2320fc34ae9d91e7426ae5f50f9..5e619b1b7b47737447bce746b2420bac3427fde4 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -381,12 +381,14 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 	struct sctp_transport *tp = packet->transport;
 	struct sctp_association *asoc = tp->asoc;
 	struct sctphdr *sh;
-	struct sk_buff *nskb;
+	struct sk_buff *nskb = NULL, *head = NULL;
 	struct sctp_chunk *chunk, *tmp;
-	struct sock *sk;
+	struct sock *sk = asoc->base.sk;
 	int err = 0;
 	int padding;		/* How much padding do we need?  */
+	int pkt_size;
 	__u8 has_data = 0;
+	int gso = 0;
 	struct dst_entry *dst;
 	unsigned char *auth = NULL;	/* pointer to auth in skb data */
 
@@ -396,37 +398,44 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 	if (list_empty(&packet->chunk_list))
 		return err;
 
-	/* Set up convenience variables... */
+	/* TODO: double check this */
 	chunk = list_entry(packet->chunk_list.next, struct sctp_chunk, list);
 	sk = chunk->skb->sk;
+	dst_hold(tp->dst);
+	sk_setup_caps(sk, tp->dst);
+
+	if (packet->size > tp->pathmtu) {
+		WARN_ON(packet->ipfragok);
+		if (sk_can_gso(sk)) {
+			gso = 1;
+			pkt_size = packet->overhead;
+		} else {
+			/* Something nasty happened */
+			/* FIXME */
+			printk("Damn, we can't GSO and packet is too big %d for pmtu %d.\n",
+			       packet->size, tp->pathmtu);
+			goto nomem;
+		}
+	} else {
+		pkt_size = packet->size;
+	}
 
-	/* Allocate the new skb.  */
-	nskb = alloc_skb(packet->size + MAX_HEADER, GFP_ATOMIC);
-	if (!nskb)
+	/* Allocate the head skb, or main one if not in GSO */
+	head = alloc_skb(pkt_size + MAX_HEADER, GFP_ATOMIC);
+	if (!head)
 		goto nomem;
+	if (gso) {
+		NAPI_GRO_CB(head)->last = head;
+	} else {
+		nskb = head;
+	}
 
 	/* Make sure the outbound skb has enough header room reserved. */
-	skb_reserve(nskb, packet->overhead + MAX_HEADER);
-
-	/* Set the owning socket so that we know where to get the
-	 * destination IP address.
-	 */
-	sctp_packet_set_owner_w(nskb, sk);
-
-	if (!sctp_transport_dst_check(tp)) {
-		sctp_transport_route(tp, NULL, sctp_sk(sk));
-		if (asoc && (asoc->param_flags & SPP_PMTUD_ENABLE)) {
-			sctp_assoc_sync_pmtu(sk, asoc);
-		}
-	}
-	dst = dst_clone(tp->dst);
-	if (!dst)
-		goto no_route;
-	skb_dst_set(nskb, dst);
+	skb_reserve(head, packet->overhead + MAX_HEADER);
 
 	/* Build the SCTP header.  */
-	sh = (struct sctphdr *)skb_push(nskb, sizeof(struct sctphdr));
-	skb_reset_transport_header(nskb);
+	sh = (struct sctphdr *)skb_push(head, sizeof(struct sctphdr));
+	skb_reset_transport_header(head);
 	sh->source = htons(packet->source_port);
 	sh->dest   = htons(packet->destination_port);
 
@@ -441,90 +450,164 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 	sh->vtag     = htonl(packet->vtag);
 	sh->checksum = 0;
 
-	/**
-	 * 6.10 Bundling
-	 *
-	 *    An endpoint bundles chunks by simply including multiple
-	 *    chunks in one outbound SCTP packet.  ...
+	/* Set the owning socket so that we know where to get the
+	 * destination IP address.
 	 */
+	sctp_packet_set_owner_w(head, sk);
 
-	/**
-	 * 3.2  Chunk Field Descriptions
-	 *
-	 * The total length of a chunk (including Type, Length and
-	 * Value fields) MUST be a multiple of 4 bytes.  If the length
-	 * of the chunk is not a multiple of 4 bytes, the sender MUST
-	 * pad the chunk with all zero bytes and this padding is not
-	 * included in the chunk length field.  The sender should
-	 * never pad with more than 3 bytes.
-	 *
-	 * [This whole comment explains WORD_ROUND() below.]
-	 */
+	if (!sctp_transport_dst_check(tp)) {
+		sctp_transport_route(tp, NULL, sctp_sk(sk));
+		if (asoc && (asoc->param_flags & SPP_PMTUD_ENABLE)) {
+			sctp_assoc_sync_pmtu(sk, asoc);
+		}
+	}
+	dst = dst_clone(tp->dst);
+	if (!dst)
+		goto no_route;
+	skb_dst_set(head, dst);
 
 	pr_debug("***sctp_transmit_packet***\n");
 
-	list_for_each_entry_safe(chunk, tmp, &packet->chunk_list, list) {
-		list_del_init(&chunk->list);
-		if (sctp_chunk_is_data(chunk)) {
-			/* 6.3.1 C4) When data is in flight and when allowed
-			 * by rule C5, a new RTT measurement MUST be made each
-			 * round trip.  Furthermore, new RTT measurements
-			 * SHOULD be made no more than once per round-trip
-			 * for a given destination transport address.
-			 */
-
-			if (!chunk->resent && !tp->rto_pending) {
-				chunk->rtt_in_progress = 1;
-				tp->rto_pending = 1;
+	do {
+		/* Set up convenience variables... */
+		chunk = list_entry(packet->chunk_list.next, struct sctp_chunk, list);
+		WARN_ON(sk != chunk->skb->sk); /* XXX */
+
+		/* Calculate packet size, so it fits in PMTU. Leave
+		 * other chunks for the next packets. */
+		if (gso) {
+			pkt_size = packet->overhead;
+			list_for_each_entry(chunk, &packet->chunk_list, list) {
+				int padded = WORD_ROUND(chunk->skb->len);
+				if (pkt_size + padded > tp->pathmtu)
+					break;
+				pkt_size += padded;
 			}
 
-			has_data = 1;
+			/* Allocate the new skb.  */
+			nskb = alloc_skb(pkt_size + MAX_HEADER, GFP_ATOMIC);
+
+			/* Make sure the outbound skb has enough header room reserved. */
+			if (nskb)
+				skb_reserve(nskb, packet->overhead + MAX_HEADER);
 		}
+		if (!nskb)
+			goto nomem;
+
+		/**
+		 * 3.2  Chunk Field Descriptions
+		 *
+		 * The total length of a chunk (including Type, Length and
+		 * Value fields) MUST be a multiple of 4 bytes.  If the length
+		 * of the chunk is not a multiple of 4 bytes, the sender MUST
+		 * pad the chunk with all zero bytes and this padding is not
+		 * included in the chunk length field.  The sender should
+		 * never pad with more than 3 bytes.
+		 *
+		 * [This whole comment explains WORD_ROUND() below.]
+		 */
+
+		pkt_size -= packet->overhead;
+		list_for_each_entry_safe(chunk, tmp, &packet->chunk_list, list) {
+			list_del_init(&chunk->list);
+			if (sctp_chunk_is_data(chunk)) {
+				/* 6.3.1 C4) When data is in flight and when allowed
+				 * by rule C5, a new RTT measurement MUST be made each
+				 * round trip.  Furthermore, new RTT measurements
+				 * SHOULD be made no more than once per round-trip
+				 * for a given destination transport address.
+				 */
+
+				if (!chunk->resent && !tp->rto_pending) {
+					chunk->rtt_in_progress = 1;
+					tp->rto_pending = 1;
+				}
+
+				has_data = 1;
+			}
+
+			padding = WORD_ROUND(chunk->skb->len) - chunk->skb->len;
+			if (padding)
+				memset(skb_put(chunk->skb, padding), 0, padding);
 
-		padding = WORD_ROUND(chunk->skb->len) - chunk->skb->len;
-		if (padding)
-			memset(skb_put(chunk->skb, padding), 0, padding);
+			/* if this is the auth chunk that we are adding,
+			 * store pointer where it will be added and put
+			 * the auth into the packet.
+			 */
+			if (chunk == packet->auth) {
+				auth = skb_tail_pointer(nskb);
+			}
+
+			memcpy(skb_put(nskb, chunk->skb->len),
+				       chunk->skb->data, chunk->skb->len);
+
+			pr_debug("*** Chunk:%p[%s] %s 0x%x, length:%d, chunk->skb->len:%d, "
+				 "rtt_in_progress:%d\n", chunk,
+				 sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)),
+				 chunk->has_tsn ? "TSN" : "No TSN",
+				 chunk->has_tsn ? ntohl(chunk->subh.data_hdr->tsn) : 0,
+				 ntohs(chunk->chunk_hdr->length), chunk->skb->len,
+				 chunk->rtt_in_progress);
+
+			/*
+			 * If this is a control chunk, this is our last
+			 * reference. Free data chunks after they've been
+			 * acknowledged or have failed.
+			 * Re-queue auth chunks if needed.
+			 */
+			pkt_size -= WORD_ROUND(chunk->skb->len);
+
+			if (chunk == packet->auth && !list_empty(&packet->chunk_list))
+				list_add(&chunk->list, &packet->chunk_list);
+			else if (!sctp_chunk_is_data(chunk))
+				sctp_chunk_free(chunk);
 
-		/* if this is the auth chunk that we are adding,
-		 * store pointer where it will be added and put
-		 * the auth into the packet.
+			if (!pkt_size)
+				break;
+		}
+
+		/* SCTP-AUTH, Section 6.2
+		 *    The sender MUST calculate the MAC as described in RFC2104 [2]
+		 *    using the hash function H as described by the MAC Identifier and
+		 *    the shared association key K based on the endpoint pair shared key
+		 *    described by the shared key identifier.  The 'data' used for the
+		 *    computation of the AUTH-chunk is given by the AUTH chunk with its
+		 *    HMAC field set to zero (as shown in Figure 6) followed by all
+		 *    chunks that are placed after the AUTH chunk in the SCTP packet.
 		 */
-		if (chunk == packet->auth)
-			auth = skb_tail_pointer(nskb);
-
-		memcpy(skb_put(nskb, chunk->skb->len),
-			       chunk->skb->data, chunk->skb->len);
-
-		pr_debug("*** Chunk:%p[%s] %s 0x%x, length:%d, chunk->skb->len:%d, "
-			 "rtt_in_progress:%d\n", chunk,
-			 sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)),
-			 chunk->has_tsn ? "TSN" : "No TSN",
-			 chunk->has_tsn ? ntohl(chunk->subh.data_hdr->tsn) : 0,
-			 ntohs(chunk->chunk_hdr->length), chunk->skb->len,
-			 chunk->rtt_in_progress);
-
-		/*
-		 * If this is a control chunk, this is our last
-		 * reference. Free data chunks after they've been
-		 * acknowledged or have failed.
+		if (auth)
+			sctp_auth_calculate_hmac(asoc, nskb,
+						(struct sctp_auth_chunk *)auth,
+						GFP_ATOMIC);
+
+		/* Set up the IP options.  */
+		/* BUG: not implemented
+		 * For v4 this all lives somewhere in sk->sk_opt...
 		 */
-		if (!sctp_chunk_is_data(chunk))
-			sctp_chunk_free(chunk);
-	}
 
-	/* SCTP-AUTH, Section 6.2
-	 *    The sender MUST calculate the MAC as described in RFC2104 [2]
-	 *    using the hash function H as described by the MAC Identifier and
-	 *    the shared association key K based on the endpoint pair shared key
-	 *    described by the shared key identifier.  The 'data' used for the
-	 *    computation of the AUTH-chunk is given by the AUTH chunk with its
-	 *    HMAC field set to zero (as shown in Figure 6) followed by all
-	 *    chunks that are placed after the AUTH chunk in the SCTP packet.
-	 */
-	if (auth)
-		sctp_auth_calculate_hmac(asoc, nskb,
-					(struct sctp_auth_chunk *)auth,
-					GFP_ATOMIC);
+		/* Dump that on IP!  */
+		if (asoc) {
+			asoc->stats.opackets++;
+			if (asoc->peer.last_sent_to != tp)
+				/* Considering the multiple CPU scenario, this is a
+				 * "correcter" place for last_sent_to.  --xguo
+				 */
+				asoc->peer.last_sent_to = tp;
+		}
+
+
+		if (!gso ||
+		    skb_shinfo(head)->gso_segs >= sk->sk_gso_max_segs)
+//		    head->len + asoc->pathmtu >= sk->sk_gso_max_size)
+			break;
+
+		if (skb_gro_receive(&head, nskb))
+			goto nomem;
+		skb_shinfo(head)->gso_segs++;
+		/* FIXME: below is a lie */
+		skb_shinfo(head)->gso_size = 1;
+		nskb = NULL;
+	} while (!list_empty(&packet->chunk_list));
 
 	/* 2) Calculate the Adler-32 checksum of the whole packet,
 	 *    including the SCTP common header and all the
@@ -532,16 +615,21 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 	 *
 	 * Note: Adler-32 is no longer applicable, as has been replaced
 	 * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
+	 *
+	 * If it's a GSO packet, it's postponed to sctp_skb_segment.
 	 */
-	if (!sctp_checksum_disable) {
+	if (!sctp_checksum_disable || gso) {
 		if (!(dst->dev->features & NETIF_F_SCTP_CRC) ||
 		    (dst_xfrm(dst) != NULL) || packet->ipfragok) {
-			sh->checksum = sctp_compute_cksum(nskb, 0);
+			if (!gso)
+				sh->checksum = sctp_compute_cksum(head, 0);
+			else
+				head->ip_summed = CHECKSUM_UNNECESSARY;
 		} else {
 			/* no need to seed pseudo checksum for SCTP */
-			nskb->ip_summed = CHECKSUM_PARTIAL;
-			nskb->csum_start = skb_transport_header(nskb) - nskb->head;
-			nskb->csum_offset = offsetof(struct sctphdr, checksum);
+			head->ip_summed = CHECKSUM_PARTIAL;
+			head->csum_start = skb_transport_header(head) - head->head;
+			head->csum_offset = offsetof(struct sctphdr, checksum);
 		}
 	}
 
@@ -557,22 +645,7 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 	 * Note: The works for IPv6 layer checks this bit too later
 	 * in transmission.  See IP6_ECN_flow_xmit().
 	 */
-	tp->af_specific->ecn_capable(nskb->sk);
-
-	/* Set up the IP options.  */
-	/* BUG: not implemented
-	 * For v4 this all lives somewhere in sk->sk_opt...
-	 */
-
-	/* Dump that on IP!  */
-	if (asoc) {
-		asoc->stats.opackets++;
-		if (asoc->peer.last_sent_to != tp)
-			/* Considering the multiple CPU scenario, this is a
-			 * "correcter" place for last_sent_to.  --xguo
-			 */
-			asoc->peer.last_sent_to = tp;
-	}
+	tp->af_specific->ecn_capable(head->sk);
 
 	if (has_data) {
 		struct timer_list *timer;
@@ -589,16 +662,23 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 		}
 	}
 
-	pr_debug("***sctp_transmit_packet*** skb->len:%d\n", nskb->len);
+	pr_debug("***sctp_transmit_packet*** skb->len:%d\n", head->len);
 
-	nskb->ignore_df = packet->ipfragok;
-	tp->af_specific->sctp_xmit(nskb, tp);
+	head->ignore_df = packet->ipfragok;
+	printk("%s %d %d %d\n", __func__, head->len,
+	       packet->transport->pathmtu,
+	       packet->transport->pathmtu - packet->overhead);
+	if (gso)
+		skb_shinfo(head)->gso_type = SKB_GSO_SCTP;
+	tp->af_specific->sctp_xmit(head, tp);
 
 out:
 	sctp_packet_reset(packet);
+	sk_dst_reset(sk); /* FIXME: double check */
 	return err;
 no_route:
 	kfree_skb(nskb);
+	kfree_skb(head);
 
 	if (asoc)
 		IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
@@ -635,7 +715,7 @@ nomem:
 static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
 					   struct sctp_chunk *chunk)
 {
-	size_t datasize, rwnd, inflight, flight_size;
+	size_t datasize, rwnd, inflight, flight_size, maxsize;
 	struct sctp_transport *transport = packet->transport;
 	struct sctp_association *asoc = transport->asoc;
 	struct sctp_outq *q = &asoc->outqueue;
@@ -705,7 +785,15 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
 	/* Check whether this chunk and all the rest of pending data will fit
 	 * or delay in hopes of bundling a full sized packet.
 	 */
-	if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead)
+	if (packet->ipfragok) {
+		/* Means chunk won't fit and needs fragmentation at
+		 * transport level, so we can't do GSO.
+		 */
+		maxsize = transport->pathmtu;
+	} else {
+		maxsize = transport->dst->dev->gso_max_size;
+	}
+	if (chunk->skb->len + q->out_qlen >= maxsize - packet->overhead)
 		/* Enough data queued to fill a packet */
 		return SCTP_XMIT_OK;
 
@@ -764,6 +852,8 @@ static sctp_xmit_t sctp_packet_will_fit(struct sctp_packet *packet,
 
 	/* Decide if we need to fragment or resubmit later. */
 	if (too_big) {
+		struct net_device *dev = packet->transport->dst->dev;
+
 		/* It's OK to fragmet at IP level if any one of the following
 		 * is true:
 		 * 	1. The packet is empty (meaning this chunk is greater
@@ -779,9 +869,11 @@ static sctp_xmit_t sctp_packet_will_fit(struct sctp_packet *packet,
 			 * actually hit this condition
 			 */
 			packet->ipfragok = 1;
-		} else {
+		} else if (psize + chunk_len > dev->gso_max_size - packet->overhead) {
+			/* Hit GSO limit, gotta flush */
 			retval = SCTP_XMIT_PMTU_FULL;
 		}
+		/* Otherwise it will fit in the GSO packet */
 	}
 
 	return retval;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 5ca2ebfe0be83882fcb841de6fa8029b6455ef85..064e5d375e612f2ec745f384d35f0e4c6b96212c 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4001,6 +4001,8 @@ static int sctp_init_sock(struct sock *sk)
 		return -ESOCKTNOSUPPORT;
 	}
 
+	sk->sk_gso_type = SKB_GSO_SCTP;
+
 	/* Initialize default send parameters. These parameters can be
 	 * modified with the SCTP_DEFAULT_SEND_PARAM socket option.
 	 */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 1/3] skbuff: export skb_gro_receive
  2016-01-27 17:06   ` Marcelo Ricardo Leitner
@ 2016-01-27 18:35     ` Eric Dumazet
  -1 siblings, 0 replies; 49+ messages in thread
From: Eric Dumazet @ 2016-01-27 18:35 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: netdev, Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

On Wed, 2016-01-27 at 15:06 -0200, Marcelo Ricardo Leitner wrote:
> sctp GSO requires it and sctp can be compiled as a module, so export
> this function.
> 
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  net/core/skbuff.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b2df375ec9c2173a8132b8efa1c3062f0510284b..704b69682085dec77f3d0f990aaf0024afd705b9 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3312,6 +3312,7 @@ done:
>  	NAPI_GRO_CB(skb)->same_flow = 1;
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(skb_gro_receive);
>  
>  void __init skb_init(void)
>  {


Normally, all the offloading support belongs in vmlinux, so this export
is not needed.

For instance, we support GRO IPV6 even if IPv6 is not enabled on the
host.

Ie all these net/ipv6 files are included in vmlinux if CONFIG_INET is
enabled

ip6_offload.o tcpv6_offload.o udp_offload.o exthdrs_offload.o
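
A hypothetical sketch of the arrangement Eric describes, using the net/ipv6 pattern as a model (the object names here are illustrative, not taken from the patchset):

```make
# Hypothetical net/sctp/Makefile fragment, modeled on net/ipv6/Makefile:
# the protocol proper may be modular, but the offload glue is linked
# into vmlinux whenever CONFIG_INET is enabled.
obj-$(CONFIG_IP_SCTP) += sctp.o       # SCTP itself, may be =m
obj-$(CONFIG_INET)    += offload.o    # offload glue, always built in
```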

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: [RFC PATCH net-next 1/3] skbuff: export skb_gro_receive
  2016-01-27 18:35     ` Eric Dumazet
@ 2016-01-27 18:46       ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-27 18:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

Em 27-01-2016 16:35, Eric Dumazet escreveu:
> On Wed, 2016-01-27 at 15:06 -0200, Marcelo Ricardo Leitner wrote:
>> sctp GSO requires it and sctp can be compiled as a module, so export
>> this function.
>>
>> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
>> ---
>>   net/core/skbuff.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index b2df375ec9c2173a8132b8efa1c3062f0510284b..704b69682085dec77f3d0f990aaf0024afd705b9 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -3312,6 +3312,7 @@ done:
>>   	NAPI_GRO_CB(skb)->same_flow = 1;
>>   	return 0;
>>   }
>> +EXPORT_SYMBOL_GPL(skb_gro_receive);
>>
>>   void __init skb_init(void)
>>   {
>
>
> Normally, all the offloading support belongs in vmlinux, so this export
> is not needed.
>
> For instance, we support GRO IPV6 even if IPv6 is not enabled on the
> host.
>
> Ie all these net/ipv6 files are included in vmlinux if CONFIG_INET is
> enabled
>
> ip6_offload.o tcpv6_offload.o udp_offload.o exthdrs_offload.o

Okay. For SCTP it might be a bit harder due to its CRC, which uses the
crc32c function. I'll check it.

Thanks,
Marcelo

^ permalink raw reply	[flat|nested] 49+ messages in thread


* RE: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-27 17:06 ` Marcelo Ricardo Leitner
                   ` (3 preceding siblings ...)
  (?)
@ 2016-01-28 13:51 ` David Laight
  2016-01-28 15:53     ` 'Marcelo Ricardo Leitner'
  2016-01-28 17:54     ` Michael Tuexen
  -1 siblings, 2 replies; 49+ messages in thread
From: David Laight @ 2016-01-28 13:51 UTC (permalink / raw)
  To: 'Marcelo Ricardo Leitner', netdev
  Cc: Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

From: Marcelo Ricardo Leitner
> Sent: 27 January 2016 17:07
> This patchset is merely a RFC for the moment. There are some
> controversial points that I'd like to discuss before actually proposing
> the patches.

You also need to look at how a 'user' can actually get SCTP to
merge data chunks in the first place.

With Nagle disabled (and it probably has to be since the data flow
is unlikely to be 'command-response' or 'unidirectional bulk')
it is currently almost impossible to get more than one chunk
into an ethernet frame.

Support for MSG_MORE would help.

Given the current implementation you can get almost the required
behaviour by turning nagle off and on repeatedly.

I did wonder whether the queued data could actually be picked up
by a Heartbeat chunk that is probing a different remote address
(which would be bad news).

	David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 13:51 ` [RFC PATCH net-next 0/3] sctp: add " David Laight
@ 2016-01-28 15:53     ` 'Marcelo Ricardo Leitner'
  2016-01-28 17:54     ` Michael Tuexen
  1 sibling, 0 replies; 49+ messages in thread
From: 'Marcelo Ricardo Leitner' @ 2016-01-28 15:53 UTC (permalink / raw)
  To: David Laight
  Cc: netdev, Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

On Thu, Jan 28, 2016 at 01:51:02PM +0000, David Laight wrote:
> From: Marcelo Ricardo Leitner
> > Sent: 27 January 2016 17:07
> > This patchset is merely a RFC for the moment. There are some
> > controversial points that I'd like to discuss before actually proposing
> > the patches.
> 
> You also need to look at how a 'user' can actually get SCTP to
> merge data chunks in the first place.
> 
> With Nagle disabled (and it probably has to be since the data flow
> is unlikely to be 'command-response' or 'unidirectional bulk')
> it is currently almost impossible to get more than one chunk
> into an ethernet frame.
> 
> Support for MSG_MORE would help.
> 
> Given the current implementation you can get almost the required
> behaviour by turning nagle off and on repeatedly.

That's pretty much expected, I think. Without Nagle, if bandwidth and
cwnd allow, the segment will be sent right away. GSO by itself shouldn't
introduce buffering to prevent that.

If something causes a bottleneck, tx may queue up. For instance, when I
run a stress test on my system, the receiver side is generally slower
than the sender, so tx buffers build up pretty easily. That mimics
bandwidth restrictions.

There is also the case of sending large data chunks, where
sctp_sendmsg() will segment them into smaller chunks already.

But yes, agreed, MSG_MORE is at least a welcome complement here,
especially for applications generating a train of chunks. Will put that
on my ToDo list here, thanks.

> I did wonder whether the queued data could actually be picked up
> be a Heartbeat chunk that is probing a different remote address
> (which would be bad news).

I don't follow. You mean if a heartbeat may get stuck in queue or if
sending of a heartbeat can end up carrying additional data by accident?

  Marcelo

^ permalink raw reply	[flat|nested] 49+ messages in thread


* RE: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 15:53     ` 'Marcelo Ricardo Leitner'
  (?)
@ 2016-01-28 17:30     ` David Laight
  2016-01-28 20:55         ` 'Marcelo Ricardo Leitner'
  -1 siblings, 1 reply; 49+ messages in thread
From: David Laight @ 2016-01-28 17:30 UTC (permalink / raw)
  To: 'Marcelo Ricardo Leitner'
  Cc: netdev, Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

From: 'Marcelo Ricardo Leitner'
> Sent: 28 January 2016 15:53
> On Thu, Jan 28, 2016 at 01:51:02PM +0000, David Laight wrote:
> > From: Marcelo Ricardo Leitner
> > > Sent: 27 January 2016 17:07
> > > This patchset is merely a RFC for the moment. There are some
> > > controversial points that I'd like to discuss before actually proposing
> > > the patches.
> >
> > You also need to look at how a 'user' can actually get SCTP to
> > merge data chunks in the first place.
> >
> > With Nagle disabled (and it probably has to be since the data flow
> > is unlikely to be 'command-response' or 'unidirectional bulk')
> > it is currently almost impossible to get more than one chunk
> > into an ethernet frame.
> >
> > Support for MSG_MORE would help.
> >
> > Given the current implementation you can get almost the required
> > behaviour by turning nagle off and on repeatedly.
> 
> That's pretty much expected, I think. Without Nagle, if bandwidth and
> cwnd allow, segment will be sent. GSO by itself shouldn't cause a
> buffering to protect from that.
> 
> If something causes a bottleneck, tx may get queue up.  Like if I do a
> stress test in my system, generally receiver side is slower than sender,
> so I end up having tx buffers pretty easily. It mimics bandwidth
> restrictions.

Imagine using M2UA to connect local machines (one running mtp3, the other mtp2).
Configure two linksets of 16 signalling links and perform a double-reflect
loopback test.
The SCTP connection won't ever saturate, so every msu ends up in its own
ethernet packet.
It is easy to generate 1000's of ethernet frames/sec on a single connection.

(We do this with something not entirely quite like M2UA over TCP,
even then it is very hard to get multiple messages into a single
ethernet frame.)


> There is also the case of sending large data chunks, where
> sctp_sendmsg() will segment it into smaller chunks already.

I presume they are merged before being passed to the receiving socket?

> But yes, agreed, MSG_MORE is at least a welcomed compliment here,
> specially for applications generating a train of chunks. Will put that in
> my ToDo here, thanks.

I've posted a patch for MSG_MORE in the past; it didn't quite work.

> > I did wonder whether the queued data could actually be picked up
> > be a Heartbeat chunk that is probing a different remote address
> > (which would be bad news).
> 
> I don't follow. You mean if a heartbeat may get stuck in queue or if
> sending of a heartbeat can end up carrying additional data by accident?

My suspicion was that the heartbeat would carry the queued data.

	David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 13:51 ` [RFC PATCH net-next 0/3] sctp: add " David Laight
@ 2016-01-28 17:54     ` Michael Tuexen
  2016-01-28 17:54     ` Michael Tuexen
  1 sibling, 0 replies; 49+ messages in thread
From: Michael Tuexen @ 2016-01-28 17:54 UTC (permalink / raw)
  To: David Laight
  Cc: Marcelo Ricardo Leitner, netdev, Neil Horman, Vlad Yasevich,
	David Miller, brouer, alexander.duyck, alexei.starovoitov,
	borkmann, marek, hannes, fw, pabeni, john.r.fastabend,
	linux-sctp

> On 28 Jan 2016, at 14:51, David Laight <David.Laight@ACULAB.COM> wrote:
> 
> From: Marcelo Ricardo Leitner
>> Sent: 27 January 2016 17:07
>> This patchset is merely a RFC for the moment. There are some
>> controversial points that I'd like to discuss before actually proposing
>> the patches.
> 
> You also need to look at how a 'user' can actually get SCTP to
> merge data chunks in the first place.
> 
> With Nagle disabled (and it probably has to be since the data flow
> is unlikely to be 'command-response' or 'unidirectional bulk')
> it is currently almost impossible to get more than one chunk
> into an ethernet frame.
> 
> Support for MSG_MORE would help.
What about adding support for the explicit EOR mode as specified in
https://tools.ietf.org/html/rfc6458#section-8.1.26

Best regards
Michael
> 
> Given the current implementation you can get almost the required
> behaviour by turning nagle off and on repeatedly.
> 
> I did wonder whether the queued data could actually be picked up
> be a Heartbeat chunk that is probing a different remote address
> (which would be bad news).
> 
> 	David
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 17:30     ` David Laight
@ 2016-01-28 20:55         ` 'Marcelo Ricardo Leitner'
  0 siblings, 0 replies; 49+ messages in thread
From: 'Marcelo Ricardo Leitner' @ 2016-01-28 20:55 UTC (permalink / raw)
  To: David Laight
  Cc: netdev, Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

On Thu, Jan 28, 2016 at 05:30:24PM +0000, David Laight wrote:
> From: 'Marcelo Ricardo Leitner'
> > Sent: 28 January 2016 15:53
> > On Thu, Jan 28, 2016 at 01:51:02PM +0000, David Laight wrote:
> > > From: Marcelo Ricardo Leitner
> > > > Sent: 27 January 2016 17:07
> > > > This patchset is merely a RFC for the moment. There are some
> > > > controversial points that I'd like to discuss before actually proposing
> > > > the patches.
> > >
> > > You also need to look at how a 'user' can actually get SCTP to
> > > merge data chunks in the first place.
> > >
> > > With Nagle disabled (and it probably has to be since the data flow
> > > is unlikely to be 'command-response' or 'unidirectional bulk')
> > > it is currently almost impossible to get more than one chunk
> > > into an ethernet frame.
> > >
> > > Support for MSG_MORE would help.
> > >
> > > Given the current implementation you can get almost the required
> > > behaviour by turning nagle off and on repeatedly.
> > 
> > That's pretty much expected, I think. Without Nagle, if bandwidth and
> > cwnd allow, segment will be sent. GSO by itself shouldn't cause a
> > buffering to protect from that.
> > 
> > If something causes a bottleneck, tx may get queue up.  Like if I do a
> > stress test in my system, generally receiver side is slower than sender,
> > so I end up having tx buffers pretty easily. It mimics bandwidth
> > restrictions.
> 
> Imagine a using M2UA to connect local machines (one running mtp3, the other mtp2).
> Configure two linksets of 16 signalling links and perform a double-reflect
> loopback test.
> The SCTP connection won't ever saturate, so every msu ends up in its own
> ethernet packet.
> It is easy to generate 1000's of ethernet frames/sec on a single connection.
> 
> (We do this with something not entirely quite like M2UA over TCP,
> even then it is very hard to get multiple message into a single
> ethernet frame.)
> 

Agreed, GSO won't help much there without a corking feature like
MSG_MORE or that on/off toggling of Nagle you mentioned.

The tricky part about this (and also GRO) is identifying how long we can
wait for the next chunk/packet without causing issues for the
application. Nagle is there and helps quite a lot, but timing-sensitive
applications will turn it off.

GSO then relies on a bottleneck to cause tx to get buffered (hmm, there
goes the timing sensitivity), which can then be GSOed without problems,
or it requires the application to provide some hints, like MSG_MORE. But
well, GSO is mainly meant for bulk transfers.

> > There is also the case of sending large data chunks, where
> > sctp_sendmsg() will segment it into smaller chunks already.
> 
> I presume they are merged before being passed to the receiving socket?

Yes. SCTP will reassemble the chunk and only deliver it to the receiving
application when all the needed pieces are there.

> > But yes, agreed, MSG_MORE is at least a welcomed compliment here,
> > specially for applications generating a train of chunks. Will put that in
> > my ToDo here, thanks.
> 
> I've posted a patch in the past for MSG_MORE, didn't quite work.

Ahh cool. Can you share the archive link please? Maybe I can take it
from there then.

> > > I did wonder whether the queued data could actually be picked up
> > > be a Heartbeat chunk that is probing a different remote address
> > > (which would be bad news).
> > 
> > I don't follow. You mean if a heartbeat may get stuck in queue or if
> > sending of a heartbeat can end up carrying additional data by accident?
> 
> My suspicion was that the heartbeat would carry the queued data.

I'm afraid I'm still not following, sorry. You mean that this GSO patch
would cause the heartbeat to carry queued data? If yes, no: on the SCTP
side it only mangles the packet size to make it look bigger, instead of
handling multiple packets. It then breaks this large sctp_packet into
several sk_buffs and glues them together as if they had been GROed,
allowing skb_segment to simply split them back apart. The reason the
sctp_packet is generated, be it due to user data or control chunks like
heartbeats, is not modified.

  Marcelo

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 17:54     ` Michael Tuexen
@ 2016-01-28 21:03       ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-28 21:03 UTC (permalink / raw)
  To: Michael Tuexen
  Cc: David Laight, netdev, Neil Horman, Vlad Yasevich, David Miller,
	brouer, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend, linux-sctp

On Thu, Jan 28, 2016 at 06:54:06PM +0100, Michael Tuexen wrote:
> > On 28 Jan 2016, at 14:51, David Laight <David.Laight@ACULAB.COM> wrote:
> > 
> > From: Marcelo Ricardo Leitner
> >> Sent: 27 January 2016 17:07
> >> This patchset is merely a RFC for the moment. There are some
> >> controversial points that I'd like to discuss before actually proposing
> >> the patches.
> > 
> > You also need to look at how a 'user' can actually get SCTP to
> > merge data chunks in the first place.
> > 
> > With Nagle disabled (and it probably has to be since the data flow
> > is unlikely to be 'command-response' or 'unidirectional bulk')
> > it is currently almost impossible to get more than one chunk
> > into an ethernet frame.
> > 
> > Support for MSG_MORE would help.
> What about adding support for the explicit EOR mode as specified in
> https://tools.ietf.org/html/rfc6458#section-8.1.26

Seizing the moment to clarify my understanding of that. :)
With that mode, multiple calls to the send system calls will result in a
single data chunk. Is that so? That's what I get from that text and also
from this snippet:
"Sending a message using sendmsg() is atomic unless explicit end of
record (EOR) marking is enabled on the socket specified by sd (see
Section 8.1.26)."

Best regards,
Marcelo

> Best regards
> Michael
> > 
> > Given the current implementation you can get almost the required
> > behaviour by turning nagle off and on repeatedly.
> > 
> > I did wonder whether the queued data could actually be picked up
> > be a Heartbeat chunk that is probing a different remote address
> > (which would be bad news).
> > 
> > 	David
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 21:03       ` Marcelo Ricardo Leitner
@ 2016-01-28 23:36         ` Michael Tuexen
  -1 siblings, 0 replies; 49+ messages in thread
From: Michael Tuexen @ 2016-01-28 23:36 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: David Laight, netdev, Neil Horman, Vlad Yasevich, David Miller,
	brouer, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend, linux-sctp


> On 28 Jan 2016, at 22:03, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> 
> On Thu, Jan 28, 2016 at 06:54:06PM +0100, Michael Tuexen wrote:
>>> On 28 Jan 2016, at 14:51, David Laight <David.Laight@ACULAB.COM> wrote:
>>> 
>>> From: Marcelo Ricardo Leitner
>>>> Sent: 27 January 2016 17:07
>>>> This patchset is merely a RFC for the moment. There are some
>>>> controversial points that I'd like to discuss before actually proposing
>>>> the patches.
>>> 
>>> You also need to look at how a 'user' can actually get SCTP to
>>> merge data chunks in the first place.
>>> 
>>> With Nagle disabled (and it probably has to be since the data flow
>>> is unlikely to be 'command-response' or 'unidirectional bulk')
>>> it is currently almost impossible to get more than one chunk
>>> into an ethernet frame.
>>> 
>>> Support for MSG_MORE would help.
>> What about adding support for the explicit EOR mode as specified in
>> https://tools.ietf.org/html/rfc6458#section-8.1.26
> 
> Seizing the moment to clarify my understanding on that. :)
> Such multiple calls to send system calls will result in a single data
> chunk. Is that so? That's what I get from that text and also from this
No. It results in a single user message. This means you can send
a user message larger than the send buffer size. How the user message
is fragmented in DATA chunks is transparent to the upper layer.

Does this make things clearer?

Best regards
Michael
> snippet:
> "Sending a message using sendmsg() is atomic unless explicit end of
> record (EOR) marking is enabled on the socket specified by sd (see
> Section 8.1.26)."
> 
> Best regards,
> Marcelo
> 
>> Best regards
>> Michael
>>> 
>>> Given the current implementation you can get almost the required
>>> behaviour by turning nagle off and on repeatedly.
>>> 
>>> I did wonder whether the queued data could actually be picked up
>>> by a Heartbeat chunk that is probing a different remote address
>>> (which would be bad news).
>>> 
>>> 	David
>>> 
>> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 23:36         ` Michael Tuexen
@ 2016-01-29  1:18           ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-29  1:18 UTC (permalink / raw)
  To: Michael Tuexen
  Cc: David Laight, netdev, Neil Horman, Vlad Yasevich, David Miller,
	brouer, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend, linux-sctp

On Fri, Jan 29, 2016 at 12:36:05AM +0100, Michael Tuexen wrote:
> 
> > On 28 Jan 2016, at 22:03, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> > 
> > On Thu, Jan 28, 2016 at 06:54:06PM +0100, Michael Tuexen wrote:
> >>> On 28 Jan 2016, at 14:51, David Laight <David.Laight@ACULAB.COM> wrote:
> >>> 
> >>> From: Marcelo Ricardo Leitner
> >>>> Sent: 27 January 2016 17:07
> >>>> This patchset is merely a RFC for the moment. There are some
> >>>> controversial points that I'd like to discuss before actually proposing
> >>>> the patches.
> >>> 
> >>> You also need to look at how a 'user' can actually get SCTP to
> >>> merge data chunks in the first place.
> >>> 
> >>> With Nagle disabled (and it probably has to be since the data flow
> >>> is unlikely to be 'command-response' or 'unidirectional bulk')
> >>> it is currently almost impossible to get more than one chunk
> >>> into an ethernet frame.
> >>> 
> >>> Support for MSG_MORE would help.
> >> What about adding support for the explicit EOR mode as specified in
> >> https://tools.ietf.org/html/rfc6458#section-8.1.26
> > 
> > Seizing the moment to clarify my understanding on that. :)
> > Such multiple calls to send system calls will result in a single data
> > chunk. Is that so? That's what I get from that text and also from this
> No. It results in a single user message. This means you can send
> a user message larger than the send buffer size. How the user message
> is fragmented in DATA chunks is transparent to the upper layer.
> 
> Does this make things clearer?

I think so, yes. So it allows delaying setting the Ending fragment bit
until the application sets SCTP_EOR. All the rest before this stays as
before: the first send() will generate a chunk with the Beginning bit set
and may generate some other middle fragments (no B nor E bit set) if
necessary, the second to (N-1)th calls to send will generate only middle
fragments, while the last send, with SCTP_EOR, will then set the Ending
fragment bit in the last one. Right?

Thanks,
Marcelo

> 
> Best regards
> Michael
> > snippet:
> > "Sending a message using sendmsg() is atomic unless explicit end of
> > record (EOR) marking is enabled on the socket specified by sd (see
> > Section 8.1.26)."
> > 
> > Best regards,
> > Marcelo
> > 
> >> Best regards
> >> Michael
> >>> 
> >>> Given the current implementation you can get almost the required
> >>> behaviour by turning nagle off and on repeatedly.
> >>> 
> >>> I did wonder whether the queued data could actually be picked up
> >>> by a Heartbeat chunk that is probing a different remote address
> >>> (which would be bad news).
> >>> 
> >>> 	David
> >>> 
> >> 
> > 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-29  1:18           ` Marcelo Ricardo Leitner
@ 2016-01-29 10:57             ` Michael Tuexen
  -1 siblings, 0 replies; 49+ messages in thread
From: Michael Tuexen @ 2016-01-29 10:57 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: David Laight, netdev, Neil Horman, Vlad Yasevich, David Miller,
	brouer, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend, linux-sctp

> On 29 Jan 2016, at 02:18, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> 
> On Fri, Jan 29, 2016 at 12:36:05AM +0100, Michael Tuexen wrote:
>> 
>>> On 28 Jan 2016, at 22:03, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>>> 
>>> On Thu, Jan 28, 2016 at 06:54:06PM +0100, Michael Tuexen wrote:
>>>>> On 28 Jan 2016, at 14:51, David Laight <David.Laight@ACULAB.COM> wrote:
>>>>> 
>>>>> From: Marcelo Ricardo Leitner
>>>>>> Sent: 27 January 2016 17:07
>>>>>> This patchset is merely a RFC for the moment. There are some
>>>>>> controversial points that I'd like to discuss before actually proposing
>>>>>> the patches.
>>>>> 
>>>>> You also need to look at how a 'user' can actually get SCTP to
>>>>> merge data chunks in the first place.
>>>>> 
>>>>> With Nagle disabled (and it probably has to be since the data flow
>>>>> is unlikely to be 'command-response' or 'unidirectional bulk')
>>>>> it is currently almost impossible to get more than one chunk
>>>>> into an ethernet frame.
>>>>> 
>>>>> Support for MSG_MORE would help.
>>>> What about adding support for the explicit EOR mode as specified in
>>>> https://tools.ietf.org/html/rfc6458#section-8.1.26
>>> 
>>> Seizing the moment to clarify my understanding on that. :)
>>> Such multiple calls to send system calls will result in a single data
>>> chunk. Is that so? That's what I get from that text and also from this
>> No. It results in a single user message. This means you can send
>> a user message larger than the send buffer size. How the user message
>> is fragmented in DATA chunks is transparent to the upper layer.
>> 
>> Does this make things clearer?
> 
> I think so, yes. So it allows delaying setting the Ending fragment bit
> until the application sets SCTP_EOR. All the rest before this stays as
> before: the first send() will generate a chunk with the Beginning bit set
> and may generate some other middle fragments (no B nor E bit set) if
> necessary, the second to (N-1)th calls to send will generate only middle
> fragments, while the last send, with SCTP_EOR, will then set the Ending
> fragment bit in the last one. Right?
Yes. But there are no restrictions on the user data provided in send()
calls and DATA chunks. So you can
send(100000 byte, no SCTP_EOR)
resulting in one DATA chunk with the B bit, several with no B and no E bit.
send(100000 byte, no SCTP_EOR)
resulting in several chunks with no B and no E bit.
send(100000 byte, SCTP_EOR)
resulting in several chunks with no B and no E bit and one (the last) chunk
with the E bit.

On the other hand you can do
send(1 byte, no SCTP_EOR)
resulting in a single DATA chunk with the B bit set.
send(1 byte, no SCTP_EOR)
send(1 byte, no SCTP_EOR)
send(1 byte, no SCTP_EOR)
send(1 byte, no SCTP_EOR)
send(1 byte, no SCTP_EOR)
All resulting in a single DATA chunk with 5 bytes user data and no B or E bit.
(For example if Nagle is enabled and only after the last send call the SACK arrives).
send(1 byte, SCTP_EOR)
results in a single DATA chunk with the E bit set.

Best regards
Michael
> 
> Thanks,
> Marcelo
> 
>> 
>> Best regards
>> Michael
>>> snippet:
>>> "Sending a message using sendmsg() is atomic unless explicit end of
>>> record (EOR) marking is enabled on the socket specified by sd (see
>>> Section 8.1.26)."
>>> 
>>> Best regards,
>>> Marcelo
>>> 
>>>> Best regards
>>>> Michael
>>>>> 
>>>>> Given the current implementation you can get almost the required
>>>>> behaviour by turning nagle off and on repeatedly.
>>>>> 
>>>>> I did wonder whether the queued data could actually be picked up
>>>>> by a Heartbeat chunk that is probing a different remote address
>>>>> (which would be bad news).
>>>>> 
>>>>> 	David
>>>>> 
>>>> 
>>> 
>> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-29 10:57             ` Michael Tuexen
@ 2016-01-29 11:26               ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-29 11:26 UTC (permalink / raw)
  To: Michael Tuexen
  Cc: David Laight, netdev, Neil Horman, Vlad Yasevich, David Miller,
	brouer, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend, linux-sctp

On Fri, Jan 29, 2016 at 11:57:46AM +0100, Michael Tuexen wrote:
> > On 29 Jan 2016, at 02:18, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> > 
> > On Fri, Jan 29, 2016 at 12:36:05AM +0100, Michael Tuexen wrote:
> >> 
> >>> On 28 Jan 2016, at 22:03, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> >>> 
> >>> On Thu, Jan 28, 2016 at 06:54:06PM +0100, Michael Tuexen wrote:
> >>>>> On 28 Jan 2016, at 14:51, David Laight <David.Laight@ACULAB.COM> wrote:
> >>>>> 
> >>>>> From: Marcelo Ricardo Leitner
> >>>>>> Sent: 27 January 2016 17:07
> >>>>>> This patchset is merely a RFC for the moment. There are some
> >>>>>> controversial points that I'd like to discuss before actually proposing
> >>>>>> the patches.
> >>>>> 
> >>>>> You also need to look at how a 'user' can actually get SCTP to
> >>>>> merge data chunks in the first place.
> >>>>> 
> >>>>> With Nagle disabled (and it probably has to be since the data flow
> >>>>> is unlikely to be 'command-response' or 'unidirectional bulk')
> >>>>> it is currently almost impossible to get more than one chunk
> >>>>> into an ethernet frame.
> >>>>> 
> >>>>> Support for MSG_MORE would help.
> >>>> What about adding support for the explicit EOR mode as specified in
> >>>> https://tools.ietf.org/html/rfc6458#section-8.1.26
> >>> 
> >>> Seizing the moment to clarify my understanding on that. :)
> >>> Such multiple calls to send system calls will result in a single data
> >>> chunk. Is that so? That's what I get from that text and also from this
> >> No. It results in a single user message. This means you can send
> >> a user message larger than the send buffer size. How the user message
> >> is fragmented in DATA chunks is transparent to the upper layer.
> >> 
> >> Does this make things clearer?
> > 
> > I think so, yes. So it allows delaying setting the Ending fragment bit
> > until the application sets SCTP_EOR. All the rest before this stays as
> > before: the first send() will generate a chunk with the Beginning bit set
> > and may generate some other middle fragments (no B nor E bit set) if
> > necessary, the second to (N-1)th calls to send will generate only middle
> > fragments, while the last send, with SCTP_EOR, will then set the Ending
> > fragment bit in the last one. Right?
> Yes. But there are no restrictions on the user data provided in send()
> calls and DATA chunks. So you can
> send(100000 byte, no SCTP_EOR)
> resulting in one DATA chunk with the B bit, several with no B and no E bit.
> send(100000 byte, no SCTP_EOR)
> resulting in several chunks with no B and no E bit.
> send(100000 byte, SCTP_EOR)
> resulting in several chunks with no B and no E bit and one (the last) chunk
> with the E bit.
> 
> On the other hand you can do
> send(1 byte, no SCTP_EOR)
> resulting in a single DATA chunk with the B bit set.
> send(1 byte, no SCTP_EOR)
> send(1 byte, no SCTP_EOR)
> send(1 byte, no SCTP_EOR)
> send(1 byte, no SCTP_EOR)
> send(1 byte, no SCTP_EOR)
> All resulting in a single DATA chunk with 5 bytes user data and no B or E bit.
> (For example if Nagle is enabled and only after the last send call the SACK arrives).
> send(1 byte, SCTP_EOR)
> results in a single DATA chunk with the E bit set.

Cool, thanks Michael. It will be quite fun to mix this with MSG_MORE
logic, I think :)

Best regards,
Marcelo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-29 11:26               ` Marcelo Ricardo Leitner
@ 2016-01-29 12:25                 ` Michael Tuexen
  -1 siblings, 0 replies; 49+ messages in thread
From: Michael Tuexen @ 2016-01-29 12:25 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: David Laight, netdev, Neil Horman, Vlad Yasevich, David Miller,
	brouer, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend, linux-sctp


> On 29 Jan 2016, at 12:26, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> 
> On Fri, Jan 29, 2016 at 11:57:46AM +0100, Michael Tuexen wrote:
>>> On 29 Jan 2016, at 02:18, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>>> 
>>> On Fri, Jan 29, 2016 at 12:36:05AM +0100, Michael Tuexen wrote:
>>>> 
>>>>> On 28 Jan 2016, at 22:03, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>>>>> 
>>>>> On Thu, Jan 28, 2016 at 06:54:06PM +0100, Michael Tuexen wrote:
>>>>>>> On 28 Jan 2016, at 14:51, David Laight <David.Laight@ACULAB.COM> wrote:
>>>>>>> 
>>>>>>> From: Marcelo Ricardo Leitner
>>>>>>>> Sent: 27 January 2016 17:07
>>>>>>>> This patchset is merely a RFC for the moment. There are some
>>>>>>>> controversial points that I'd like to discuss before actually proposing
>>>>>>>> the patches.
>>>>>>> 
>>>>>>> You also need to look at how a 'user' can actually get SCTP to
>>>>>>> merge data chunks in the first place.
>>>>>>> 
>>>>>>> With Nagle disabled (and it probably has to be since the data flow
>>>>>>> is unlikely to be 'command-response' or 'unidirectional bulk')
>>>>>>> it is currently almost impossible to get more than one chunk
>>>>>>> into an ethernet frame.
>>>>>>> 
>>>>>>> Support for MSG_MORE would help.
>>>>>> What about adding support for the explicit EOR mode as specified in
>>>>>> https://tools.ietf.org/html/rfc6458#section-8.1.26
>>>>> 
>>>>> Seizing the moment to clarify my understanding on that. :)
>>>>> Such multiple calls to send system calls will result in a single data
>>>>> chunk. Is that so? That's what I get from that text and also from this
>>>> No. It results in a single user message. This means you can send
>>>> a user message larger than the send buffer size. How the user message
>>>> is fragmented in DATA chunks is transparent to the upper layer.
>>>> 
>>>> Does this make things clearer?
>>> 
>>> I think so, yes. So it allows delaying setting the Ending fragment bit
>>> until the application sets SCTP_EOR. All the rest before this stays as
>>> before: the first send() will generate a chunk with the Beginning bit set
>>> and may generate some other middle fragments (no B nor E bit set) if
>>> necessary, the second to (N-1)th calls to send will generate only middle
>>> fragments, while the last send, with SCTP_EOR, will then set the Ending
>>> fragment bit in the last one. Right?
>> Yes. But there are no restrictions on the user data provided in send()
>> calls and DATA chunks. So you can
>> send(100000 byte, no SCTP_EOR)
>> resulting in one DATA chunk with the B bit, several with no B and no E bit.
>> send(100000 byte, no SCTP_EOR)
>> resulting in several chunks with no B and no E bit.
>> send(100000 byte, SCTP_EOR)
>> resulting in several chunks with no B and no E bit and one (the last) chunk
>> with the E bit.
>> 
>> On the other hand you can do
>> send(1 byte, no SCTP_EOR)
>> resulting in a single DATA chunk with the E bit set.
>> send(1 byte, no SCTP_EOR)
>> send(1 byte, no SCTP_EOR)
>> send(1 byte, no SCTP_EOR)
>> send(1 byte, no SCTP_EOR)
>> send(1 byte, no SCTP_EOR)
>> All resulting in a single DATA chunk with 5 bytes user data and no B or E bit.
>> (For example if Nagle is enabled and only after the last send call the SACK arrives).
>> send(1 byte, SCTP_EOR)
>> results in a single DATA chunk with the E bist set.
> 
> Cool, thanks Michael. It will be quite fun to mix this with MSG_MORE
> logic, I think :)
Don't know. In FreeBSD we do support SCTP_EOR, but not MSG_MORE, which seems
to be Linux specific.

Best regards
Michael
> 
> Best regards,
> Marcelo
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
@ 2016-01-29 12:25                 ` Michael Tuexen
  0 siblings, 0 replies; 49+ messages in thread
From: Michael Tuexen @ 2016-01-29 12:25 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: David Laight, netdev, Neil Horman, Vlad Yasevich, David Miller,
	brouer, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend, linux-sctp


> On 29 Jan 2016, at 12:26, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> 
> On Fri, Jan 29, 2016 at 11:57:46AM +0100, Michael Tuexen wrote:
>>> On 29 Jan 2016, at 02:18, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>>> 
>>> On Fri, Jan 29, 2016 at 12:36:05AM +0100, Michael Tuexen wrote:
>>>> 
>>>>> On 28 Jan 2016, at 22:03, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>>>>> 
>>>>> On Thu, Jan 28, 2016 at 06:54:06PM +0100, Michael Tuexen wrote:
>>>>>>> On 28 Jan 2016, at 14:51, David Laight <David.Laight@ACULAB.COM> wrote:
>>>>>>> 
>>>>>>> From: Marcelo Ricardo Leitner
>>>>>>>> Sent: 27 January 2016 17:07
>>>>>>>> This patchset is merely a RFC for the moment. There are some
>>>>>>>> controversial points that I'd like to discuss before actually proposing
>>>>>>>> the patches.
>>>>>>> 
>>>>>>> You also need to look at how a 'user' can actually get SCTP to
>>>>>>> merge data chunks in the first place.
>>>>>>> 
>>>>>>> With Nagle disabled (and it probably has to be since the data flow
>>>>>>> is unlikely to be 'command-response' or 'unidirectional bulk')
>>>>>>> it is currently almost impossible to get more than one chunk
>>>>>>> into an ethernet frame.
>>>>>>> 
>>>>>>> Support for MSG_MORE would help.
>>>>>> What about adding support for the explicit EOR mode as specified in
>>>>>> https://tools.ietf.org/html/rfc6458#section-8.1.26
>>>>> 
>>>>> Seizing the moment to clarify my understanding on that. :)
>>>>> Such multiple calls to send system calls will result in a single data
>>>>> chunk. Is that so? That's what I get from that text and also from this
>>>> No. It results in a single user message. This means you can send
>>>> a user message larger than the send buffer size. How the user message
>>>> is fragmented in DATA chunks is transparent to the upper layer.
>>>> 
>>>> Does this make things clearer?
>>> 
>>> I think so, yes. So it allows delaying setting the Ending fragment bit
>>> until the application set SCTP_EOR. All the rest before this stays as
>>> before: first send() will generate a chunk with Beginning bit set and
>>> may generate some other middle-fragments (no B nor E bit set) if
>>> necessary, second to N-1 call to send will generate only middle
>>> fragments, while the last send, with SCTP_EOR, will then set the Ending
>>> fragment in the last one. Right?
>> Yes. But there are no restrictions on the user data provided in send()
>> calls and DATA chunks. So you can
>> send(100000 byte, no SCTP_EOR)
>> resulting in one DATA chunk with the B bit, several with no B and no E bit.
>> send(100000 byte, no SCTP_EOR)
>> resulting in several chunks with no B and no E bit.
>> send(100000 byte, SCTP_EOR)
>> resulting in several chunks with no B and no E bit and one (the last) chunk
>> with the E bit.
>> 
>> On the other hand you can do
>> send(1 byte, no SCTP_EOR)
>> resulting in a single DATA chunk with the E bit set.
>> send(1 byte, no SCTP_EOR)
>> send(1 byte, no SCTP_EOR)
>> send(1 byte, no SCTP_EOR)
>> send(1 byte, no SCTP_EOR)
>> send(1 byte, no SCTP_EOR)
>> All resulting in a single DATA chunk with 5 bytes user data and no B or E bit.
>> (For example if Nagle is enabled and only after the last send call the SACK arrives).
>> send(1 byte, SCTP_EOR)
>> results in a single DATA chunk with the E bit set.
> 
> Cool, thanks Michael. It will be quite fun to mix this with MSG_MORE
> logic, I think :)
Don't know. In FreeBSD we do support SCTP_EOR, but not MSG_MORE, which seems
to be Linux specific.

Best regards
Michael
> 
> Best regards,
> Marcelo
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 20:55         ` 'Marcelo Ricardo Leitner'
  (?)
@ 2016-01-29 15:51         ` David Laight
  2016-01-29 18:53             ` 'Marcelo Ricardo Leitner'
  -1 siblings, 1 reply; 49+ messages in thread
From: David Laight @ 2016-01-29 15:51 UTC (permalink / raw)
  To: 'Marcelo Ricardo Leitner'
  Cc: netdev, Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

From: 'Marcelo Ricardo Leitner'
> Sent: 28 January 2016 20:56
> On Thu, Jan 28, 2016 at 05:30:24PM +0000, David Laight wrote:
> > From: 'Marcelo Ricardo Leitner'
> > > Sent: 28 January 2016 15:53
> > > On Thu, Jan 28, 2016 at 01:51:02PM +0000, David Laight wrote:
...
> > > > With Nagle disabled (and it probably has to be since the data flow
> > > > is unlikely to be 'command-response' or 'unidirectional bulk')
> > > > it is currently almost impossible to get more than one chunk
> > > > into an ethernet frame.
> > > >
> > > > Support for MSG_MORE would help.
> > > >
> > > > Given the current implementation you can get almost the required
> > > > behaviour by turning nagle off and on repeatedly.
> > >
> > > That's pretty much expected, I think. Without Nagle, if bandwidth and
> > > cwnd allow, the segment will be sent. GSO by itself shouldn't cause
> > > buffering to protect from that.
> > >
> > > If something causes a bottleneck, tx may get queued up.  Like if I do a
> > > stress test on my system, generally the receiver side is slower than the
> > > sender, so I end up having tx buffers pretty easily. It mimics bandwidth
> > > restrictions.
> >
> > Imagine using M2UA to connect local machines (one running mtp3, the other mtp2).
> > Configure two linksets of 16 signalling links and perform a double-reflect
> > loopback test.
> > The SCTP connection won't ever saturate, so every msu ends up in its own
> > ethernet packet.
> > It is easy to generate 1000's of ethernet frames/sec on a single connection.
> >
> > (We do this with something not entirely quite like M2UA over TCP,
> > even then it is very hard to get multiple messages into a single
> > ethernet frame.)
> >
> 
> Agreed, GSO won't help much in there without a corking feature like
> MSG_MORE or that on/off switch on Nagle you mentioned.
> 
> The thing about this (and also GRO) is identifying how much time can be
> spent waiting for the next chunk/packet without causing issues to the
> application. Nagle is there and helps quite a lot, but timing-sensitive
> applications will turn it off.

Nagle is only any good for unidirectional data (think file transfer)
and command-response (think telnet or rlogin); for anything else
it generates 200ms+ delays.
The SIGTRAN protocols (M3UA etc) are all specified to use SCTP
and have no real relationship between the send and receive data.
SCTP is being used to replace 64kb/s links, so a few milliseconds
of delay (1ms is 8 byte times) probably don't matter.
The Nagle timeout is, however, far too long - so it has to be disabled.

So never mind GSO to improve 'bulk' output, the big SCTP performance
issue (as I see it) is the inability for a lot of workloads to
ever get more than one small data chunk into an ethernet frame.

Which other (application) protocols are using SCTP?
It doesn't seem an appropriate protocol for bulk data.

	David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 20:55         ` 'Marcelo Ricardo Leitner'
  (?)
  (?)
@ 2016-01-29 15:57         ` David Laight
  -1 siblings, 0 replies; 49+ messages in thread
From: David Laight @ 2016-01-29 15:57 UTC (permalink / raw)
  To: 'Marcelo Ricardo Leitner'
  Cc: netdev, Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

From: 'Marcelo Ricardo Leitner'
> Sent: 28 January 2016 20:56
...
> > > > I did wonder whether the queued data could actually be picked up
> > > > be a Heartbeat chunk that is probing a different remote address
> > > > (which would be bad news).
> > >
> > > I don't follow. You mean if a heartbeat may get stuck in queue or if
> > > sending of a heartbeat can end up carrying additional data by accident?
> >
> > My suspicion was that the heartbeat would carry the queued data.
> 
I'm afraid I'm still not following, sorry. You mean that this GSO patch
would cause the heartbeat to carry queued data? If yes, no, because on
the SCTP side it mangles the packet size and makes it look bigger
instead of handling multiple packets. It will then break this large
sctp_packet into several sk_buffs and glue them together as if they were
GROed, allowing skb_segment() to just split them back. The reason the
sctp_packet is generated, be it due to user data or control chunks
like heartbeats, is not modified.

I'm thinking of the code prior to your GSO changes.

IIRC with nagle enabled data chunks are built and put onto an
internal queue to be sent later (with nagle disabled they are
sent at the end of the tx processing provided (IIRC) there is window).
Any message that forces a transmit picks up the queued data - and
I think this might include heartbeats....

I didn't read the code closely enough to find out whether this
was true or not.

	David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-28 20:55         ` 'Marcelo Ricardo Leitner'
                           ` (2 preceding siblings ...)
  (?)
@ 2016-01-29 16:07         ` David Laight
  -1 siblings, 0 replies; 49+ messages in thread
From: David Laight @ 2016-01-29 16:07 UTC (permalink / raw)
  To: 'Marcelo Ricardo Leitner'
  Cc: netdev, Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

> From: 'Marcelo Ricardo Leitner' [mailto:marcelo.leitner@gmail.com]
> Sent: 28 January 2016 20:56
...
> > > But yes, agreed, MSG_MORE is at least a welcomed compliment here,
> > > specially for applications generating a train of chunks. Will put that in
> > > my ToDo here, thanks.
> >
> > I've posted a patch in the past for MSG_MORE, didn't quite work.
> 
> Ahh cool. Can you share the archive link please? Maybe I can take it
> from there then.

I think the last record is:
https://patchwork.ozlabs.org/patch/372404

	David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 0/3] sctp: add GSO support
  2016-01-29 15:51         ` David Laight
@ 2016-01-29 18:53             ` 'Marcelo Ricardo Leitner'
  0 siblings, 0 replies; 49+ messages in thread
From: 'Marcelo Ricardo Leitner' @ 2016-01-29 18:53 UTC (permalink / raw)
  To: David Laight
  Cc: netdev, Neil Horman, Vlad Yasevich, David Miller, brouer,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, linux-sctp

On Fri, Jan 29, 2016 at 03:51:52PM +0000, David Laight wrote:
> From: 'Marcelo Ricardo Leitner'
> > Sent: 28 January 2016 20:56
> > On Thu, Jan 28, 2016 at 05:30:24PM +0000, David Laight wrote:
> > > From: 'Marcelo Ricardo Leitner'
> > > > Sent: 28 January 2016 15:53
> > > > On Thu, Jan 28, 2016 at 01:51:02PM +0000, David Laight wrote:
> ...
> > > > > With Nagle disabled (and it probably has to be since the data flow
> > > > > is unlikely to be 'command-response' or 'unidirectional bulk')
> > > > > it is currently almost impossible to get more than one chunk
> > > > > into an ethernet frame.
> > > > >
> > > > > Support for MSG_MORE would help.
> > > > >
> > > > > Given the current implementation you can get almost the required
> > > > > behaviour by turning nagle off and on repeatedly.
> > > >
> > > > That's pretty much expected, I think. Without Nagle, if bandwidth and
> > > > cwnd allow, segment will be sent. GSO by itself shouldn't cause a
> > > > buffering to protect from that.
> > > >
> > > > If something causes a bottleneck, tx may get queued up.  Like if I do a
> > > > stress test in my system, generally receiver side is slower than sender,
> > > > so I end up having tx buffers pretty easily. It mimics bandwidth
> > > > restrictions.
> > >
> > > Imagine using M2UA to connect local machines (one running mtp3, the other mtp2).
> > > Configure two linksets of 16 signalling links and perform a double-reflect
> > > loopback test.
> > > The SCTP connection won't ever saturate, so every msu ends up in its own
> > > ethernet packet.
> > > It is easy to generate 1000's of ethernet frames/sec on a single connection.
> > >
> > > (We do this with something not entirely quite like M2UA over TCP,
> > > even then it is very hard to get multiple messages into a single
> > > ethernet frame.)
> > >
> > 
> > Agreed, GSO won't help much in there without a corking feature like
> > MSG_MORE or that on/off switch on Nagle you mentioned.
> > 
> > The thing about this (and also GRO) is identifying how much time can be
> > spent waiting for the next chunk/packet without causing issues to the
> > application. Nagle is there and helps quite a lot, but timing-sensitive
> > applications will turn it off.
> 
> Nagle is only any good for unidirectional data (think file transfer)
> and command-response (think telnet or rlogin); for anything else
> it generates 200ms+ delays.
> The SIGTRAN protocols (M3UA etc) are all specified to use SCTP
> and have no real relationship between the send and receive data.
> SCTP is being used to replace 64kb/s links, so a few milliseconds
> of delay (1ms is 8 byte times) probably don't matter.
> The Nagle timeout is, however, far too long - so it has to be disabled.
> 
> So never mind GSO to improve 'bulk' output, the big SCTP performance
> issue (as I see it) is the inability for a lot of workloads to
> ever get more than one small data chunk into an ethernet frame.
> 
> Which other (application) protocols are using SCTP?
> It doesn't seem an appropriate protocol for bulk data.

David, I agree with most of what you said here and before, sorry if I
didn't make it clear earlier. I put MSG_MORE on my ToDo and I'll check
what I can do about it. I got your link from the other email, thanks.

But the usage you mention is limited to SIGTRAN work. DLM, the
distributed lock manager (filesystem stuff), uses SCTP just because it
has multi-homing out of the box, which improves cluster resilience.
I'd like to see SCTP adopted more widely, as DLM did. It has plenty of
features that are very often re-implemented by applications, like
message boundaries and ordering.

GSO should not, in any way, be an obstacle to MSG_MORE. I don't see us
having conflicting goals here. Do you?

  Marcelo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
  2016-01-27 17:06   ` Marcelo Ricardo Leitner
@ 2016-01-29 19:15     ` Alexander Duyck
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexander Duyck @ 2016-01-29 19:15 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	marek, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> This patch enables SCTP to do GSO.
>
> SCTP has this peculiarity that its packets cannot be just segmented to
> (P)MTU. Its chunks must be contained in IP segments, padding respected.
> So we can't just generate a big skb, set gso_size to the fragmentation
> point and deliver it to IP layer.
>
> Instead, this patch proposes that SCTP build a skb as it would be if it
> was received using GRO. That is, there will be a cover skb with the
> headers (including the SCTP one) and child skbs containing the actual SCTP
> chunks, already segmented in a way that respects SCTP RFCs and MTU.
>
> This way SCTP can benefit from GSO and instead of passing several
> packets through the stack, it can pass a single large packet if there
> are enough data queued and cwnd allows.
>
> Main points that need help:
> - Usage of skb_gro_receive()
>   It fits nicely in there and properly handles offsets/lens, though the
>   name means another thing. If you agree with this usage, we can rename
>   it to something like skb_coalesce
>
> - Checksum handling
>   Why can only packets with checksum offload be GSOed? Most of the
>   NICs don't support SCTP CRC offloading and this will nearly defeat
>   this feature. If checksum is being computed in sw, it doesn't really
>   matter if it's earlier or later, right?
>   This patch hacks skb_needs_check() to allow using GSO with sw-computed
>   checksums.
>   Also the meaning of UNNECESSARY and NONE are quite foggy to me yet and
>   its usage may be wrong.
>
> - gso_size = 1
>   There is skb_is_gso() all over the stack and it basically checks for
>   non-zero skb_shinfo(skb)->gso_size. Setting it to 1 is the hacky way I
>   found to keep skb_is_gso() working while being able to signal to
>   skb_segment() that it shouldn't use gso_size but instead the fragment
>   sizes themselves. skb_segment() will mainly just unpack the skb then.

Instead of 1 why not use 0xFFFF?  It is a value that can never be used
for a legitimate segment size since IP total length is a 16 bit value
and includes the IP header in the size.

> - socket / gso max values
>   usage of sk_setup_caps() still needs a review
>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  include/linux/netdev_features.h |   7 +-
>  include/linux/netdevice.h       |   1 +
>  net/core/dev.c                  |   6 +-
>  net/core/skbuff.c               |  12 +-
>  net/ipv4/af_inet.c              |   1 +
>  net/sctp/offload.c              |  53 +++++++
>  net/sctp/output.c               | 338 +++++++++++++++++++++++++---------------
>  net/sctp/socket.c               |   2 +
>  8 files changed, 292 insertions(+), 128 deletions(-)
>
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index d9654f0eecb3519383441afa6b131ff9a5898485..f678998841f1800e0f2fe416a79935197d4ed305 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -48,8 +48,9 @@ enum {
>         NETIF_F_GSO_UDP_TUNNEL_BIT,     /* ... UDP TUNNEL with TSO */
>         NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
>         NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
> +       NETIF_F_GSO_SCTP_BIT,           /* ... SCTP fragmentation */
>         /**/NETIF_F_GSO_LAST =          /* last bit, see GSO_MASK */
> -               NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
> +               NETIF_F_GSO_SCTP_BIT,
>
>         NETIF_F_FCOE_CRC_BIT,           /* FCoE CRC32 */
>         NETIF_F_SCTP_CRC_BIT,           /* SCTP checksum offload */
> @@ -119,6 +120,7 @@ enum {
>  #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
>  #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
>  #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
> +#define NETIF_F_GSO_SCTP       __NETIF_F(GSO_SCTP)
>  #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
>  #define NETIF_F_HW_VLAN_STAG_RX        __NETIF_F(HW_VLAN_STAG_RX)
>  #define NETIF_F_HW_VLAN_STAG_TX        __NETIF_F(HW_VLAN_STAG_TX)
> @@ -144,7 +146,8 @@ enum {
>
>  /* List of features with software fallbacks. */
>  #define NETIF_F_GSO_SOFTWARE   (NETIF_F_TSO | NETIF_F_TSO_ECN | \
> -                                NETIF_F_TSO6 | NETIF_F_UFO)
> +                                NETIF_F_TSO6 | NETIF_F_UFO | \
> +                                NETIF_F_GSO_SCTP)
>
>  /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
>   * set in features when NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM are set--
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 289c2314d76668b8357728382bb33d6828617458..ce14fab858bf96dd0f85aca237350c8d8317756e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3928,6 +3928,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
>         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
>         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
>         BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
> +       BUILD_BUG_ON(SKB_GSO_SCTP    != (NETIF_F_GSO_SCTP >> NETIF_F_GSO_SHIFT));
>
>         return (features & feature) == feature;
>  }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
>  static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
>  {
>         if (tx_path)
> -               return skb->ip_summed != CHECKSUM_PARTIAL;
> +               /* FIXME: Why only packets with checksum offloading are
> +                * supported for GSO?
> +                */
> +               return skb->ip_summed != CHECKSUM_PARTIAL &&
> +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
>         else
>                 return skb->ip_summed == CHECKSUM_NONE;
>  }

Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
transmit path a little while ago.  Please don't reintroduce it.

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 704b69682085dec77f3d0f990aaf0024afd705b9..96f223f8d769d2765fd64348830c76cb222906c8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3017,8 +3017,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>                 int size;
>
>                 len = head_skb->len - offset;
> -               if (len > mss)
> -                       len = mss;
> +               if (len > mss) {
> +                       /* FIXME: A define is surely welcomed, but maybe
> +                        * shinfo->txflags is better for this flag, but
> +                        * we need to expand it then
> +                        */
> +                       if (mss == 1)
> +                               len = list_skb->len;
> +                       else
> +                               len = mss;
> +               }
>

Using 0xFFFF here as a flag with the MSS value would likely be much
more readable.

>                 hsize = skb_headlen(head_skb) - offset;
>                 if (hsize < 0)
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 5c5db6636704daa0c49fc13e84b2c5b282a44ed3..ec1c779bb664d1399d74f2bd7016e30b648ce47d 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1220,6 +1220,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>                        SKB_GSO_UDP_TUNNEL |
>                        SKB_GSO_UDP_TUNNEL_CSUM |
>                        SKB_GSO_TUNNEL_REMCSUM |
> +                      SKB_GSO_SCTP |
>                        0)))
>                 goto out;
>
> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
> index 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26 100644
> --- a/net/sctp/offload.c
> +++ b/net/sctp/offload.c
> @@ -36,8 +36,61 @@
>  #include <net/sctp/checksum.h>
>  #include <net/protocol.h>
>
> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
> +{
> +       skb->ip_summed = CHECKSUM_NONE;
> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
> +}
> +

I really despise the naming of this bit here.  SCTP does not use a
checksum.  It uses a CRC.  Please don't call this a checksum as it
will just make the code really confusing.   I think the name should be
something like gso_make_crc32c.

I think we need to address the CRC issues before we can really get
into segmentation.  Specifically we need to be able to offload SCTP
and FCoE in software since they both use the CHECKSUM_PARTIAL value
and then we can start cleaning up more of this mess and move onto
segmentation.

> +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
> +                                       netdev_features_t features)
> +{
> +       struct sk_buff *segs = ERR_PTR(-EINVAL);
> +       struct sctphdr *sh;
> +
> +       sh = sctp_hdr(skb);
> +       if (!pskb_may_pull(skb, sizeof(*sh)))
> +               goto out;
> +
> +       __skb_pull(skb, sizeof(*sh));
> +
> +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
> +               /* Packet is from an untrusted source, reset gso_segs. */
> +               int type = skb_shinfo(skb)->gso_type;
> +
> +               if (unlikely(type &
> +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
> +                              0) ||
> +                            !(type & (SKB_GSO_SCTP))))
> +                       goto out;
> +
> +               /* This should not happen as no NIC has SCTP GSO
> +                * offloading, it's always via software and thus we
> +                * won't send a large packet down the stack.
> +                */
> +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is not supported.");
> +               goto out;
> +       }
> +

So what you are going to end up needing here is some way to tell the
hardware that you are doing the checksum no matter what.  There is no
value in you computing a 1's complement checksum for the payload if
you aren't going to use it.  What you can probably do is just clear
the standard checksum flags and then OR in NETIF_F_HW_CSUM if
NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
offloading the checksum.

One other bit that will make this more complicated is if we ever get
around to supporting SCTP in tunnels.  Then we will need to sort out
how things like remote checksum offload should impact SCTP, and how to
deal with needing to compute both a CRC and 1's complement checksum.
What we would probably need to do is check for encap_hdr_csum and if
it is set and we are doing SCTP then we would need to clear the
NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.

> +       segs = skb_segment(skb, features);
> +       if (IS_ERR(segs))
> +               goto out;
> +
> +       /* All that is left is update SCTP CRC if necessary */
> +       for (skb = segs; skb; skb = skb->next) {
> +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
> +                       sh = sctp_hdr(skb);
> +                       sh->checksum = sctp_gso_make_checksum(skb);
> +               }
> +       }
> +

Okay, so it looks like you are doing the right thing here and leaving
this as CHECKSUM_PARTIAL.

> +out:
> +       return segs;
> +}
> +
>  static const struct net_offload sctp_offload = {
>         .callbacks = {
> +               .gso_segment = sctp_gso_segment,
>         },
>  };
>
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index 9d610eddd19ef2320fc34ae9d91e7426ae5f50f9..5e619b1b7b47737447bce746b2420bac3427fde4 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -381,12 +381,14 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>         struct sctp_transport *tp = packet->transport;
>         struct sctp_association *asoc = tp->asoc;
>         struct sctphdr *sh;
> -       struct sk_buff *nskb;
> +       struct sk_buff *nskb = NULL, *head = NULL;
>         struct sctp_chunk *chunk, *tmp;
> -       struct sock *sk;
> +       struct sock *sk = asoc->base.sk;
>         int err = 0;
>         int padding;            /* How much padding do we need?  */
> +       int pkt_size;
>         __u8 has_data = 0;
> +       int gso = 0;
>         struct dst_entry *dst;
>         unsigned char *auth = NULL;     /* pointer to auth in skb data */
>
> @@ -396,37 +398,44 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>         if (list_empty(&packet->chunk_list))
>                 return err;
>
> -       /* Set up convenience variables... */
> +       /* TODO: double check this */
>         chunk = list_entry(packet->chunk_list.next, struct sctp_chunk, list);
>         sk = chunk->skb->sk;
> +       dst_hold(tp->dst);
> +       sk_setup_caps(sk, tp->dst);
> +
> +       if (packet->size > tp->pathmtu) {
> +               WARN_ON(packet->ipfragok);
> +               if (sk_can_gso(sk)) {
> +                       gso = 1;
> +                       pkt_size = packet->overhead;
> +               } else {
> +                       /* Something nasty happened */
> +                       /* FIXME */
> +                       printk("Damn, we can't GSO and packet is too big %d for pmtu %d.\n",
> +                              packet->size, tp->pathmtu);
> +                       goto nomem;
> +               }
> +       } else {
> +               pkt_size = packet->size;
> +       }
>
> -       /* Allocate the new skb.  */
> -       nskb = alloc_skb(packet->size + MAX_HEADER, GFP_ATOMIC);
> -       if (!nskb)
> +       /* Allocate the head skb, or main one if not in GSO */
> +       head = alloc_skb(pkt_size + MAX_HEADER, GFP_ATOMIC);
> +       if (!head)
>                 goto nomem;
> +       if (gso) {
> +               NAPI_GRO_CB(head)->last = head;
> +       } else {
> +               nskb = head;
> +       }
>
>         /* Make sure the outbound skb has enough header room reserved. */
> -       skb_reserve(nskb, packet->overhead + MAX_HEADER);
> -
> -       /* Set the owning socket so that we know where to get the
> -        * destination IP address.
> -        */
> -       sctp_packet_set_owner_w(nskb, sk);
> -
> -       if (!sctp_transport_dst_check(tp)) {
> -               sctp_transport_route(tp, NULL, sctp_sk(sk));
> -               if (asoc && (asoc->param_flags & SPP_PMTUD_ENABLE)) {
> -                       sctp_assoc_sync_pmtu(sk, asoc);
> -               }
> -       }
> -       dst = dst_clone(tp->dst);
> -       if (!dst)
> -               goto no_route;
> -       skb_dst_set(nskb, dst);
> +       skb_reserve(head, packet->overhead + MAX_HEADER);
>
>         /* Build the SCTP header.  */
> -       sh = (struct sctphdr *)skb_push(nskb, sizeof(struct sctphdr));
> -       skb_reset_transport_header(nskb);
> +       sh = (struct sctphdr *)skb_push(head, sizeof(struct sctphdr));
> +       skb_reset_transport_header(head);
>         sh->source = htons(packet->source_port);
>         sh->dest   = htons(packet->destination_port);
>
> @@ -441,90 +450,164 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>         sh->vtag     = htonl(packet->vtag);
>         sh->checksum = 0;
>
> -       /**
> -        * 6.10 Bundling
> -        *
> -        *    An endpoint bundles chunks by simply including multiple
> -        *    chunks in one outbound SCTP packet.  ...
> +       /* Set the owning socket so that we know where to get the
> +        * destination IP address.
>          */
> +       sctp_packet_set_owner_w(head, sk);
>
> -       /**
> -        * 3.2  Chunk Field Descriptions
> -        *
> -        * The total length of a chunk (including Type, Length and
> -        * Value fields) MUST be a multiple of 4 bytes.  If the length
> -        * of the chunk is not a multiple of 4 bytes, the sender MUST
> -        * pad the chunk with all zero bytes and this padding is not
> -        * included in the chunk length field.  The sender should
> -        * never pad with more than 3 bytes.
> -        *
> -        * [This whole comment explains WORD_ROUND() below.]
> -        */
> +       if (!sctp_transport_dst_check(tp)) {
> +               sctp_transport_route(tp, NULL, sctp_sk(sk));
> +               if (asoc && (asoc->param_flags & SPP_PMTUD_ENABLE)) {
> +                       sctp_assoc_sync_pmtu(sk, asoc);
> +               }
> +       }
> +       dst = dst_clone(tp->dst);
> +       if (!dst)
> +               goto no_route;
> +       skb_dst_set(head, dst);
>
>         pr_debug("***sctp_transmit_packet***\n");
>
> -       list_for_each_entry_safe(chunk, tmp, &packet->chunk_list, list) {
> -               list_del_init(&chunk->list);
> -               if (sctp_chunk_is_data(chunk)) {
> -                       /* 6.3.1 C4) When data is in flight and when allowed
> -                        * by rule C5, a new RTT measurement MUST be made each
> -                        * round trip.  Furthermore, new RTT measurements
> -                        * SHOULD be made no more than once per round-trip
> -                        * for a given destination transport address.
> -                        */
> -
> -                       if (!chunk->resent && !tp->rto_pending) {
> -                               chunk->rtt_in_progress = 1;
> -                               tp->rto_pending = 1;
> +       do {
> +               /* Set up convenience variables... */
> +               chunk = list_entry(packet->chunk_list.next, struct sctp_chunk, list);
> +               WARN_ON(sk != chunk->skb->sk); /* XXX */
> +
> +               /* Calculate packet size, so it fits in PMTU. Leave
> +                * other chunks for the next packets. */
> +               if (gso) {
> +                       pkt_size = packet->overhead;
> +                       list_for_each_entry(chunk, &packet->chunk_list, list) {
> +                               int padded = WORD_ROUND(chunk->skb->len);
> +                               if (pkt_size + padded > tp->pathmtu)
> +                                       break;
> +                               pkt_size += padded;
>                         }
>
> -                       has_data = 1;
> +                       /* Allocate the new skb.  */
> +                       nskb = alloc_skb(pkt_size + MAX_HEADER, GFP_ATOMIC);
> +
> +                       /* Make sure the outbound skb has enough header room reserved. */
> +                       if (nskb)
> +                               skb_reserve(nskb, packet->overhead + MAX_HEADER);
>                 }
> +               if (!nskb)
> +                       goto nomem;
> +
> +               /**
> +                * 3.2  Chunk Field Descriptions
> +                *
> +                * The total length of a chunk (including Type, Length and
> +                * Value fields) MUST be a multiple of 4 bytes.  If the length
> +                * of the chunk is not a multiple of 4 bytes, the sender MUST
> +                * pad the chunk with all zero bytes and this padding is not
> +                * included in the chunk length field.  The sender should
> +                * never pad with more than 3 bytes.
> +                *
> +                * [This whole comment explains WORD_ROUND() below.]
> +                */
> +
> +               pkt_size -= packet->overhead;
> +               list_for_each_entry_safe(chunk, tmp, &packet->chunk_list, list) {
> +                       list_del_init(&chunk->list);
> +                       if (sctp_chunk_is_data(chunk)) {
> +                               /* 6.3.1 C4) When data is in flight and when allowed
> +                                * by rule C5, a new RTT measurement MUST be made each
> +                                * round trip.  Furthermore, new RTT measurements
> +                                * SHOULD be made no more than once per round-trip
> +                                * for a given destination transport address.
> +                                */
> +
> +                               if (!chunk->resent && !tp->rto_pending) {
> +                                       chunk->rtt_in_progress = 1;
> +                                       tp->rto_pending = 1;
> +                               }
> +
> +                               has_data = 1;
> +                       }
> +
> +                       padding = WORD_ROUND(chunk->skb->len) - chunk->skb->len;
> +                       if (padding)
> +                               memset(skb_put(chunk->skb, padding), 0, padding);
>
> -               padding = WORD_ROUND(chunk->skb->len) - chunk->skb->len;
> -               if (padding)
> -                       memset(skb_put(chunk->skb, padding), 0, padding);
> +                       /* if this is the auth chunk that we are adding,
> +                        * store pointer where it will be added and put
> +                        * the auth into the packet.
> +                        */
> +                       if (chunk == packet->auth) {
> +                               auth = skb_tail_pointer(nskb);
> +                       }
> +
> +                       memcpy(skb_put(nskb, chunk->skb->len),
> +                                      chunk->skb->data, chunk->skb->len);
> +
> +                       pr_debug("*** Chunk:%p[%s] %s 0x%x, length:%d, chunk->skb->len:%d, "
> +                                "rtt_in_progress:%d\n", chunk,
> +                                sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)),
> +                                chunk->has_tsn ? "TSN" : "No TSN",
> +                                chunk->has_tsn ? ntohl(chunk->subh.data_hdr->tsn) : 0,
> +                                ntohs(chunk->chunk_hdr->length), chunk->skb->len,
> +                                chunk->rtt_in_progress);
> +
> +                       /*
> +                        * If this is a control chunk, this is our last
> +                        * reference. Free data chunks after they've been
> +                        * acknowledged or have failed.
> +                        * Re-queue auth chunks if needed.
> +                        */
> +                       pkt_size -= WORD_ROUND(chunk->skb->len);
> +
> +                       if (chunk == packet->auth && !list_empty(&packet->chunk_list))
> +                               list_add(&chunk->list, &packet->chunk_list);
> +                       else if (!sctp_chunk_is_data(chunk))
> +                               sctp_chunk_free(chunk);
>
> -               /* if this is the auth chunk that we are adding,
> -                * store pointer where it will be added and put
> -                * the auth into the packet.
> +                       if (!pkt_size)
> +                               break;
> +               }
> +
> +               /* SCTP-AUTH, Section 6.2
> +                *    The sender MUST calculate the MAC as described in RFC2104 [2]
> +                *    using the hash function H as described by the MAC Identifier and
> +                *    the shared association key K based on the endpoint pair shared key
> +                *    described by the shared key identifier.  The 'data' used for the
> +                *    computation of the AUTH-chunk is given by the AUTH chunk with its
> +                *    HMAC field set to zero (as shown in Figure 6) followed by all
> +                *    chunks that are placed after the AUTH chunk in the SCTP packet.
>                  */
> -               if (chunk == packet->auth)
> -                       auth = skb_tail_pointer(nskb);
> -
> -               memcpy(skb_put(nskb, chunk->skb->len),
> -                              chunk->skb->data, chunk->skb->len);
> -
> -               pr_debug("*** Chunk:%p[%s] %s 0x%x, length:%d, chunk->skb->len:%d, "
> -                        "rtt_in_progress:%d\n", chunk,
> -                        sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)),
> -                        chunk->has_tsn ? "TSN" : "No TSN",
> -                        chunk->has_tsn ? ntohl(chunk->subh.data_hdr->tsn) : 0,
> -                        ntohs(chunk->chunk_hdr->length), chunk->skb->len,
> -                        chunk->rtt_in_progress);
> -
> -               /*
> -                * If this is a control chunk, this is our last
> -                * reference. Free data chunks after they've been
> -                * acknowledged or have failed.
> +               if (auth)
> +                       sctp_auth_calculate_hmac(asoc, nskb,
> +                                               (struct sctp_auth_chunk *)auth,
> +                                               GFP_ATOMIC);
> +
> +               /* Set up the IP options.  */
> +               /* BUG: not implemented
> +                * For v4 this all lives somewhere in sk->sk_opt...
>                  */
> -               if (!sctp_chunk_is_data(chunk))
> -                       sctp_chunk_free(chunk);
> -       }
>
> -       /* SCTP-AUTH, Section 6.2
> -        *    The sender MUST calculate the MAC as described in RFC2104 [2]
> -        *    using the hash function H as described by the MAC Identifier and
> -        *    the shared association key K based on the endpoint pair shared key
> -        *    described by the shared key identifier.  The 'data' used for the
> -        *    computation of the AUTH-chunk is given by the AUTH chunk with its
> -        *    HMAC field set to zero (as shown in Figure 6) followed by all
> -        *    chunks that are placed after the AUTH chunk in the SCTP packet.
> -        */
> -       if (auth)
> -               sctp_auth_calculate_hmac(asoc, nskb,
> -                                       (struct sctp_auth_chunk *)auth,
> -                                       GFP_ATOMIC);
> +               /* Dump that on IP!  */
> +               if (asoc) {
> +                       asoc->stats.opackets++;
> +                       if (asoc->peer.last_sent_to != tp)
> +                               /* Considering the multiple CPU scenario, this is a
> +                                * "correcter" place for last_sent_to.  --xguo
> +                                */
> +                               asoc->peer.last_sent_to = tp;
> +               }
> +
> +
> +               if (!gso ||
> +                   skb_shinfo(head)->gso_segs >= sk->sk_gso_max_segs)
> +//                 head->len + asoc->pathmtu >= sk->sk_gso_max_size)
> +                       break;
> +
> +               if (skb_gro_receive(&head, nskb))
> +                       goto nomem;
> +               skb_shinfo(head)->gso_segs++;
> +               /* FIXME: below is a lie */
> +               skb_shinfo(head)->gso_size = 1;
> +               nskb = NULL;
> +       } while (!list_empty(&packet->chunk_list));
>
>         /* 2) Calculate the Adler-32 checksum of the whole packet,
>          *    including the SCTP common header and all the
> @@ -532,16 +615,21 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>          *
>          * Note: Adler-32 is no longer applicable, as has been replaced
>          * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
> +        *
> +        * If it's a GSO packet, it's postponed to sctp_skb_segment.
>          */
> -       if (!sctp_checksum_disable) {
> +       if (!sctp_checksum_disable || gso) {
>                 if (!(dst->dev->features & NETIF_F_SCTP_CRC) ||
>                     (dst_xfrm(dst) != NULL) || packet->ipfragok) {
> -                       sh->checksum = sctp_compute_cksum(nskb, 0);
> +                       if (!gso)
> +                               sh->checksum = sctp_compute_cksum(head, 0);
> +                       else
> +                               head->ip_summed = CHECKSUM_UNNECESSARY;
>                 } else {
>                         /* no need to seed pseudo checksum for SCTP */
> -                       nskb->ip_summed = CHECKSUM_PARTIAL;
> -                       nskb->csum_start = skb_transport_header(nskb) - nskb->head;
> -                       nskb->csum_offset = offsetof(struct sctphdr, checksum);
> +                       head->ip_summed = CHECKSUM_PARTIAL;
> +                       head->csum_start = skb_transport_header(head) - head->head;
> +                       head->csum_offset = offsetof(struct sctphdr, checksum);
>                 }
>         }
>
> @@ -557,22 +645,7 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>          * Note: The works for IPv6 layer checks this bit too later
>          * in transmission.  See IP6_ECN_flow_xmit().
>          */
> -       tp->af_specific->ecn_capable(nskb->sk);
> -
> -       /* Set up the IP options.  */
> -       /* BUG: not implemented
> -        * For v4 this all lives somewhere in sk->sk_opt...
> -        */
> -
> -       /* Dump that on IP!  */
> -       if (asoc) {
> -               asoc->stats.opackets++;
> -               if (asoc->peer.last_sent_to != tp)
> -                       /* Considering the multiple CPU scenario, this is a
> -                        * "correcter" place for last_sent_to.  --xguo
> -                        */
> -                       asoc->peer.last_sent_to = tp;
> -       }
> +       tp->af_specific->ecn_capable(head->sk);
>
>         if (has_data) {
>                 struct timer_list *timer;
> @@ -589,16 +662,23 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>                 }
>         }
>
> -       pr_debug("***sctp_transmit_packet*** skb->len:%d\n", nskb->len);
> +       pr_debug("***sctp_transmit_packet*** skb->len:%d\n", head->len);
>
> -       nskb->ignore_df = packet->ipfragok;
> -       tp->af_specific->sctp_xmit(nskb, tp);
> +       head->ignore_df = packet->ipfragok;
> +       printk("%s %d %d %d\n", __func__, head->len,
> +              packet->transport->pathmtu,
> +              packet->transport->pathmtu - packet->overhead);
> +       if (gso)
> +               skb_shinfo(head)->gso_type = SKB_GSO_SCTP;
> +       tp->af_specific->sctp_xmit(head, tp);
>
>  out:
>         sctp_packet_reset(packet);
> +       sk_dst_reset(sk); /* FIXME: double check */
>         return err;
>  no_route:
>         kfree_skb(nskb);
> +       kfree_skb(head);
>
>         if (asoc)
>                 IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
> @@ -635,7 +715,7 @@ nomem:
>  static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
>                                            struct sctp_chunk *chunk)
>  {
> -       size_t datasize, rwnd, inflight, flight_size;
> +       size_t datasize, rwnd, inflight, flight_size, maxsize;
>         struct sctp_transport *transport = packet->transport;
>         struct sctp_association *asoc = transport->asoc;
>         struct sctp_outq *q = &asoc->outqueue;
> @@ -705,7 +785,15 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
>         /* Check whether this chunk and all the rest of pending data will fit
>          * or delay in hopes of bundling a full sized packet.
>          */
> -       if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead)
> +       if (packet->ipfragok) {
> +               /* Means chunk won't fit and needs fragmentation at
> +                * transport level, so we can't do GSO.
> +                */
> +               maxsize = transport->pathmtu;
> +       } else {
> +               maxsize = transport->dst->dev->gso_max_size;
> +       }
> +       if (chunk->skb->len + q->out_qlen >= maxsize - packet->overhead)
>                 /* Enough data queued to fill a packet */
>                 return SCTP_XMIT_OK;
>
> @@ -764,6 +852,8 @@ static sctp_xmit_t sctp_packet_will_fit(struct sctp_packet *packet,
>
>         /* Decide if we need to fragment or resubmit later. */
>         if (too_big) {
> +               struct net_device *dev = packet->transport->dst->dev;
> +
>                 /* It's OK to fragmet at IP level if any one of the following
>                  * is true:
>                  *      1. The packet is empty (meaning this chunk is greater
> @@ -779,9 +869,11 @@ static sctp_xmit_t sctp_packet_will_fit(struct sctp_packet *packet,
>                          * actually hit this condition
>                          */
>                         packet->ipfragok = 1;
> -               } else {
> +               } else if (psize + chunk_len > dev->gso_max_size - packet->overhead) {
> +                       /* Hit GSO limit, gotta flush */
>                         retval = SCTP_XMIT_PMTU_FULL;
>                 }
> +               /* Otherwise it will fit in the GSO packet */
>         }
>
>         return retval;
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 5ca2ebfe0be83882fcb841de6fa8029b6455ef85..064e5d375e612f2ec745f384d35f0e4c6b96212c 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4001,6 +4001,8 @@ static int sctp_init_sock(struct sock *sk)
>                 return -ESOCKTNOSUPPORT;
>         }
>
> +       sk->sk_gso_type = SKB_GSO_SCTP;
> +
>         /* Initialize default send parameters. These parameters can be
>          * modified with the SCTP_DEFAULT_SEND_PARAM socket option.
>          */
> --
> 2.5.0
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
@ 2016-01-29 19:15     ` Alexander Duyck
  0 siblings, 0 replies; 49+ messages in thread
From: Alexander Duyck @ 2016-01-29 19:15 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	marek, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> This patch enables SCTP to do GSO.
>
> SCTP has this peculiarity that its packets cannot be just segmented to
> (P)MTU. Its chunks must be contained in IP segments, padding respected.
> So we can't just generate a big skb, set gso_size to the fragmentation
> point and deliver it to IP layer.
>
> Instead, this patch proposes that SCTP build a skb as it would be if it
> was received using GRO. That is, there will be a cover skb with the
> headers (including the SCTP one) and child skbs containing the actual SCTP
> chunks, already segmented in a way that respects SCTP RFCs and MTU.
>
> This way SCTP can benefit from GSO and instead of passing several
> packets through the stack, it can pass a single large packet if there
> are enough data queued and cwnd allows.
>
> Main points that need help:
> - Usage of skb_gro_receive()
>   It fits nicely in there and properly handles offsets/lens, though the
>   name means another thing. If you agree with this usage, we can rename
>   it to something like skb_coalesce
>
> - Checksum handling
>   Why can only packets with offloaded checksums be GSOed? Most NICs
>   don't support SCTP CRC offloading, which would nearly defeat this
>   feature. If the checksum is computed in software, it doesn't really
>   matter if it's earlier or later, right?
>   This patch hacks skb_needs_check() to allow using GSO with sw-computed
>   checksums.
>   Also the meaning of UNNECESSARY and NONE are quite foggy to me yet and
>   its usage may be wrong.
>
> - gso_size = 1
>   There is skb_is_gso() all over the stack and it basically checks for
>   non-zero skb_shinfo(skb)->gso_size. Setting it to 1 is the hacky way I
>   found to keep skb_is_gso() working while being able to signal to
>   skb_segment() that it shouldn't use gso_size but instead the fragment
>   sizes themselves. skb_segment() will mainly just unpack the skb then.

Instead of 1 why not use 0xFFFF?  It is a value that can never be used
for a legitimate segment size since IP total length is a 16 bit value
and includes the IP header in the size.

> - socket / gso max values
>   usage of sk_setup_caps() still needs a review
>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  include/linux/netdev_features.h |   7 +-
>  include/linux/netdevice.h       |   1 +
>  net/core/dev.c                  |   6 +-
>  net/core/skbuff.c               |  12 +-
>  net/ipv4/af_inet.c              |   1 +
>  net/sctp/offload.c              |  53 +++++++
>  net/sctp/output.c               | 338 +++++++++++++++++++++++++---------------
>  net/sctp/socket.c               |   2 +
>  8 files changed, 292 insertions(+), 128 deletions(-)
>
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index d9654f0eecb3519383441afa6b131ff9a5898485..f678998841f1800e0f2fe416a79935197d4ed305 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -48,8 +48,9 @@ enum {
>         NETIF_F_GSO_UDP_TUNNEL_BIT,     /* ... UDP TUNNEL with TSO */
>         NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
>         NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
> +       NETIF_F_GSO_SCTP_BIT,           /* ... SCTP fragmentation */
>         /**/NETIF_F_GSO_LAST =          /* last bit, see GSO_MASK */
> -               NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
> +               NETIF_F_GSO_SCTP_BIT,
>
>         NETIF_F_FCOE_CRC_BIT,           /* FCoE CRC32 */
>         NETIF_F_SCTP_CRC_BIT,           /* SCTP checksum offload */
> @@ -119,6 +120,7 @@ enum {
>  #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
>  #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
>  #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
> +#define NETIF_F_GSO_SCTP       __NETIF_F(GSO_SCTP)
>  #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
>  #define NETIF_F_HW_VLAN_STAG_RX        __NETIF_F(HW_VLAN_STAG_RX)
>  #define NETIF_F_HW_VLAN_STAG_TX        __NETIF_F(HW_VLAN_STAG_TX)
> @@ -144,7 +146,8 @@ enum {
>
>  /* List of features with software fallbacks. */
>  #define NETIF_F_GSO_SOFTWARE   (NETIF_F_TSO | NETIF_F_TSO_ECN | \
> -                                NETIF_F_TSO6 | NETIF_F_UFO)
> +                                NETIF_F_TSO6 | NETIF_F_UFO | \
> +                                NETIF_F_GSO_SCTP)
>
>  /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
>   * set in features when NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM are set--
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 289c2314d76668b8357728382bb33d6828617458..ce14fab858bf96dd0f85aca237350c8d8317756e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3928,6 +3928,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
>         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
>         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
>         BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
> +       BUILD_BUG_ON(SKB_GSO_SCTP    != (NETIF_F_GSO_SCTP >> NETIF_F_GSO_SHIFT));
>
>         return (features & feature) == feature;
>  }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
>  static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
>  {
>         if (tx_path)
> -               return skb->ip_summed != CHECKSUM_PARTIAL;
> +               /* FIXME: Why only packets with checksum offloading are
> +                * supported for GSO?
> +                */
> +               return skb->ip_summed != CHECKSUM_PARTIAL &&
> +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
>         else
>                 return skb->ip_summed == CHECKSUM_NONE;
>  }

Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
transmit path a little while ago.  Please don't reintroduce it.

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 704b69682085dec77f3d0f990aaf0024afd705b9..96f223f8d769d2765fd64348830c76cb222906c8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3017,8 +3017,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>                 int size;
>
>                 len = head_skb->len - offset;
> -               if (len > mss)
> -                       len = mss;
> +               if (len > mss) {
> +                       /* FIXME: A define is surely welcomed, but maybe
> +                        * shinfo->txflags is better for this flag, but
> +                        * we need to expand it then
> +                        */
> +                       if (mss == 1)
> +                               len = list_skb->len;
> +                       else
> +                               len = mss;
> +               }
>

Using 0xFFFF here as a flag with the MSS value would likely be much
more readable.

>                 hsize = skb_headlen(head_skb) - offset;
>                 if (hsize < 0)
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 5c5db6636704daa0c49fc13e84b2c5b282a44ed3..ec1c779bb664d1399d74f2bd7016e30b648ce47d 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1220,6 +1220,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>                        SKB_GSO_UDP_TUNNEL |
>                        SKB_GSO_UDP_TUNNEL_CSUM |
>                        SKB_GSO_TUNNEL_REMCSUM |
> +                      SKB_GSO_SCTP |
>                        0)))
>                 goto out;
>
> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
> index 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26 100644
> --- a/net/sctp/offload.c
> +++ b/net/sctp/offload.c
> @@ -36,8 +36,61 @@
>  #include <net/sctp/checksum.h>
>  #include <net/protocol.h>
>
> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
> +{
> +       skb->ip_summed = CHECKSUM_NONE;
> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
> +}
> +

I really despise the naming of this bit here.  SCTP does not use a
checksum.  It uses a CRC.  Please don't call this a checksum as it
will just make the code really confusing.   I think the name should be
something like gso_make_crc32c.

I think we need to address the CRC issues before we can really get
into segmentation.  Specifically we need to be able to offload SCTP
and FCoE in software since they both use the CHECKSUM_PARTIAL value
and then we can start cleaning up more of this mess and move onto
segmentation.

> +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
> +                                       netdev_features_t features)
> +{
> +       struct sk_buff *segs = ERR_PTR(-EINVAL);
> +       struct sctphdr *sh;
> +
> +       sh = sctp_hdr(skb);
> +       if (!pskb_may_pull(skb, sizeof(*sh)))
> +               goto out;
> +
> +       __skb_pull(skb, sizeof(*sh));
> +
> +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
> +               /* Packet is from an untrusted source, reset gso_segs. */
> +               int type = skb_shinfo(skb)->gso_type;
> +
> +               if (unlikely(type &
> +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
> +                              0) ||
> +                            !(type & (SKB_GSO_SCTP))))
> +                       goto out;
> +
> +               /* This should not happen as no NIC has SCTP GSO
> +                * offloading, it's always via software and thus we
> +                * won't send a large packet down the stack.
> +                */
> +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is not supported.");
> +               goto out;
> +       }
> +

So what you are going to end up needing here is some way to tell the
hardware that you are doing the checksum no matter what.  There is no
value in you computing a 1's complement checksum for the payload if
you aren't going to use it.  What you can probably do is just clear
the standard checksum flags and then OR in NETIF_F_HW_CSUM if
NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
offloading the checksum.

One other bit that will make this more complicated is if we ever get
around to supporting SCTP in tunnels.  Then we will need to sort out
how things like remote checksum offload should impact SCTP, and how to
deal with needing to compute both a CRC and 1's complement checksum.
What we would probably need to do is check for encap_hdr_csum and if
it is set and we are doing SCTP then we would need to clear the
NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.

> +       segs = skb_segment(skb, features);
> +       if (IS_ERR(segs))
> +               goto out;
> +
> +       /* All that is left is update SCTP CRC if necessary */
> +       for (skb = segs; skb; skb = skb->next) {
> +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
> +                       sh = sctp_hdr(skb);
> +                       sh->checksum = sctp_gso_make_checksum(skb);
> +               }
> +       }
> +

Okay, so it looks like you are doing the right thing here and leaving
this as CHECKSUM_PARTIAL.

> +out:
> +       return segs;
> +}
> +
>  static const struct net_offload sctp_offload = {
>         .callbacks = {
> +               .gso_segment = sctp_gso_segment,
>         },
>  };
>
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index 9d610eddd19ef2320fc34ae9d91e7426ae5f50f9..5e619b1b7b47737447bce746b2420bac3427fde4 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -381,12 +381,14 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>         struct sctp_transport *tp = packet->transport;
>         struct sctp_association *asoc = tp->asoc;
>         struct sctphdr *sh;
> -       struct sk_buff *nskb;
> +       struct sk_buff *nskb = NULL, *head = NULL;
>         struct sctp_chunk *chunk, *tmp;
> -       struct sock *sk;
> +       struct sock *sk = asoc->base.sk;
>         int err = 0;
>         int padding;            /* How much padding do we need?  */
> +       int pkt_size;
>         __u8 has_data = 0;
> +       int gso = 0;
>         struct dst_entry *dst;
>         unsigned char *auth = NULL;     /* pointer to auth in skb data */
>
> @@ -396,37 +398,44 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>         if (list_empty(&packet->chunk_list))
>                 return err;
>
> -       /* Set up convenience variables... */
> +       /* TODO: double check this */
>         chunk = list_entry(packet->chunk_list.next, struct sctp_chunk, list);
>         sk = chunk->skb->sk;
> +       dst_hold(tp->dst);
> +       sk_setup_caps(sk, tp->dst);
> +
> +       if (packet->size > tp->pathmtu) {
> +               WARN_ON(packet->ipfragok);
> +               if (sk_can_gso(sk)) {
> +                       gso = 1;
> +                       pkt_size = packet->overhead;
> +               } else {
> +                       /* Something nasty happened */
> +                       /* FIXME */
> +                       printk("Damn, we can't GSO and packet is too big %d for pmtu %d.\n",
> +                              packet->size, tp->pathmtu);
> +                       goto nomem;
> +               }
> +       } else {
> +               pkt_size = packet->size;
> +       }
>
> -       /* Allocate the new skb.  */
> -       nskb = alloc_skb(packet->size + MAX_HEADER, GFP_ATOMIC);
> -       if (!nskb)
> +       /* Allocate the head skb, or main one if not in GSO */
> +       head = alloc_skb(pkt_size + MAX_HEADER, GFP_ATOMIC);
> +       if (!head)
>                 goto nomem;
> +       if (gso) {
> +               NAPI_GRO_CB(head)->last = head;
> +       } else {
> +               nskb = head;
> +       }
>
>         /* Make sure the outbound skb has enough header room reserved. */
> -       skb_reserve(nskb, packet->overhead + MAX_HEADER);
> -
> -       /* Set the owning socket so that we know where to get the
> -        * destination IP address.
> -        */
> -       sctp_packet_set_owner_w(nskb, sk);
> -
> -       if (!sctp_transport_dst_check(tp)) {
> -               sctp_transport_route(tp, NULL, sctp_sk(sk));
> -               if (asoc && (asoc->param_flags & SPP_PMTUD_ENABLE)) {
> -                       sctp_assoc_sync_pmtu(sk, asoc);
> -               }
> -       }
> -       dst = dst_clone(tp->dst);
> -       if (!dst)
> -               goto no_route;
> -       skb_dst_set(nskb, dst);
> +       skb_reserve(head, packet->overhead + MAX_HEADER);
>
>         /* Build the SCTP header.  */
> -       sh = (struct sctphdr *)skb_push(nskb, sizeof(struct sctphdr));
> -       skb_reset_transport_header(nskb);
> +       sh = (struct sctphdr *)skb_push(head, sizeof(struct sctphdr));
> +       skb_reset_transport_header(head);
>         sh->source = htons(packet->source_port);
>         sh->dest   = htons(packet->destination_port);
>
> @@ -441,90 +450,164 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>         sh->vtag     = htonl(packet->vtag);
>         sh->checksum = 0;
>
> -       /**
> -        * 6.10 Bundling
> -        *
> -        *    An endpoint bundles chunks by simply including multiple
> -        *    chunks in one outbound SCTP packet.  ...
> +       /* Set the owning socket so that we know where to get the
> +        * destination IP address.
>          */
> +       sctp_packet_set_owner_w(head, sk);
>
> -       /**
> -        * 3.2  Chunk Field Descriptions
> -        *
> -        * The total length of a chunk (including Type, Length and
> -        * Value fields) MUST be a multiple of 4 bytes.  If the length
> -        * of the chunk is not a multiple of 4 bytes, the sender MUST
> -        * pad the chunk with all zero bytes and this padding is not
> -        * included in the chunk length field.  The sender should
> -        * never pad with more than 3 bytes.
> -        *
> -        * [This whole comment explains WORD_ROUND() below.]
> -        */
> +       if (!sctp_transport_dst_check(tp)) {
> +               sctp_transport_route(tp, NULL, sctp_sk(sk));
> +               if (asoc && (asoc->param_flags & SPP_PMTUD_ENABLE)) {
> +                       sctp_assoc_sync_pmtu(sk, asoc);
> +               }
> +       }
> +       dst = dst_clone(tp->dst);
> +       if (!dst)
> +               goto no_route;
> +       skb_dst_set(head, dst);
>
>         pr_debug("***sctp_transmit_packet***\n");
>
> -       list_for_each_entry_safe(chunk, tmp, &packet->chunk_list, list) {
> -               list_del_init(&chunk->list);
> -               if (sctp_chunk_is_data(chunk)) {
> -                       /* 6.3.1 C4) When data is in flight and when allowed
> -                        * by rule C5, a new RTT measurement MUST be made each
> -                        * round trip.  Furthermore, new RTT measurements
> -                        * SHOULD be made no more than once per round-trip
> -                        * for a given destination transport address.
> -                        */
> -
> -                       if (!chunk->resent && !tp->rto_pending) {
> -                               chunk->rtt_in_progress = 1;
> -                               tp->rto_pending = 1;
> +       do {
> +               /* Set up convenience variables... */
> +               chunk = list_entry(packet->chunk_list.next, struct sctp_chunk, list);
> +               WARN_ON(sk != chunk->skb->sk); /* XXX */
> +
> +               /* Calculate packet size, so it fits in PMTU. Leave
> +                * other chunks for the next packets. */
> +               if (gso) {
> +                       pkt_size = packet->overhead;
> +                       list_for_each_entry(chunk, &packet->chunk_list, list) {
> +                               int padded = WORD_ROUND(chunk->skb->len);
> +                               if (pkt_size + padded > tp->pathmtu)
> +                                       break;
> +                               pkt_size += padded;
>                         }
>
> -                       has_data = 1;
> +                       /* Allocate the new skb.  */
> +                       nskb = alloc_skb(pkt_size + MAX_HEADER, GFP_ATOMIC);
> +
> +                       /* Make sure the outbound skb has enough header room reserved. */
> +                       if (nskb)
> +                               skb_reserve(nskb, packet->overhead + MAX_HEADER);
>                 }
> +               if (!nskb)
> +                       goto nomem;
> +
> +               /**
> +                * 3.2  Chunk Field Descriptions
> +                *
> +                * The total length of a chunk (including Type, Length and
> +                * Value fields) MUST be a multiple of 4 bytes.  If the length
> +                * of the chunk is not a multiple of 4 bytes, the sender MUST
> +                * pad the chunk with all zero bytes and this padding is not
> +                * included in the chunk length field.  The sender should
> +                * never pad with more than 3 bytes.
> +                *
> +                * [This whole comment explains WORD_ROUND() below.]
> +                */
> +
> +               pkt_size -= packet->overhead;
> +               list_for_each_entry_safe(chunk, tmp, &packet->chunk_list, list) {
> +                       list_del_init(&chunk->list);
> +                       if (sctp_chunk_is_data(chunk)) {
> +                               /* 6.3.1 C4) When data is in flight and when allowed
> +                                * by rule C5, a new RTT measurement MUST be made each
> +                                * round trip.  Furthermore, new RTT measurements
> +                                * SHOULD be made no more than once per round-trip
> +                                * for a given destination transport address.
> +                                */
> +
> +                               if (!chunk->resent && !tp->rto_pending) {
> +                                       chunk->rtt_in_progress = 1;
> +                                       tp->rto_pending = 1;
> +                               }
> +
> +                               has_data = 1;
> +                       }
> +
> +                       padding = WORD_ROUND(chunk->skb->len) - chunk->skb->len;
> +                       if (padding)
> +                               memset(skb_put(chunk->skb, padding), 0, padding);
>
> -               padding = WORD_ROUND(chunk->skb->len) - chunk->skb->len;
> -               if (padding)
> -                       memset(skb_put(chunk->skb, padding), 0, padding);
> +                       /* if this is the auth chunk that we are adding,
> +                        * store pointer where it will be added and put
> +                        * the auth into the packet.
> +                        */
> +                       if (chunk == packet->auth) {
> +                               auth = skb_tail_pointer(nskb);
> +                       }
> +
> +                       memcpy(skb_put(nskb, chunk->skb->len),
> +                                      chunk->skb->data, chunk->skb->len);
> +
> +                       pr_debug("*** Chunk:%p[%s] %s 0x%x, length:%d, chunk->skb->len:%d, "
> +                                "rtt_in_progress:%d\n", chunk,
> +                                sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)),
> +                                chunk->has_tsn ? "TSN" : "No TSN",
> +                                chunk->has_tsn ? ntohl(chunk->subh.data_hdr->tsn) : 0,
> +                                ntohs(chunk->chunk_hdr->length), chunk->skb->len,
> +                                chunk->rtt_in_progress);
> +
> +                       /*
> +                        * If this is a control chunk, this is our last
> +                        * reference. Free data chunks after they've been
> +                        * acknowledged or have failed.
> +                        * Re-queue auth chunks if needed.
> +                        */
> +                       pkt_size -= WORD_ROUND(chunk->skb->len);
> +
> +                       if (chunk == packet->auth && !list_empty(&packet->chunk_list))
> +                               list_add(&chunk->list, &packet->chunk_list);
> +                       else if (!sctp_chunk_is_data(chunk))
> +                               sctp_chunk_free(chunk);
>
> -               /* if this is the auth chunk that we are adding,
> -                * store pointer where it will be added and put
> -                * the auth into the packet.
> +                       if (!pkt_size)
> +                               break;
> +               }
> +
> +               /* SCTP-AUTH, Section 6.2
> +                *    The sender MUST calculate the MAC as described in RFC2104 [2]
> +                *    using the hash function H as described by the MAC Identifier and
> +                *    the shared association key K based on the endpoint pair shared key
> +                *    described by the shared key identifier.  The 'data' used for the
> +                *    computation of the AUTH-chunk is given by the AUTH chunk with its
> +                *    HMAC field set to zero (as shown in Figure 6) followed by all
> +                *    chunks that are placed after the AUTH chunk in the SCTP packet.
>                  */
> -               if (chunk == packet->auth)
> -                       auth = skb_tail_pointer(nskb);
> -
> -               memcpy(skb_put(nskb, chunk->skb->len),
> -                              chunk->skb->data, chunk->skb->len);
> -
> -               pr_debug("*** Chunk:%p[%s] %s 0x%x, length:%d, chunk->skb->len:%d, "
> -                        "rtt_in_progress:%d\n", chunk,
> -                        sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)),
> -                        chunk->has_tsn ? "TSN" : "No TSN",
> -                        chunk->has_tsn ? ntohl(chunk->subh.data_hdr->tsn) : 0,
> -                        ntohs(chunk->chunk_hdr->length), chunk->skb->len,
> -                        chunk->rtt_in_progress);
> -
> -               /*
> -                * If this is a control chunk, this is our last
> -                * reference. Free data chunks after they've been
> -                * acknowledged or have failed.
> +               if (auth)
> +                       sctp_auth_calculate_hmac(asoc, nskb,
> +                                               (struct sctp_auth_chunk *)auth,
> +                                               GFP_ATOMIC);
> +
> +               /* Set up the IP options.  */
> +               /* BUG: not implemented
> +                * For v4 this all lives somewhere in sk->sk_opt...
>                  */
> -               if (!sctp_chunk_is_data(chunk))
> -                       sctp_chunk_free(chunk);
> -       }
>
> -       /* SCTP-AUTH, Section 6.2
> -        *    The sender MUST calculate the MAC as described in RFC2104 [2]
> -        *    using the hash function H as described by the MAC Identifier and
> -        *    the shared association key K based on the endpoint pair shared key
> -        *    described by the shared key identifier.  The 'data' used for the
> -        *    computation of the AUTH-chunk is given by the AUTH chunk with its
> -        *    HMAC field set to zero (as shown in Figure 6) followed by all
> -        *    chunks that are placed after the AUTH chunk in the SCTP packet.
> -        */
> -       if (auth)
> -               sctp_auth_calculate_hmac(asoc, nskb,
> -                                       (struct sctp_auth_chunk *)auth,
> -                                       GFP_ATOMIC);
> +               /* Dump that on IP!  */
> +               if (asoc) {
> +                       asoc->stats.opackets++;
> +                       if (asoc->peer.last_sent_to != tp)
> +                               /* Considering the multiple CPU scenario, this is a
> +                                * "correcter" place for last_sent_to.  --xguo
> +                                */
> +                               asoc->peer.last_sent_to = tp;
> +               }
> +
> +
> +               if (!gso ||
> +                   skb_shinfo(head)->gso_segs >= sk->sk_gso_max_segs)
> +//                 head->len + asoc->pathmtu >= sk->sk_gso_max_size)
> +                       break;
> +
> +               if (skb_gro_receive(&head, nskb))
> +                       goto nomem;
> +               skb_shinfo(head)->gso_segs++;
> +               /* FIXME: below is a lie */
> +               skb_shinfo(head)->gso_size = 1;
> +               nskb = NULL;
> +       } while (!list_empty(&packet->chunk_list));
>
>         /* 2) Calculate the Adler-32 checksum of the whole packet,
>          *    including the SCTP common header and all the
> @@ -532,16 +615,21 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>          *
>          * Note: Adler-32 is no longer applicable, as has been replaced
>          * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
> +        *
> +        * If it's a GSO packet, it's postponed to sctp_skb_segment.
>          */
> -       if (!sctp_checksum_disable) {
> +       if (!sctp_checksum_disable || gso) {
>                 if (!(dst->dev->features & NETIF_F_SCTP_CRC) ||
>                     (dst_xfrm(dst) != NULL) || packet->ipfragok) {
> -                       sh->checksum = sctp_compute_cksum(nskb, 0);
> +                       if (!gso)
> +                               sh->checksum = sctp_compute_cksum(head, 0);
> +                       else
> +                               head->ip_summed = CHECKSUM_UNNECESSARY;
>                 } else {
>                         /* no need to seed pseudo checksum for SCTP */
> -                       nskb->ip_summed = CHECKSUM_PARTIAL;
> -                       nskb->csum_start = skb_transport_header(nskb) - nskb->head;
> -                       nskb->csum_offset = offsetof(struct sctphdr, checksum);
> +                       head->ip_summed = CHECKSUM_PARTIAL;
> +                       head->csum_start = skb_transport_header(head) - head->head;
> +                       head->csum_offset = offsetof(struct sctphdr, checksum);
>                 }
>         }
>
> @@ -557,22 +645,7 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>          * Note: The works for IPv6 layer checks this bit too later
>          * in transmission.  See IP6_ECN_flow_xmit().
>          */
> -       tp->af_specific->ecn_capable(nskb->sk);
> -
> -       /* Set up the IP options.  */
> -       /* BUG: not implemented
> -        * For v4 this all lives somewhere in sk->sk_opt...
> -        */
> -
> -       /* Dump that on IP!  */
> -       if (asoc) {
> -               asoc->stats.opackets++;
> -               if (asoc->peer.last_sent_to != tp)
> -                       /* Considering the multiple CPU scenario, this is a
> -                        * "correcter" place for last_sent_to.  --xguo
> -                        */
> -                       asoc->peer.last_sent_to = tp;
> -       }
> +       tp->af_specific->ecn_capable(head->sk);
>
>         if (has_data) {
>                 struct timer_list *timer;
> @@ -589,16 +662,23 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>                 }
>         }
>
> -       pr_debug("***sctp_transmit_packet*** skb->len:%d\n", nskb->len);
> +       pr_debug("***sctp_transmit_packet*** skb->len:%d\n", head->len);
>
> -       nskb->ignore_df = packet->ipfragok;
> -       tp->af_specific->sctp_xmit(nskb, tp);
> +       head->ignore_df = packet->ipfragok;
> +       printk("%s %d %d %d\n", __func__, head->len,
> +              packet->transport->pathmtu,
> +              packet->transport->pathmtu - packet->overhead);
> +       if (gso)
> +               skb_shinfo(head)->gso_type = SKB_GSO_SCTP;
> +       tp->af_specific->sctp_xmit(head, tp);
>
>  out:
>         sctp_packet_reset(packet);
> +       sk_dst_reset(sk); /* FIXME: double check */
>         return err;
>  no_route:
>         kfree_skb(nskb);
> +       kfree_skb(head);
>
>         if (asoc)
>                 IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
> @@ -635,7 +715,7 @@ nomem:
>  static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
>                                            struct sctp_chunk *chunk)
>  {
> -       size_t datasize, rwnd, inflight, flight_size;
> +       size_t datasize, rwnd, inflight, flight_size, maxsize;
>         struct sctp_transport *transport = packet->transport;
>         struct sctp_association *asoc = transport->asoc;
>         struct sctp_outq *q = &asoc->outqueue;
> @@ -705,7 +785,15 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
>         /* Check whether this chunk and all the rest of pending data will fit
>          * or delay in hopes of bundling a full sized packet.
>          */
> -       if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead)
> +       if (packet->ipfragok) {
> +               /* Means chunk won't fit and needs fragmentation at
> +                * transport level, so we can't do GSO.
> +                */
> +               maxsize = transport->pathmtu;
> +       } else {
> +               maxsize = transport->dst->dev->gso_max_size;
> +       }
> +       if (chunk->skb->len + q->out_qlen >= maxsize - packet->overhead)
>                 /* Enough data queued to fill a packet */
>                 return SCTP_XMIT_OK;
>
> @@ -764,6 +852,8 @@ static sctp_xmit_t sctp_packet_will_fit(struct sctp_packet *packet,
>
>         /* Decide if we need to fragment or resubmit later. */
>         if (too_big) {
> +               struct net_device *dev = packet->transport->dst->dev;
> +
>                 /* It's OK to fragment at IP level if any one of the following
>                  * is true:
>                  *      1. The packet is empty (meaning this chunk is greater
> @@ -779,9 +869,11 @@ static sctp_xmit_t sctp_packet_will_fit(struct sctp_packet *packet,
>                          * actually hit this condition
>                          */
>                         packet->ipfragok = 1;
> -               } else {
> +               } else if (psize + chunk_len > dev->gso_max_size - packet->overhead) {
> +                       /* Hit GSO limit, gotta flush */
>                         retval = SCTP_XMIT_PMTU_FULL;
>                 }
> +               /* Otherwise it will fit in the GSO packet */
>         }
>
>         return retval;
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 5ca2ebfe0be83882fcb841de6fa8029b6455ef85..064e5d375e612f2ec745f384d35f0e4c6b96212c 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4001,6 +4001,8 @@ static int sctp_init_sock(struct sock *sk)
>                 return -ESOCKTNOSUPPORT;
>         }
>
> +       sk->sk_gso_type = SKB_GSO_SCTP;
> +
>         /* Initialize default send parameters. These parameters can be
>          * modified with the SCTP_DEFAULT_SEND_PARAM socket option.
>          */
> --
> 2.5.0
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
  2016-01-29 19:15     ` Alexander Duyck
@ 2016-01-29 19:42       ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-29 19:42 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	marek, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > This patch enables SCTP to do GSO.
> >
> > SCTP has this peculiarity that its packets cannot be just segmented to
> > (P)MTU. Its chunks must be contained in IP segments, padding respected.
> > So we can't just generate a big skb, set gso_size to the fragmentation
> > point and deliver it to IP layer.
> >
> > Instead, this patch proposes that SCTP build a skb as it would be if it
> > was received using GRO. That is, there will be a cover skb with the
> > headers (including the SCTP one) and child ones containing the actual SCTP
> > chunks, already segmented in a way that respects SCTP RFCs and MTU.
> >
> > This way SCTP can benefit from GSO and instead of passing several
> > packets through the stack, it can pass a single large packet if there
> > are enough data queued and cwnd allows.
> >
> > Main points that need help:
> > - Usage of skb_gro_receive()
> >   It fits nicely in there and properly handles offsets/lens, though the
> >   name means another thing. If you agree with this usage, we can rename
> >   it to something like skb_coalesce
> >
> > - Checksum handling
> >   Why can only packets with checksum offloaded be GSOed? Most NICs
> >   don't support SCTP CRC offloading, and this would nearly defeat
> >   this feature. If the checksum is being computed in sw, it doesn't really
> >   matter if it's earlier or later, right?
> >   This patch hacks skb_needs_check() to allow using GSO with sw-computed
> >   checksums.
> >   Also the meaning of UNNECESSARY and NONE are quite foggy to me yet and
> >   its usage may be wrong.
> >
> > - gso_size = 1
> >   There is skb_is_gso() all over the stack and it basically checks for
> >   non-zero skb_shinfo(skb)->gso_size. Setting it to 1 is the hacky way I
> >   found to keep skb_is_gso() working while being able to signal to
> >   skb_segment() that it shouldn't use gso_size but instead the fragment
> >   sizes themselves. skb_segment() will mainly just unpack the skb then.
> 
> Instead of 1 why not use 0xFFFF?  It is a value that can never be used
> for a legitimate segment size since IP total length is a 16 bit value
> and includes the IP header in the size.

I just felt that 1 was impractical, though it has no hard restriction
like the one you described for 0xFFFF. I can replace it; 0xFFFF is better.

> > - socket / gso max values
> >   usage of sk_setup_caps() still needs a review
> >
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  include/linux/netdev_features.h |   7 +-
> >  include/linux/netdevice.h       |   1 +
> >  net/core/dev.c                  |   6 +-
> >  net/core/skbuff.c               |  12 +-
> >  net/ipv4/af_inet.c              |   1 +
> >  net/sctp/offload.c              |  53 +++++++
> >  net/sctp/output.c               | 338 +++++++++++++++++++++++++---------------
> >  net/sctp/socket.c               |   2 +
> >  8 files changed, 292 insertions(+), 128 deletions(-)
> >
> > diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> > index d9654f0eecb3519383441afa6b131ff9a5898485..f678998841f1800e0f2fe416a79935197d4ed305 100644
> > --- a/include/linux/netdev_features.h
> > +++ b/include/linux/netdev_features.h
> > @@ -48,8 +48,9 @@ enum {
> >         NETIF_F_GSO_UDP_TUNNEL_BIT,     /* ... UDP TUNNEL with TSO */
> >         NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
> >         NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
> > +       NETIF_F_GSO_SCTP_BIT,           /* ... SCTP fragmentation */
> >         /**/NETIF_F_GSO_LAST =          /* last bit, see GSO_MASK */
> > -               NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
> > +               NETIF_F_GSO_SCTP_BIT,
> >
> >         NETIF_F_FCOE_CRC_BIT,           /* FCoE CRC32 */
> >         NETIF_F_SCTP_CRC_BIT,           /* SCTP checksum offload */
> > @@ -119,6 +120,7 @@ enum {
> >  #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
> >  #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
> >  #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
> > +#define NETIF_F_GSO_SCTP       __NETIF_F(GSO_SCTP)
> >  #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
> >  #define NETIF_F_HW_VLAN_STAG_RX        __NETIF_F(HW_VLAN_STAG_RX)
> >  #define NETIF_F_HW_VLAN_STAG_TX        __NETIF_F(HW_VLAN_STAG_TX)
> > @@ -144,7 +146,8 @@ enum {
> >
> >  /* List of features with software fallbacks. */
> >  #define NETIF_F_GSO_SOFTWARE   (NETIF_F_TSO | NETIF_F_TSO_ECN | \
> > -                                NETIF_F_TSO6 | NETIF_F_UFO)
> > +                                NETIF_F_TSO6 | NETIF_F_UFO | \
> > +                                NETIF_F_GSO_SCTP)
> >
> >  /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
> >   * set in features when NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM are set--
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 289c2314d76668b8357728382bb33d6828617458..ce14fab858bf96dd0f85aca237350c8d8317756e 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3928,6 +3928,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
> >         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
> >         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
> >         BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
> > +       BUILD_BUG_ON(SKB_GSO_SCTP    != (NETIF_F_GSO_SCTP >> NETIF_F_GSO_SHIFT));
> >
> >         return (features & feature) == feature;
> >  }
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
> >  static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
> >  {
> >         if (tx_path)
> > -               return skb->ip_summed != CHECKSUM_PARTIAL;
> > +               /* FIXME: Why only packets with checksum offloading are
> > +                * supported for GSO?
> > +                */
> > +               return skb->ip_summed != CHECKSUM_PARTIAL &&
> > +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
> >         else
> >                 return skb->ip_summed == CHECKSUM_NONE;
> >  }
> 
> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
> transmit path a little while ago.  Please don't reintroduce it.

Can you give me some pointers on that? I cannot find such a change.
skb_needs_check() seems to have been like that since the beginning.

> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 704b69682085dec77f3d0f990aaf0024afd705b9..96f223f8d769d2765fd64348830c76cb222906c8 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -3017,8 +3017,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> >                 int size;
> >
> >                 len = head_skb->len - offset;
> > -               if (len > mss)
> > -                       len = mss;
> > +               if (len > mss) {
> > +                       /* FIXME: A define is surely welcomed, but maybe
> > +                        * shinfo->txflags is better for this flag, but
> > +                        * we need to expand it then
> > +                        */
> > +                       if (mss == 1)
> > +                               len = list_skb->len;
> > +                       else
> > +                               len = mss;
> > +               }
> >
> 
> Using 0xFFFF here as a flag with the MSS value would likely be much
> more readable.

Either way it will be replaced by a define/name instead.

> >                 hsize = skb_headlen(head_skb) - offset;
> >                 if (hsize < 0)
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index 5c5db6636704daa0c49fc13e84b2c5b282a44ed3..ec1c779bb664d1399d74f2bd7016e30b648ce47d 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1220,6 +1220,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> >                        SKB_GSO_UDP_TUNNEL |
> >                        SKB_GSO_UDP_TUNNEL_CSUM |
> >                        SKB_GSO_TUNNEL_REMCSUM |
> > +                      SKB_GSO_SCTP |
> >                        0)))
> >                 goto out;
> >
> > diff --git a/net/sctp/offload.c b/net/sctp/offload.c
> > index 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26 100644
> > --- a/net/sctp/offload.c
> > +++ b/net/sctp/offload.c
> > @@ -36,8 +36,61 @@
> >  #include <net/sctp/checksum.h>
> >  #include <net/protocol.h>
> >
> > +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
> > +{
> > +       skb->ip_summed = CHECKSUM_NONE;
> > +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
> > +}
> > +
> 
> I really despise the naming of this bit here.  SCTP does not use a
> checksum.  It uses a CRC.  Please don't call this a checksum as it
> will just make the code really confusing.   I think the name should be
> something like gso_make_crc32c.

Agreed. SCTP code still references it as 'cksum'. I'll change that in
another patch.

> I think we need to address the CRC issues before we can really get
> into segmentation.  Specifically we need to be able to offload SCTP
> and FCoE in software since they both use the CHECKSUM_PARTIAL value
> and then we can start cleaning up more of this mess and move onto
> segmentation.

Hm? The mess on CRC issues here is caused by this patch alone. It's good
as it is today. And a good part of this mess is caused by trying to GSO
without offloading CRC too.

Or you mean that SCTP and FCoE should stop using CHECKSUM_* at all?

> > +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
> > +                                       netdev_features_t features)
> > +{
> > +       struct sk_buff *segs = ERR_PTR(-EINVAL);
> > +       struct sctphdr *sh;
> > +
> > +       sh = sctp_hdr(skb);
> > +       if (!pskb_may_pull(skb, sizeof(*sh)))
> > +               goto out;
> > +
> > +       __skb_pull(skb, sizeof(*sh));
> > +
> > +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
> > +               /* Packet is from an untrusted source, reset gso_segs. */
> > +               int type = skb_shinfo(skb)->gso_type;
> > +
> > +               if (unlikely(type &
> > +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
> > +                              0) ||
> > +                            !(type & (SKB_GSO_SCTP))))
> > +                       goto out;
> > +
> > +               /* This should not happen as no NIC has SCTP GSO
> > +                * offloading, it's always via software and thus we
> > +                * won't send a large packet down the stack.
> > +                */
> > +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is not supported.");
> > +               goto out;
> > +       }
> > +
> 
> So what you are going to end up needing here is some way to tell the
> hardware that you are doing the checksum no matter what.  There is no
> value in you computing a 1's compliment checksum for the payload if
> you aren't going to use it.  What you can probably do is just clear
> the standard checksum flags and then OR in NETIF_F_HW_CSUM if
> NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
> offloading the checksum.

Interesting, ok

> One other bit that will make this more complicated is if we ever get
> around to supporting SCTP in tunnels.  Then we will need to sort out
> how things like remote checksum offload should impact SCTP, and how to
> deal with needing to compute both a CRC and 1's compliment checksum.
> What we would probably need to do is check for encap_hdr_csum and if
> it is set and we are doing SCTP then we would need to clear the
> NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.

Yup. And that includes on storing pointers to where to store each of it.

> > +       segs = skb_segment(skb, features);
> > +       if (IS_ERR(segs))
> > +               goto out;
> > +
> > +       /* All that is left is update SCTP CRC if necessary */
> > +       for (skb = segs; skb; skb = skb->next) {
> > +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
> > +                       sh = sctp_hdr(skb);
> > +                       sh->checksum = sctp_gso_make_checksum(skb);
> > +               }
> > +       }
> > +
> 
> Okay, so it looks like you are doing the right thing here and leaving
> this as CHECKSUM_PARTIAL.

Actually no then. sctp_gso_make_checksum() replaces it:
+static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
+{
+       skb->ip_summed = CHECKSUM_NONE;
+       return sctp_compute_cksum(skb, skb_transport_offset(skb));

Why again would have to leave it as CHECKSUM_PARTIAL? IP header?

> > +out:
> > +       return segs;
> > +}
> > +
> >  static const struct net_offload sctp_offload = {
> >         .callbacks = {
> > +               .gso_segment = sctp_gso_segment,
> >         },
> >  };

Thanks,
Marcelo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
@ 2016-01-29 19:42       ` Marcelo Ricardo Leitner
  0 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-01-29 19:42 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	marek, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > This patch enables SCTP to do GSO.
> >
> > SCTP has this peculiarity that its packets cannot be just segmented to
> > (P)MTU. Its chunks must be contained in IP segments, padding respected.
> > So we can't just generate a big skb, set gso_size to the fragmentation
> > point and deliver it to IP layer.
> >
> > Instead, this patch proposes that SCTP build a skb as it would be if it
> > was received using GRO. That is, there will be a cover skb with the
> > headers (including SCTP one) and children ones containing the actual SCTP
> > chunks, already segmented in a way that respects SCTP RFCs and MTU.
> >
> > This way SCTP can benefit from GSO and instead of passing several
> > packets through the stack, it can pass a single large packet if there
> > are enough data queued and cwnd allows.
> >
> > Main points that need help:
> > - Usage of skb_gro_receive()
> >   It fits nicely in there and properly handles offsets/lens, though the
> >   name means another thing. If you agree with this usage, we can rename
> >   it to something like skb_coalesce
> >
> > - Checksum handling
> >   Why only packets with checksum offloaded can be GSOed? Most of the
> >   NICs don't support SCTP CRC offloading and this will nearly defeat
> >   this feature. If checksum is being computed in sw, it doesn't really
> >   matter if it's earlier or later, right?
> >   This patch hacks skb_needs_check() to allow using GSO with sw-computed
> >   checksums.
> >   Also the meaning of UNNECESSARY and NONE are quite foggy to me yet and
> >   its usage may be wrong.
> >
> > - gso_size = 1
> >   There is skb_is_gso() all over the stack and it basically checks for
> >   non-zero skb_shinfo(skb)->gso_size. Setting it to 1 is the hacky way I
> >   found to keep skb_is_gso() working while being able to signal to
> >   skb_segment() that it shouldn't use gso_size but instead the fragment
> >   sizes themselves. skb_segment() will mainly just unpack the skb then.
> 
> Instead of 1 why not use 0xFFFF?  It is a value that can never be used
> for a legitimate segment size since IP total length is a 16 bit value
> and includes the IP header in the size.

Just felt that 1 was impractical. But perhaps with no hard restriction
like the one for 0xFFFF you said. I can replace it, 0xFFFF is better.

> > - socket / gso max values
> >   usage of sk_setup_caps() still needs a review
> >
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  include/linux/netdev_features.h |   7 +-
> >  include/linux/netdevice.h       |   1 +
> >  net/core/dev.c                  |   6 +-
> >  net/core/skbuff.c               |  12 +-
> >  net/ipv4/af_inet.c              |   1 +
> >  net/sctp/offload.c              |  53 +++++++
> >  net/sctp/output.c               | 338 +++++++++++++++++++++++++---------------
> >  net/sctp/socket.c               |   2 +
> >  8 files changed, 292 insertions(+), 128 deletions(-)
> >
> > diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> > index d9654f0eecb3519383441afa6b131ff9a5898485..f678998841f1800e0f2fe416a79935197d4ed305 100644
> > --- a/include/linux/netdev_features.h
> > +++ b/include/linux/netdev_features.h
> > @@ -48,8 +48,9 @@ enum {
> >         NETIF_F_GSO_UDP_TUNNEL_BIT,     /* ... UDP TUNNEL with TSO */
> >         NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
> >         NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
> > +       NETIF_F_GSO_SCTP_BIT,           /* ... SCTP fragmentation */
> >         /**/NETIF_F_GSO_LAST =          /* last bit, see GSO_MASK */
> > -               NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
> > +               NETIF_F_GSO_SCTP_BIT,
> >
> >         NETIF_F_FCOE_CRC_BIT,           /* FCoE CRC32 */
> >         NETIF_F_SCTP_CRC_BIT,           /* SCTP checksum offload */
> > @@ -119,6 +120,7 @@ enum {
> >  #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
> >  #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
> >  #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
> > +#define NETIF_F_GSO_SCTP       __NETIF_F(GSO_SCTP)
> >  #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
> >  #define NETIF_F_HW_VLAN_STAG_RX        __NETIF_F(HW_VLAN_STAG_RX)
> >  #define NETIF_F_HW_VLAN_STAG_TX        __NETIF_F(HW_VLAN_STAG_TX)
> > @@ -144,7 +146,8 @@ enum {
> >
> >  /* List of features with software fallbacks. */
> >  #define NETIF_F_GSO_SOFTWARE   (NETIF_F_TSO | NETIF_F_TSO_ECN | \
> > -                                NETIF_F_TSO6 | NETIF_F_UFO)
> > +                                NETIF_F_TSO6 | NETIF_F_UFO | \
> > +                                NETIF_F_GSO_SCTP)
> >
> >  /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
> >   * set in features when NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM are set--
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 289c2314d76668b8357728382bb33d6828617458..ce14fab858bf96dd0f85aca237350c8d8317756e 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3928,6 +3928,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
> >         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
> >         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
> >         BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
> > +       BUILD_BUG_ON(SKB_GSO_SCTP    != (NETIF_F_GSO_SCTP >> NETIF_F_GSO_SHIFT));
> >
> >         return (features & feature) == feature;
> >  }
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
> >  static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
> >  {
> >         if (tx_path)
> > -               return skb->ip_summed != CHECKSUM_PARTIAL;
> > +               /* FIXME: Why only packets with checksum offloading are
> > +                * supported for GSO?
> > +                */
> > +               return skb->ip_summed != CHECKSUM_PARTIAL &&
> > +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
> >         else
> >                 return skb->ip_summed == CHECKSUM_NONE;
> >  }
> 
> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
> transmit path a little while ago.  Please don't reintroduce it.

Can you give me some pointers on that? I cannot find such change.
skb_needs_check() seems to be like that since beginning.

> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 704b69682085dec77f3d0f990aaf0024afd705b9..96f223f8d769d2765fd64348830c76cb222906c8 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -3017,8 +3017,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> >                 int size;
> >
> >                 len = head_skb->len - offset;
> > -               if (len > mss)
> > -                       len = mss;
> > +               if (len > mss) {
> > +                       /* FIXME: A define is surely welcomed, but maybe
> > +                        * shinfo->txflags is better for this flag, but
> > +                        * we need to expand it then
> > +                        */
> > +                       if (mss == 1)
> > +                               len = list_skb->len;
> > +                       else
> > +                               len = mss;
> > +               }
> >
> 
> Using 0xFFFF here as a flag with the MSS value would likely be much
> more readable.

Either way it will be replaced by a define/name instead.

> >                 hsize = skb_headlen(head_skb) - offset;
> >                 if (hsize < 0)
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index 5c5db6636704daa0c49fc13e84b2c5b282a44ed3..ec1c779bb664d1399d74f2bd7016e30b648ce47d 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1220,6 +1220,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> >                        SKB_GSO_UDP_TUNNEL |
> >                        SKB_GSO_UDP_TUNNEL_CSUM |
> >                        SKB_GSO_TUNNEL_REMCSUM |
> > +                      SKB_GSO_SCTP |
> >                        0)))
> >                 goto out;
> >
> > diff --git a/net/sctp/offload.c b/net/sctp/offload.c
> > index 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26 100644
> > --- a/net/sctp/offload.c
> > +++ b/net/sctp/offload.c
> > @@ -36,8 +36,61 @@
> >  #include <net/sctp/checksum.h>
> >  #include <net/protocol.h>
> >
> > +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
> > +{
> > +       skb->ip_summed = CHECKSUM_NONE;
> > +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
> > +}
> > +
> 
> I really despise the naming of this bit here.  SCTP does not use a
> checksum.  It uses a CRC.  Please don't call this a checksum as it
> will just make the code really confusing.   I think the name should be
> something like gso_make_crc32c.

Agreed. SCTP code still references it as 'cksum'. I'll change that in
another patch.

> I think we need to address the CRC issues before we can really get
> into segmentation.  Specifically we need to be able to offload SCTP
> and FCoE in software since they both use the CHECKSUM_PARTIAL value
> and then we can start cleaning up more of this mess and move onto
> segmentation.

Hm? The mess on CRC issues here is caused by this patch alone. It's good
as it is today. And a good part of this mess is caused by trying to GSO
without offloading CRC too.

Or you mean that SCTP and FCoE should stop using CHECKSUM_* at all?

> > +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
> > +                                       netdev_features_t features)
> > +{
> > +       struct sk_buff *segs = ERR_PTR(-EINVAL);
> > +       struct sctphdr *sh;
> > +
> > +       sh = sctp_hdr(skb);
> > +       if (!pskb_may_pull(skb, sizeof(*sh)))
> > +               goto out;
> > +
> > +       __skb_pull(skb, sizeof(*sh));
> > +
> > +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
> > +               /* Packet is from an untrusted source, reset gso_segs. */
> > +               int type = skb_shinfo(skb)->gso_type;
> > +
> > +               if (unlikely(type &
> > +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
> > +                              0) ||
> > +                            !(type & (SKB_GSO_SCTP))))
> > +                       goto out;
> > +
> > +               /* This should not happen as no NIC has SCTP GSO
> > +                * offloading, it's always via software and thus we
> > +                * won't send a large packet down the stack.
> > +                */
> > +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is not supported.");
> > +               goto out;
> > +       }
> > +
> 
> So what you are going to end up needing here is some way to tell the
> hardware that you are doing the checksum no matter what.  There is no
> value in you computing a 1's complement checksum for the payload if
> you aren't going to use it.  What you can probably do is just clear
> the standard checksum flags and then OR in NETIF_F_HW_CSUM if
> NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
> offloading the checksum.

Interesting, ok

> One other bit that will make this more complicated is if we ever get
> around to supporting SCTP in tunnels.  Then we will need to sort out
> how things like remote checksum offload should impact SCTP, and how to
> deal with needing to compute both a CRC and 1's complement checksum.
> What we would probably need to do is check for encap_hdr_csum and if
> it is set and we are doing SCTP then we would need to clear the
> NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.

Yup. And that includes storing pointers to where to store each of them.

> > +       segs = skb_segment(skb, features);
> > +       if (IS_ERR(segs))
> > +               goto out;
> > +
> > +       /* All that is left is update SCTP CRC if necessary */
> > +       for (skb = segs; skb; skb = skb->next) {
> > +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
> > +                       sh = sctp_hdr(skb);
> > +                       sh->checksum = sctp_gso_make_checksum(skb);
> > +               }
> > +       }
> > +
> 
> Okay, so it looks like you are doing the right thing here and leaving
> this as CHECKSUM_PARTIAL.

Actually no then. sctp_gso_make_checksum() replaces it:
+static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
+{
+       skb->ip_summed = CHECKSUM_NONE;
+       return sctp_compute_cksum(skb, skb_transport_offset(skb));

Why again would we have to leave it as CHECKSUM_PARTIAL? IP header?

> > +out:
> > +       return segs;
> > +}
> > +
> >  static const struct net_offload sctp_offload = {
> >         .callbacks = {
> > +               .gso_segment = sctp_gso_segment,
> >         },
> >  };

Thanks,
Marcelo



* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
  2016-01-29 19:42       ` Marcelo Ricardo Leitner
@ 2016-01-30  4:07         ` Alexander Duyck
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexander Duyck @ 2016-01-30  4:07 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	marek, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On Fri, Jan 29, 2016 at 11:42 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
>> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
>> <marcelo.leitner@gmail.com> wrote:
>> > This patch enables SCTP to do GSO.
>> >
>> > SCTP has this peculiarity that its packets cannot be just segmented to
>> > (P)MTU. Its chunks must be contained in IP segments, padding respected.
>> > So we can't just generate a big skb, set gso_size to the fragmentation
>> > point and deliver it to IP layer.
>> >
>> > Instead, this patch proposes that SCTP build a skb as it would be if it
>> > was received using GRO. That is, there will be a cover skb with the
>> > headers (including SCTP one) and children ones containing the actual SCTP
>> > chunks, already segmented in a way that respects SCTP RFCs and MTU.
>> >
>> > This way SCTP can benefit from GSO and instead of passing several
>> > packets through the stack, it can pass a single large packet if there
>> > are enough data queued and cwnd allows.
>> >
>> > Main points that need help:
>> > - Usage of skb_gro_receive()
>> >   It fits nicely in there and properly handles offsets/lens, though the
>> >   name means another thing. If you agree with this usage, we can rename
>> >   it to something like skb_coalesce
>> >
>> > - Checksum handling
>> >   Why only packets with checksum offloaded can be GSOed? Most of the
>> >   NICs don't support SCTP CRC offloading and this will nearly defeat
>> >   this feature. If checksum is being computed in sw, it doesn't really
>> >   matter if it's earlier or later, right?
>> >   This patch hacks skb_needs_check() to allow using GSO with sw-computed
>> >   checksums.
>> >   Also the meaning of UNNECESSARY and NONE are quite foggy to me yet and
>> >   its usage may be wrong.
>> >
>> > - gso_size = 1
>> >   There is skb_is_gso() all over the stack and it basically checks for
>> >   non-zero skb_shinfo(skb)->gso_size. Setting it to 1 is the hacky way I
>> >   found to keep skb_is_gso() working while being able to signal to
>> >   skb_segment() that it shouldn't use gso_size but instead the fragment
>> >   sizes themselves. skb_segment() will mainly just unpack the skb then.
>>
>> Instead of 1 why not use 0xFFFF?  It is a value that can never be used
>> for a legitimate segment size since IP total length is a 16 bit value
>> and includes the IP header in the size.
>
> Just felt that 1 was impractical. But perhaps with no hard restriction
> like the one for 0xFFFF you said. I can replace it, 0xFFFF is better.
>
>> > - socket / gso max values
>> >   usage of sk_setup_caps() still needs a review
>> >
>> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
>> > ---
>> >  include/linux/netdev_features.h |   7 +-
>> >  include/linux/netdevice.h       |   1 +
>> >  net/core/dev.c                  |   6 +-
>> >  net/core/skbuff.c               |  12 +-
>> >  net/ipv4/af_inet.c              |   1 +
>> >  net/sctp/offload.c              |  53 +++++++
>> >  net/sctp/output.c               | 338 +++++++++++++++++++++++++---------------
>> >  net/sctp/socket.c               |   2 +
>> >  8 files changed, 292 insertions(+), 128 deletions(-)
>> >
>> > diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
>> > index d9654f0eecb3519383441afa6b131ff9a5898485..f678998841f1800e0f2fe416a79935197d4ed305 100644
>> > --- a/include/linux/netdev_features.h
>> > +++ b/include/linux/netdev_features.h
>> > @@ -48,8 +48,9 @@ enum {
>> >         NETIF_F_GSO_UDP_TUNNEL_BIT,     /* ... UDP TUNNEL with TSO */
>> >         NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
>> >         NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
>> > +       NETIF_F_GSO_SCTP_BIT,           /* ... SCTP fragmentation */
>> >         /**/NETIF_F_GSO_LAST =          /* last bit, see GSO_MASK */
>> > -               NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
>> > +               NETIF_F_GSO_SCTP_BIT,
>> >
>> >         NETIF_F_FCOE_CRC_BIT,           /* FCoE CRC32 */
>> >         NETIF_F_SCTP_CRC_BIT,           /* SCTP checksum offload */
>> > @@ -119,6 +120,7 @@ enum {
>> >  #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
>> >  #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
>> >  #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
>> > +#define NETIF_F_GSO_SCTP       __NETIF_F(GSO_SCTP)
>> >  #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
>> >  #define NETIF_F_HW_VLAN_STAG_RX        __NETIF_F(HW_VLAN_STAG_RX)
>> >  #define NETIF_F_HW_VLAN_STAG_TX        __NETIF_F(HW_VLAN_STAG_TX)
>> > @@ -144,7 +146,8 @@ enum {
>> >
>> >  /* List of features with software fallbacks. */
>> >  #define NETIF_F_GSO_SOFTWARE   (NETIF_F_TSO | NETIF_F_TSO_ECN | \
>> > -                                NETIF_F_TSO6 | NETIF_F_UFO)
>> > +                                NETIF_F_TSO6 | NETIF_F_UFO | \
>> > +                                NETIF_F_GSO_SCTP)
>> >
>> >  /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
>> >   * set in features when NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM are set--
>> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > index 289c2314d76668b8357728382bb33d6828617458..ce14fab858bf96dd0f85aca237350c8d8317756e 100644
>> > --- a/include/linux/netdevice.h
>> > +++ b/include/linux/netdevice.h
>> > @@ -3928,6 +3928,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
>> >         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
>> >         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
>> >         BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
>> > +       BUILD_BUG_ON(SKB_GSO_SCTP    != (NETIF_F_GSO_SCTP >> NETIF_F_GSO_SHIFT));
>> >
>> >         return (features & feature) == feature;
>> >  }
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
>> >  static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
>> >  {
>> >         if (tx_path)
>> > -               return skb->ip_summed != CHECKSUM_PARTIAL;
>> > +               /* FIXME: Why only packets with checksum offloading are
>> > +                * supported for GSO?
>> > +                */
>> > +               return skb->ip_summed != CHECKSUM_PARTIAL &&
>> > +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
>> >         else
>> >                 return skb->ip_summed == CHECKSUM_NONE;
>> >  }
>>
>> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
>> transmit path a little while ago.  Please don't reintroduce it.
>
> Can you give me some pointers on that? I cannot find such change.
> skb_needs_check() seems to be like that since beginning.

Maybe you need to update your kernel.  All this stuff was changed in
December and has been this way for a little while now.

Commits:
7a6ae71b24905 "net: Elaborate on checksum offload interface description"
253aab0597d9e "fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload"
53692b1de419c "sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC"

The main reason I even noticed it is because of some of the work I did
on the Intel NIC offloads.

>> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> > index 704b69682085dec77f3d0f990aaf0024afd705b9..96f223f8d769d2765fd64348830c76cb222906c8 100644
>> > --- a/net/core/skbuff.c
>> > +++ b/net/core/skbuff.c
>> > @@ -3017,8 +3017,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> >                 int size;
>> >
>> >                 len = head_skb->len - offset;
>> > -               if (len > mss)
>> > -                       len = mss;
>> > +               if (len > mss) {
>> > +                       /* FIXME: A define is surely welcomed, but maybe
>> > +                        * shinfo->txflags is better for this flag, but
>> > +                        * we need to expand it then
>> > +                        */
>> > +                       if (mss == 1)
>> > +                               len = list_skb->len;
>> > +                       else
>> > +                               len = mss;
>> > +               }
>> >
>>
>> Using 0xFFFF here as a flag with the MSS value would likely be much
>> more readable.
>
> Either way it will be replaced by a define/name instead.

Yeah, that would probably be good.

>> >                 hsize = skb_headlen(head_skb) - offset;
>> >                 if (hsize < 0)
>> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>> > index 5c5db6636704daa0c49fc13e84b2c5b282a44ed3..ec1c779bb664d1399d74f2bd7016e30b648ce47d 100644
>> > --- a/net/ipv4/af_inet.c
>> > +++ b/net/ipv4/af_inet.c
>> > @@ -1220,6 +1220,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>> >                        SKB_GSO_UDP_TUNNEL |
>> >                        SKB_GSO_UDP_TUNNEL_CSUM |
>> >                        SKB_GSO_TUNNEL_REMCSUM |
>> > +                      SKB_GSO_SCTP |
>> >                        0)))
>> >                 goto out;
>> >
>> > diff --git a/net/sctp/offload.c b/net/sctp/offload.c
>> > index 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26 100644
>> > --- a/net/sctp/offload.c
>> > +++ b/net/sctp/offload.c
>> > @@ -36,8 +36,61 @@
* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
@ 2016-01-30  4:07         ` Alexander Duyck
  0 siblings, 0 replies; 49+ messages in thread
From: Alexander Duyck @ 2016-01-30  4:07 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	marek, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On Fri, Jan 29, 2016 at 11:42 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
>> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
>> <marcelo.leitner@gmail.com> wrote:
>> > This patch enables SCTP to do GSO.
>> >
>> > SCTP has this peculiarity that its packets cannot be just segmented to
>> > (P)MTU. Its chunks must be contained in IP segments, padding respected.
>> > So we can't just generate a big skb, set gso_size to the fragmentation
>> > point and deliver it to IP layer.
>> >
>> > Instead, this patch proposes that SCTP build a skb as it would be if it
>> > was received using GRO. That is, there will be a cover skb with the
>> > headers (incluing SCTP one) and children ones containing the actual SCTP
>> > chunks, already segmented in a way that respects SCTP RFCs and MTU.
>> >
>> > This way SCTP can benefit from GSO and instead of passing several
>> > packets through the stack, it can pass a single large packet if there
>> > are enough data queued and cwnd allows.
>> >
>> > Main points that need help:
>> > - Usage of skb_gro_receive()
>> >   It fits nicely in there and properly handles offsets/lens, though the
>> >   name means another thing. If you agree with this usage, we can rename
>> >   it to something like skb_coalesce
>> >
>> > - Checksum handling
>> >   Why can only packets with checksum offloaded be GSOed? Most
>> >   NICs don't support SCTP CRC offloading and this would nearly defeat
>> >   this feature. If the checksum is being computed in sw, it doesn't really
>> >   matter if it's earlier or later, right?
>> >   This patch hacks skb_needs_check() to allow using GSO with sw-computed
>> >   checksums.
>> >   Also the meaning of UNNECESSARY and NONE are quite foggy to me yet and
>> >   its usage may be wrong.
>> >
>> > - gso_size = 1
>> >   There is skb_is_gso() all over the stack and it basically checks for
>> >   non-zero skb_shinfo(skb)->gso_size. Setting it to 1 is the hacky way I
>> >   found to keep skb_is_gso() working while being able to signal to
>> >   skb_segment() that it shouldn't use gso_size but instead the fragment
>> >   sizes themselves. skb_segment() will mainly just unpack the skb then.
>>
>> Instead of 1 why not use 0xFFFF?  It is a value that can never be used
>> for a legitimate segment size since IP total length is a 16 bit value
>> and includes the IP header in the size.
>
> Just felt that 1 was unpractical. But perhaps with no hard restriction
> like the one for 0xFFFF you said. I can replace it, 0xFFFF is better.
>
>> > - socket / gso max values
>> >   usage of sk_setup_caps() still needs a review
>> >
>> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
>> > ---
>> >  include/linux/netdev_features.h |   7 +-
>> >  include/linux/netdevice.h       |   1 +
>> >  net/core/dev.c                  |   6 +-
>> >  net/core/skbuff.c               |  12 +-
>> >  net/ipv4/af_inet.c              |   1 +
>> >  net/sctp/offload.c              |  53 +++++++
>> >  net/sctp/output.c               | 338 +++++++++++++++++++++++++---------------
>> >  net/sctp/socket.c               |   2 +
>> >  8 files changed, 292 insertions(+), 128 deletions(-)
>> >
>> > diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
>> > index d9654f0eecb3519383441afa6b131ff9a5898485..f678998841f1800e0f2fe416a79935197d4ed305 100644
>> > --- a/include/linux/netdev_features.h
>> > +++ b/include/linux/netdev_features.h
>> > @@ -48,8 +48,9 @@ enum {
>> >         NETIF_F_GSO_UDP_TUNNEL_BIT,     /* ... UDP TUNNEL with TSO */
>> >         NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
>> >         NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
>> > +       NETIF_F_GSO_SCTP_BIT,           /* ... SCTP fragmentation */
>> >         /**/NETIF_F_GSO_LAST =          /* last bit, see GSO_MASK */
>> > -               NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
>> > +               NETIF_F_GSO_SCTP_BIT,
>> >
>> >         NETIF_F_FCOE_CRC_BIT,           /* FCoE CRC32 */
>> >         NETIF_F_SCTP_CRC_BIT,           /* SCTP checksum offload */
>> > @@ -119,6 +120,7 @@ enum {
>> >  #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
>> >  #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
>> >  #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
>> > +#define NETIF_F_GSO_SCTP       __NETIF_F(GSO_SCTP)
>> >  #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
>> >  #define NETIF_F_HW_VLAN_STAG_RX        __NETIF_F(HW_VLAN_STAG_RX)
>> >  #define NETIF_F_HW_VLAN_STAG_TX        __NETIF_F(HW_VLAN_STAG_TX)
>> > @@ -144,7 +146,8 @@ enum {
>> >
>> >  /* List of features with software fallbacks. */
>> >  #define NETIF_F_GSO_SOFTWARE   (NETIF_F_TSO | NETIF_F_TSO_ECN | \
>> > -                                NETIF_F_TSO6 | NETIF_F_UFO)
>> > +                                NETIF_F_TSO6 | NETIF_F_UFO | \
>> > +                                NETIF_F_GSO_SCTP)
>> >
>> >  /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
>> >   * set in features when NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM are set--
>> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > index 289c2314d76668b8357728382bb33d6828617458..ce14fab858bf96dd0f85aca237350c8d8317756e 100644
>> > --- a/include/linux/netdevice.h
>> > +++ b/include/linux/netdevice.h
>> > @@ -3928,6 +3928,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
>> >         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
>> >         BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
>> >         BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
>> > +       BUILD_BUG_ON(SKB_GSO_SCTP    != (NETIF_F_GSO_SCTP >> NETIF_F_GSO_SHIFT));
>> >
>> >         return (features & feature) == feature;
>> >  }
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
>> >  static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
>> >  {
>> >         if (tx_path)
>> > -               return skb->ip_summed != CHECKSUM_PARTIAL;
>> > +               /* FIXME: Why only packets with checksum offloading are
>> > +                * supported for GSO?
>> > +                */
>> > +               return skb->ip_summed != CHECKSUM_PARTIAL &&
>> > +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
>> >         else
>> >                 return skb->ip_summed == CHECKSUM_NONE;
>> >  }
>>
>> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
>> transmit path a little while ago.  Please don't reintroduce it.
>
> Can you give me some pointers on that? I cannot find such change.
> skb_needs_check() seems to be like that since beginning.

Maybe you need to update your kernel.  All this stuff was changed in
December and has been this way for a little while now.

Commits:
7a6ae71b24905 "net: Elaborate on checksum offload interface description"
253aab0597d9e "fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload"
53692b1de419c "sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC"

The main reason I even noticed it is because of some of the work I did
on the Intel NIC offloads.

>> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> > index 704b69682085dec77f3d0f990aaf0024afd705b9..96f223f8d769d2765fd64348830c76cb222906c8 100644
>> > --- a/net/core/skbuff.c
>> > +++ b/net/core/skbuff.c
>> > @@ -3017,8 +3017,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> >                 int size;
>> >
>> >                 len = head_skb->len - offset;
>> > -               if (len > mss)
>> > -                       len = mss;
>> > +               if (len > mss) {
>> > +                       /* FIXME: A define is surely welcomed, but maybe
>> > +                        * shinfo->txflags is better for this flag, but
>> > +                        * we need to expand it then
>> > +                        */
>> > +                       if (mss == 1)
>> > +                               len = list_skb->len;
>> > +                       else
>> > +                               len = mss;
>> > +               }
>> >
>>
>> Using 0xFFFF here as a flag with the MSS value would likely be much
>> more readable.
>
> Either way it will be replaced by a define/name instead.

Yeah, that would probably be good.

>> >                 hsize = skb_headlen(head_skb) - offset;
>> >                 if (hsize < 0)
>> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>> > index 5c5db6636704daa0c49fc13e84b2c5b282a44ed3..ec1c779bb664d1399d74f2bd7016e30b648ce47d 100644
>> > --- a/net/ipv4/af_inet.c
>> > +++ b/net/ipv4/af_inet.c
>> > @@ -1220,6 +1220,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>> >                        SKB_GSO_UDP_TUNNEL |
>> >                        SKB_GSO_UDP_TUNNEL_CSUM |
>> >                        SKB_GSO_TUNNEL_REMCSUM |
>> > +                      SKB_GSO_SCTP |
>> >                        0)))
>> >                 goto out;
>> >
>> > diff --git a/net/sctp/offload.c b/net/sctp/offload.c
>> > index 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26 100644
>> > --- a/net/sctp/offload.c
>> > +++ b/net/sctp/offload.c
>> > @@ -36,8 +36,61 @@
>> >  #include <net/sctp/checksum.h>
>> >  #include <net/protocol.h>
>> >
>> > +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>> > +{
>> > +       skb->ip_summed = CHECKSUM_NONE;
>> > +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>> > +}
>> > +
>>
>> I really despise the naming of this bit here.  SCTP does not use a
>> checksum.  It uses a CRC.  Please don't call this a checksum as it
>> will just make the code really confusing.   I think the name should be
>> something like gso_make_crc32c.
>
> Agreed. SCTP code still references it as 'cksum'. I'll change that in
> another patch.
>
>> I think we need to address the CRC issues before we can really get
>> into segmentation.  Specifically we need to be able to offload SCTP
>> and FCoE in software since they both use the CHECKSUM_PARTIAL value
>> and then we can start cleaning up more of this mess and move onto
>> segmentation.
>
> Hm? The mess on CRC issues here is caused by this patch alone. It's good
> as it is today. And a good part of this mess is caused by trying to GSO
> without offloading CRC too.
>
> Or you mean that SCTP and FCoE should stop using CHECKSUM_* at all?

Well after Tom's change both SCTP and FCoE use CHECKSUM_PARTIAL.
CHECKSUM_PARTIAL is what is used to indicate to the hardware that a
checksum offload has been requested so that is what is looked for at
the driver level.

My concern with all this is that we should probably be looking at
coming up with a means of offloading this in software when
skb_checksum_help is called.  Right now validate_xmit_skb doesn't have
any understanding of what to do with SCTP or FCoE and will try to just
compute a checksum for them.

>> > +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
>> > +                                       netdev_features_t features)
>> > +{
>> > +       struct sk_buff *segs = ERR_PTR(-EINVAL);
>> > +       struct sctphdr *sh;
>> > +
>> > +       sh = sctp_hdr(skb);
>> > +       if (!pskb_may_pull(skb, sizeof(*sh)))
>> > +               goto out;
>> > +
>> > +       __skb_pull(skb, sizeof(*sh));
>> > +
>> > +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
>> > +               /* Packet is from an untrusted source, reset gso_segs. */
>> > +               int type = skb_shinfo(skb)->gso_type;
>> > +
>> > +               if (unlikely(type &
>> > +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
>> > +                              0) ||
>> > +                            !(type & (SKB_GSO_SCTP))))
>> > +                       goto out;
>> > +
>> > +               /* This should not happen as no NIC has SCTP GSO
>> > +                * offloading, it's always via software and thus we
>> > +                * won't send a large packet down the stack.
>> > +                */
>> > +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is not supported.");
>> > +               goto out;
>> > +       }
>> > +
>>
>> So what you are going to end up needing here is some way to tell the
>> hardware that you are doing the checksum no matter what.  There is no
>> value in you computing a 1's complement checksum for the payload if
>> you aren't going to use it.  What you can probably do is just clear
>> the standard checksum flags and then OR in NETIF_F_HW_CSUM if
>> NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
>> offloading the checksum.
>
> Interesting, ok
>
>> One other bit that will make this more complicated is if we ever get
>> around to supporting SCTP in tunnels.  Then we will need to sort out
>> how things like remote checksum offload should impact SCTP, and how to
>> deal with needing to compute both a CRC and 1's complement checksum.
>> What we would probably need to do is check for encap_hdr_csum and if
>> it is set and we are doing SCTP then we would need to clear the
>> NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.
>
> Yup. And that includes on storing pointers to where to store each of it.

Actually the pointers bit is easy.  The csum_start and csum_offset
values should be set up before you segment the skb and then updated
for each frame after the skb has been segmented.  If nothing else you can
probably take a look at the TCP code tcp_gso_segment and
tcp4_gso_segment for inspiration.  Basically you need to make sure
that you set the ip_summed, csum_start, and csum_offset values for
your first frame before you start segmenting it into multiple frames.

>> > +       segs = skb_segment(skb, features);
>> > +       if (IS_ERR(segs))
>> > +               goto out;
>> > +
>> > +       /* All that is left is update SCTP CRC if necessary */
>> > +       for (skb = segs; skb; skb = skb->next) {
>> > +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
>> > +                       sh = sctp_hdr(skb);
>> > +                       sh->checksum = sctp_gso_make_checksum(skb);
>> > +               }
>> > +       }
>> > +
>>
>> Okay, so it looks like you are doing the right thing here and leaving
>> this as CHECKSUM_PARTIAL.
>
> Actually no then. sctp_gso_make_checksum() replaces it:
> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
> +{
> +       skb->ip_summed = CHECKSUM_NONE;
> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>
> Why again would have to leave it as CHECKSUM_PARTIAL? IP header?

My earlier comment is actually incorrect.  This section is pretty much
broken since CHECKSUM_PARTIAL only reflects a 1's complement checksum
in the case of skb_segment, so whatever the value is, it is worthless.
CHECKSUM_PARTIAL is used to indicate if a given frame needs to be
offloaded.  It is meant to let the device know that it still needs to
compute a checksum or CRC beginning at csum_start and then store the
new value at csum_offset.  However, for skb_segment it actually
refers to a 1's complement checksum, and if it returns CHECKSUM_NONE
it means the result is stored in skb->csum, which would really wreck
things for you since that overlaps your skb->csum_start and
skb->csum_offset values.  I have a patch to change this so that we
update a checksum in the SKB_GSO_CB, but I wasn't planning on
submitting that until net-next opens.

In the case of SCTP you probably don't even need to bother checking
the value since it is meaningless as skb_segment doesn't know how to
do an SCTP checksum anyway.  To that end for now what you could do is
just set NETIF_F_HW_CSUM.  This way skb_segment won't go and try to
compute a 1's complement checksum on the payload since there is no
actual need for it.

One other bit you will need to do is to check the value of
NETIF_F_SCTP_CRC outside of skb_segment.  You might look at how
__skb_udp_tunnel_segment does this to populate its own offload_csum
boolean value, though you would want to use features, not
skb->dev->features, as the latter is a bit of a workaround since features
is stripped down to hw_enc_features in some paths, if I recall correctly.

Once the frames are segmented and if you don't support the offload you
could then call gso_make_crc32c() or whatever you want to name it to
perform the CRC calculation and populate the field.  One question by
the way.  Don't you need to initialize the checksum value to 0 before
you compute it?  I think you might have missed that step when you were
setting this up.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
  2016-01-30  4:07         ` Alexander Duyck
@ 2016-02-01 16:22           ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-02-01 16:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	marek, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

Em 30-01-2016 02:07, Alexander Duyck escreveu:
> On Fri, Jan 29, 2016 at 11:42 AM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
>> On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
>>> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
...
>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>> index 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5 100644
>>>> --- a/net/core/dev.c
>>>> +++ b/net/core/dev.c
>>>> @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
>>>>   static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
>>>>   {
>>>>          if (tx_path)
>>>> -               return skb->ip_summed != CHECKSUM_PARTIAL;
>>>> +               /* FIXME: Why only packets with checksum offloading are
>>>> +                * supported for GSO?
>>>> +                */
>>>> +               return skb->ip_summed != CHECKSUM_PARTIAL &&
>>>> +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
>>>>          else
>>>>                  return skb->ip_summed == CHECKSUM_NONE;
>>>>   }
>>>
>>> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
>>> transmit path a little while ago.  Please don't reintroduce it.
>>
>> Can you give me some pointers on that? I cannot find such change.
>> skb_needs_check() seems to be like that since beginning.
>
> Maybe you need to update your kernel.  All this stuff was changed in
> December and has been this way for a little while now.
>
> Commits:
> 7a6ae71b24905 "net: Elaborate on checksum offload interface description"
> 253aab0597d9e "fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload"
> 53692b1de419c "sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC"
>
> The main reason I even noticed it is because of some of the work I did
> on the Intel NIC offloads.

Ok I have those here, but my need here is different. I want to do GSO 
with packets that won't have CRC offloaded, so I shouldn't use 
CHECKSUM_PARTIAL but something else.

...
>>>> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
>>>> index 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26 100644
>>>> --- a/net/sctp/offload.c
>>>> +++ b/net/sctp/offload.c
>>>> @@ -36,8 +36,61 @@
>>>>   #include <net/sctp/checksum.h>
>>>>   #include <net/protocol.h>
>>>>
>>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>>>> +{
>>>> +       skb->ip_summed = CHECKSUM_NONE;
>>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>>>> +}
>>>> +
>>>
>>> I really despise the naming of this bit here.  SCTP does not use a
>>> checksum.  It uses a CRC.  Please don't call this a checksum as it
>>> will just make the code really confusing.   I think the name should be
>>> something like gso_make_crc32c.
>>
>> Agreed. SCTP code still references it as 'cksum'. I'll change that in
>> another patch.
>>
>>> I think we need to address the CRC issues before we can really get
>>> into segmentation.  Specifically we need to be able to offload SCTP
>>> and FCoE in software since they both use the CHECKSUM_PARTIAL value
>>> and then we can start cleaning up more of this mess and move onto
>>> segmentation.
>>
>> Hm? The mess on CRC issues here is caused by this patch alone. It's good
>> as it is today. And a good part of this mess is caused by trying to GSO
>> without offloading CRC too.
>>
>> Or you mean that SCTP and FCoE should stop using CHECKSUM_* at all?
>
> Well after Tom's change both SCTP and FCoE use CHECKSUM_PARTIAL.
> CHECKSUM_PARTIAL is what is used to indicate to the hardware that a
> checksum offload has been requested so that is what is looked for at
> the driver level.

SCTP was actually already using CHECKSUM_PARTIAL. That patch was just a 
rename in an attempt to make this crc difference more evident. Yet I'll 
continue the rename within sctp code.

> My concern with all this is that we should probably be looking at
> coming up with a means of offloading this in software when
> skb_checksum_help is called.  Right now validate_xmit_skb doesn't have
> any understanding of what to do with SCTP or FCoE and will try to just
> compute a checksum for them.

My worry is placed a bit earlier than that, I think. Currently I just 
cannot do GSO with packets that don't have the checksum/CRC offloaded, 
because validate_xmit_skb() will complain.

As NICs rarely have SCTP CRC offloading capabilities, I'm thinking it 
makes sense to do GSO even without CRC offload. After all, it doesn't 
matter much at which stage we compute the CRC, as we are computing 
it in software anyway.

>>>> +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
>>>> +                                       netdev_features_t features)
>>>> +{
>>>> +       struct sk_buff *segs = ERR_PTR(-EINVAL);
>>>> +       struct sctphdr *sh;
>>>> +
>>>> +       sh = sctp_hdr(skb);
>>>> +       if (!pskb_may_pull(skb, sizeof(*sh)))
>>>> +               goto out;
>>>> +
>>>> +       __skb_pull(skb, sizeof(*sh));
>>>> +
>>>> +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
>>>> +               /* Packet is from an untrusted source, reset gso_segs. */
>>>> +               int type = skb_shinfo(skb)->gso_type;
>>>> +
>>>> +               if (unlikely(type &
>>>> +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
>>>> +                              0) ||
>>>> +                            !(type & (SKB_GSO_SCTP))))
>>>> +                       goto out;
>>>> +
>>>> +               /* This should not happen as no NIC has SCTP GSO
>>>> +                * offloading, it's always via software and thus we
>>>> +                * won't send a large packet down the stack.
>>>> +                */
>>>> +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is not supported.");
>>>> +               goto out;
>>>> +       }
>>>> +
>>>
>>> So what you are going to end up needing here is some way to tell the
>>> hardware that you are doing the checksum no matter what.  There is no
>>> value in you computing a 1's complement checksum for the payload if
>>> you aren't going to use it.  What you can probably do is just clear
>>> the standard checksum flags and then OR in NETIF_F_HW_CSUM if
>>> NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
>>> offloading the checksum.
>>
>> Interesting, ok
>>
>>> One other bit that will make this more complicated is if we ever get
>>> around to supporting SCTP in tunnels.  Then we will need to sort out
>>> how things like remote checksum offload should impact SCTP, and how to
>>> deal with needing to compute both a CRC and 1's complement checksum.
>>> What we would probably need to do is check for encap_hdr_csum and if
>>> it is set and we are doing SCTP then we would need to clear the
>>> NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.
>>
>> Yup. And that includes on storing pointers to where to store each of it.
>
> Actually the pointers bit is easy.  The csum_start and csum_offset
> values should be set up before you segment the skb and then updated
> for each frame after the skb has been segmented.  If nothing else you can
> probably take a look at the TCP code tcp_gso_segment and
> tcp4_gso_segment for inspiration.  Basically you need to make sure
> that you set the ip_summed, csum_start, and csum_offset values for
> your first frame before you start segmenting it into multiple frames.

Ah yes, ok. That's for now, when we are not combining CRC offloading 
with some checksum offloading (tunnels) as well.

>>>> +       segs = skb_segment(skb, features);
>>>> +       if (IS_ERR(segs))
>>>> +               goto out;
>>>> +
>>>> +       /* All that is left is update SCTP CRC if necessary */
>>>> +       for (skb = segs; skb; skb = skb->next) {
>>>> +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
>>>> +                       sh = sctp_hdr(skb);
>>>> +                       sh->checksum = sctp_gso_make_checksum(skb);
>>>> +               }
>>>> +       }
>>>> +
>>>
>>> Okay, so it looks like you are doing the right thing here and leaving
>>> this as CHECKSUM_PARTIAL.
>>
>> Actually no then. sctp_gso_make_checksum() replaces it:
>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>> +{
>> +       skb->ip_summed = CHECKSUM_NONE;
>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>>
>> Why again would have to leave it as CHECKSUM_PARTIAL? IP header?
>
> My earlier comment is actually incorrect.  This section is pretty much
> broken since CHECKSUM_PARTIAL only reflects a 1's complement checksum
> in the case of skb_segment, so whatever the value is, it is worthless.
> CHECKSUM_PARTIAL is used to indicate if a given frame needs to be
> offloaded.  It is meant to let the device know that it still needs to
> compute a checksum or CRC beginning at csum_start and then store the
> new value at csum_offset.  However, for skb_segment it actually
> refers to a 1's complement checksum, and if it returns CHECKSUM_NONE
> it means the result is stored in skb->csum, which would really wreck
> things for you since that overlaps your skb->csum_start and
> skb->csum_offset values.  I have a patch to change this so that we
> update a checksum in the SKB_GSO_CB, but I wasn't planning on
> submitting that until net-next opens.

sctp currently ignores skb->csum. It doesn't mess with the crc but 
computing it is at least not optimal, yes.

> In the case of SCTP you probably don't even need to bother checking
> the value since it is meaningless as skb_segment doesn't know how to
> do an SCTP checksum anyway.  To that end for now what you could do is
> just set NETIF_F_HW_CSUM.  This way skb_segment won't go and try to
> compute a 1's compliment checksum on the payload since there is no
> actual need for it.

Nice, ok.

> One other bit you will need to do is to check the value of SCTP_CRC
> outside of skb_segment.  You might look at how
> __skb_udp_tunnel_segment does this to populate its own offload_csum
> boolean value, though you would want to use features, not
> skb->dev->features as that is a bit of a workaround since features is
> stripped by hw_enc_features in some paths if I recall correctly.
 >
> Once the frames are segmented and if you don't support the offload you
> could then call gso_make_crc32c() or whatever you want to name it to
> perform the CRC calculation and populate the field.  One question by

Hmmm.. does it mean that we can use CHECKSUM_PARTIAL then even if CRC 
offloading is not possible then? Because the packet will not be 
offloaded in the end, yes, but this solves my questions above. Then 
while doing GSO, it re-evaluates if it can offload crc or not?

> the way.  Don't you need to initialize the checksum value to 0 before
> you compute it?  I think you might have missed that step when you were
> setting this up.

It's fine :) sctp_compute_cksum will replace it with zeroes, calculate, 
and put back the old value, which then we overwrite with the new one at 
sctp_gso_segment.

Thanks,
Marcelo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
@ 2016-02-01 16:22           ` Marcelo Ricardo Leitner
  0 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-02-01 16:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	marek, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On 30-01-2016 02:07, Alexander Duyck wrote:
> On Fri, Jan 29, 2016 at 11:42 AM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
>> On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
>>> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
...
>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>> index 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5 100644
>>>> --- a/net/core/dev.c
>>>> +++ b/net/core/dev.c
>>>> @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
>>>>   static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
>>>>   {
>>>>          if (tx_path)
>>>> -               return skb->ip_summed != CHECKSUM_PARTIAL;
>>>> +               /* FIXME: Why only packets with checksum offloading are
>>>> +                * supported for GSO?
>>>> +                */
>>>> +               return skb->ip_summed != CHECKSUM_PARTIAL &&
>>>> +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
>>>>          else
>>>>                  return skb->ip_summed == CHECKSUM_NONE;
>>>>   }
>>>
>>> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
>>> transmit path a little while ago.  Please don't reintroduce it.
>>
>> Can you give me some pointers on that? I cannot find such change.
>> skb_needs_check() seems to be like that since beginning.
>
> Maybe you need to update your kernel.  All this stuff was changed in
> December and has been this way for a little while now.
>
> Commits:
> 7a6ae71b24905 "net: Elaborate on checksum offload interface description"
> 253aab0597d9e "fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload"
> 53692b1de419c "sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC"
>
> The main reason I even noticed it is because of some of the work I did
> on the Intel NIC offloads.

Ok, I have those here, but my need is different. I want to do GSO 
with packets that won't have the CRC offloaded, so I shouldn't use 
CHECKSUM_PARTIAL but something else.

...
>>>> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
>>>> index 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26 100644
>>>> --- a/net/sctp/offload.c
>>>> +++ b/net/sctp/offload.c
>>>> @@ -36,8 +36,61 @@
>>>>   #include <net/sctp/checksum.h>
>>>>   #include <net/protocol.h>
>>>>
>>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>>>> +{
>>>> +       skb->ip_summed = CHECKSUM_NONE;
>>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>>>> +}
>>>> +
>>>
>>> I really despise the naming of this bit here.  SCTP does not use a
>>> checksum.  It uses a CRC.  Please don't call this a checksum as it
>>> will just make the code really confusing.   I think the name should be
>>> something like gso_make_crc32c.
>>
>> Agreed. SCTP code still references it as 'cksum'. I'll change that in
>> another patch.
>>
>>> I think we need to address the CRC issues before we can really get
>>> into segmentation.  Specifically we need to be able to offload SCTP
>>> and FCoE in software since they both use the CHECKSUM_PARTIAL value
>>> and then we can start cleaning up more of this mess and move onto
>>> segmentation.
>>
>> Hm? The mess on CRC issues here is caused by this patch alone. It's good
>> as it is today. And a good part of this mess is caused by trying to GSO
>> without offloading CRC too.
>>
>> Or you mean that SCTP and FCoE should stop using CHECKSUM_* at all?
>
> Well after Tom's change both SCTP and FCoE use CHECKSUM_PARTIAL.
> CHECKSUM_PARTIAL is what is used to indicate to the hardware that a
> checksum offload has been requested so that is what is looked for at
> the driver level.

SCTP was actually already using CHECKSUM_PARTIAL. That patch was just a 
rename, in an attempt to make this CRC difference more evident. Still, 
I'll continue the rename within the sctp code.

> My concern with all this is that we should probably be looking at
> coming up with a means of offloading this in software when
> skb_checksum_help is called.  Right now validate_xmit_skb doesn't have
> any understanding of what to do with SCTP or FCoE and will try to just
> compute a checksum for them.

My worry comes a bit earlier than that, I think. Currently I just 
cannot do GSO with packets that don't have the checksum/CRC offloaded 
too, because validate_xmit_skb() will complain.

As few NICs have SCTP CRC offloading capabilities, I'm thinking it 
makes sense to do GSO even without the CRC offloaded. After all, it 
doesn't matter much at which stage we compute the CRC, as we are 
computing it anyway.

>>>> +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
>>>> +                                       netdev_features_t features)
>>>> +{
>>>> +       struct sk_buff *segs = ERR_PTR(-EINVAL);
>>>> +       struct sctphdr *sh;
>>>> +
>>>> +       sh = sctp_hdr(skb);
>>>> +       if (!pskb_may_pull(skb, sizeof(*sh)))
>>>> +               goto out;
>>>> +
>>>> +       __skb_pull(skb, sizeof(*sh));
>>>> +
>>>> +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
>>>> +               /* Packet is from an untrusted source, reset gso_segs. */
>>>> +               int type = skb_shinfo(skb)->gso_type;
>>>> +
>>>> +               if (unlikely(type &
>>>> +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
>>>> +                              0) ||
>>>> +                            !(type & (SKB_GSO_SCTP))))
>>>> +                       goto out;
>>>> +
>>>> +               /* This should not happen as no NIC has SCTP GSO
>>>> +                * offloading, it's always via software and thus we
>>>> +                * won't send a large packet down the stack.
>>>> +                */
>>>> +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is not supported.");
>>>> +               goto out;
>>>> +       }
>>>> +
>>>
>>> So what you are going to end up needing here is some way to tell the
>>> hardware that you are doing the checksum no matter what.  There is no
>>> value in you computing a 1's complement checksum for the payload if
>>> you aren't going to use it.  What you can probably do is just clear
>>> the standard checksum flags and then OR in NETIF_F_HW_CSUM if
>>> NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
>>> offloading the checksum.
>>
>> Interesting, ok
>>
>>> One other bit that will make this more complicated is if we ever get
>>> around to supporting SCTP in tunnels.  Then we will need to sort out
>>> how things like remote checksum offload should impact SCTP, and how to
>>> deal with needing to compute both a CRC and 1's complement checksum.
>>> What we would probably need to do is check for encap_hdr_csum and if
>>> it is set and we are doing SCTP then we would need to clear the
>>> NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.
>>
>> Yup. And that includes on storing pointers to where to store each of it.
>
> Actually the pointers bit is easy.  The csum_start and csum_offset
> values should be set up after you have segmented the skb and should be
> updated after the skb has been segmented.  If nothing else you can
> probably take a look at the TCP code tcp_gso_segment and
> tcp4_gso_segment for inspiration.  Basically you need to make sure
> that you set the ip_summed, csum_start, and csum_offset values for
> your first frame before you start segmenting it into multiple frames.

Ah yes, ok, that's for now, i.e. when we are not combining CRC 
offloading with checksum offloading (tunnels).

>>>> +       segs = skb_segment(skb, features);
>>>> +       if (IS_ERR(segs))
>>>> +               goto out;
>>>> +
>>>> +       /* All that is left is update SCTP CRC if necessary */
>>>> +       for (skb = segs; skb; skb = skb->next) {
>>>> +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
>>>> +                       sh = sctp_hdr(skb);
>>>> +                       sh->checksum = sctp_gso_make_checksum(skb);
>>>> +               }
>>>> +       }
>>>> +
>>>
>>> Okay, so it looks like you are doing the right thing here and leaving
>>> this as CHECKSUM_PARTIAL.
>>
>> Actually no then. sctp_gso_make_checksum() replaces it:
>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>> +{
>> +       skb->ip_summed = CHECKSUM_NONE;
>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>>
>> Why again would have to leave it as CHECKSUM_PARTIAL? IP header?
>
> My earlier comment is actually incorrect.  This section is pretty much
> broken since CHECKSUM_PARTIAL only reflects a 1's complement checksum
> in the case of skb_segment, so whatever the value is, it is worthless.
> CHECKSUM_PARTIAL is used to indicate if a given frame needs to be
> offloaded.  It is meant to let the device know that it still needs to
> compute a checksum or CRC beginning at csum_start and then storing the
> new value at csum_offset.  However for skb_segment it is actually
> referring to a 1's complement checksum and if it returns CHECKSUM_NONE
> it means it is stored in skb->csum which would really wreck things for
> you since that was your skb->csum_start and skb->csum_offset values.
> I have a patch to change this so that we update a checksum in the
> SKB_GSO_CB, but I wasn't planning on submitting that until net-next
> opens.

sctp currently ignores skb->csum. It doesn't mess with the CRC, but 
computing it this way is at least not optimal, yes.

> In the case of SCTP you probably don't even need to bother checking
> the value since it is meaningless as skb_segment doesn't know how to
> do an SCTP checksum anyway.  To that end for now what you could do is
> just set NETIF_F_HW_CSUM.  This way skb_segment won't go and try to
> compute a 1's complement checksum on the payload since there is no
> actual need for it.

Nice, ok.

> One other bit you will need to do is to check the value of SCTP_CRC
> outside of skb_segment.  You might look at how
> __skb_udp_tunnel_segment does this to populate its own offload_csum
> boolean value, though you would want to use features, not
> skb->dev->features as that is a bit of a workaround since features is
> stripped by hw_enc_features in some paths if I recall correctly.
>
> Once the frames are segmented and if you don't support the offload you
> could then call gso_make_crc32c() or whatever you want to name it to
> perform the CRC calculation and populate the field.  One question by

Hmmm.. does it mean that we can use CHECKSUM_PARTIAL even if CRC 
offloading is not possible? The packet will not be offloaded in the 
end, yes, but that solves my questions above. Then, while doing GSO, 
it re-evaluates whether it can offload the CRC or not?

> the way.  Don't you need to initialize the checksum value to 0 before
> you compute it?  I think you might have missed that step when you were
> setting this up.

It's fine :) sctp_compute_cksum will replace it with zeroes, calculate, 
and put back the old value, which we then overwrite with the new one in 
sctp_gso_segment.

Thanks,
Marcelo


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
  2016-02-01 16:22           ` Marcelo Ricardo Leitner
@ 2016-02-01 17:03             ` Alexander Duyck
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexander Duyck @ 2016-02-01 17:03 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On Mon, Feb 1, 2016 at 8:22 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On 30-01-2016 02:07, Alexander Duyck wrote:
>>
>> On Fri, Jan 29, 2016 at 11:42 AM, Marcelo Ricardo Leitner
>> <marcelo.leitner@gmail.com> wrote:
>>>
>>> On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
>>>>
>>>> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
>
> ...
>
>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>>> index
>>>>> 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5
>>>>> 100644
>>>>> --- a/net/core/dev.c
>>>>> +++ b/net/core/dev.c
>>>>> @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
>>>>>   static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
>>>>>   {
>>>>>          if (tx_path)
>>>>> -               return skb->ip_summed != CHECKSUM_PARTIAL;
>>>>> +               /* FIXME: Why only packets with checksum offloading are
>>>>> +                * supported for GSO?
>>>>> +                */
>>>>> +               return skb->ip_summed != CHECKSUM_PARTIAL &&
>>>>> +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
>>>>>          else
>>>>>                  return skb->ip_summed == CHECKSUM_NONE;
>>>>>   }
>>>>
>>>>
>>>> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
>>>> transmit path a little while ago.  Please don't reintroduce it.
>>>
>>>
>>> Can you give me some pointers on that? I cannot find such change.
>>> skb_needs_check() seems to be like that since beginning.
>>
>>
>> Maybe you need to update your kernel.  All this stuff was changed in
>> December and has been this way for a little while now.
>>
>> Commits:
>> 7a6ae71b24905 "net: Elaborate on checksum offload interface description"
>> 253aab0597d9e "fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload"
>> 53692b1de419c "sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC"
>>
>> The main reason I even noticed it is because of some of the work I did
>> on the Intel NIC offloads.
>
>
> Ok I have those here, but my need here is different. I want to do GSO with
> packets that won't have CRC offloaded, so I shouldn't use CHECKSUM_PARTIAL
> but something else.

CHECKSUM_NONE if you don't want to have any of the CRC or checksums
offloaded.  However as I mentioned before you will want to fake it
then, since skb_segment assumes it is doing a 1's complement checksum,
so you will want to pass NETIF_F_HW_CSUM as a feature flag and then set
CHECKSUM_NONE after the frame has been segmented.

> ...
>
>>>>> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
>>>>> index
>>>>> 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26
>>>>> 100644
>>>>> --- a/net/sctp/offload.c
>>>>> +++ b/net/sctp/offload.c
>>>>> @@ -36,8 +36,61 @@
>>>>>   #include <net/sctp/checksum.h>
>>>>>   #include <net/protocol.h>
>>>>>
>>>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>>>>> +{
>>>>> +       skb->ip_summed = CHECKSUM_NONE;
>>>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>>>>> +}
>>>>> +
>>>>
>>>>
>>>> I really despise the naming of this bit here.  SCTP does not use a
>>>> checksum.  It uses a CRC.  Please don't call this a checksum as it
>>>> will just make the code really confusing.   I think the name should be
>>>> something like gso_make_crc32c.
>>>
>>>
>>> Agreed. SCTP code still references it as 'cksum'. I'll change that in
>>> another patch.
>>>
>>>> I think we need to address the CRC issues before we can really get
>>>> into segmentation.  Specifically we need to be able to offload SCTP
>>>> and FCoE in software since they both use the CHECKSUM_PARTIAL value
>>>> and then we can start cleaning up more of this mess and move onto
>>>> segmentation.
>>>
>>>
>>> Hm? The mess on CRC issues here is caused by this patch alone. It's good
>>> as it is today. And a good part of this mess is caused by trying to GSO
>>> without offloading CRC too.
>>>
>>> Or you mean that SCTP and FCoE should stop using CHECKSUM_* at all?
>>
>>
>> Well after Tom's change both SCTP and FCoE use CHECKSUM_PARTIAL.
>> CHECKSUM_PARTIAL is what is used to indicate to the hardware that a
>> checksum offload has been requested so that is what is looked for at
>> the driver level.
>
>
> SCTP was actually already using CHECKSUM_PARTIAL. That patch was just a
> rename in an attempt to make this crc difference more evident. Yet I'll
> continue the rename within sctp code.

Yeah it was FCoE that was doing something different.

>> My concern with all this is that we should probably be looking at
>> coming up with a means of offloading this in software when
>> skb_checksum_help is called.  Right now validate_xmit_skb doesn't have
>> any understanding of what to do with SCTP or FCoE and will try to just
>> compute a checksum for them.
>
>
> My worry is placed a bit earlier than that, I think. Currently I just cannot
> do GSO with packets that don't have checksum/crc offloaded too because
> validate_xmit_skb() will complain.

That is probably because you are passing CHECKSUM_PARTIAL instead of
CHECKSUM_NONE.

> As NICs hardly have sctp crc offloading capabilities, I'm thinking it makes
> sense to do GSO even without crc offloaded. After all, it doesn't matter
> much in which stage we are computing the crc as we are computing it anyway.

Agreed.  You will need to support CHECKSUM_PARTIAL being passed to a
device that doesn't support SCTP first.  That way you can start
looking at just always setting CHECKSUM_PARTIAL in the transport layer
which is really needed if you want to do SCO (SCTP Segmentation
Offload) in the first place.  Once you have that you could then start
looking at doing the SCO since from that point on you should already
be in good shape to address those type of issues.  You should probably
use the csum_offset value in the skb in order to flag if this is
possibly SCTP.  As far as I know for now there shouldn't be any other
protocols that are using the same offset, and if needed you can
actually parse the headers to verify if the frame is actually SCTP.

>>>>> +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
>>>>> +                                       netdev_features_t features)
>>>>> +{
>>>>> +       struct sk_buff *segs = ERR_PTR(-EINVAL);
>>>>> +       struct sctphdr *sh;
>>>>> +
>>>>> +       sh = sctp_hdr(skb);
>>>>> +       if (!pskb_may_pull(skb, sizeof(*sh)))
>>>>> +               goto out;
>>>>> +
>>>>> +       __skb_pull(skb, sizeof(*sh));
>>>>> +
>>>>> +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
>>>>> +               /* Packet is from an untrusted source, reset gso_segs.
>>>>> */
>>>>> +               int type = skb_shinfo(skb)->gso_type;
>>>>> +
>>>>> +               if (unlikely(type &
>>>>> +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
>>>>> +                              0) ||
>>>>> +                            !(type & (SKB_GSO_SCTP))))
>>>>> +                       goto out;
>>>>> +
>>>>> +               /* This should not happen as no NIC has SCTP GSO
>>>>> +                * offloading, it's always via software and thus we
>>>>> +                * won't send a large packet down the stack.
>>>>> +                */
>>>>> +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is
>>>>> not supported.");
>>>>> +               goto out;
>>>>> +       }
>>>>> +
>>>>
>>>>
>>>> So what you are going to end up needing here is some way to tell the
>>>> hardware that you are doing the checksum no matter what.  There is no
>>>> value in you computing a 1's complement checksum for the payload if
>>>> you aren't going to use it.  What you can probably do is just clear
>>>> the standard checksum flags and then OR in NETIF_F_HW_CSUM if
>>>> NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
>>>> offloading the checksum.
>>>
>>>
>>> Interesting, ok
>>>
>>>> One other bit that will make this more complicated is if we ever get
>>>> around to supporting SCTP in tunnels.  Then we will need to sort out
>>>> how things like remote checksum offload should impact SCTP, and how to
>>>> deal with needing to compute both a CRC and 1's complement checksum.
>>>> What we would probably need to do is check for encap_hdr_csum and if
>>>> it is set and we are doing SCTP then we would need to clear the
>>>> NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.
>>>
>>>
>>> Yup. And that includes on storing pointers to where to store each of it.
>>
>>
>> Actually the pointers bit is easy.  The csum_start and csum_offset
>> values should be set up after you have segmented the skb and should be
>> updated after the skb has been segmented.  If nothing else you can
>> probably take a look at the TCP code tcp_gso_segment and
>> tcp4_gso_segment for inspiration.  Basically you need to make sure
>> that you set the ip_summed, csum_start, and csum_offset values for
>> your first frame before you start segmenting it into multiple frames.
>
>
> Ah yes, ok, that's for now, when not doing crc offloading with some chksum
> offloading (tunnel) too.

Actually that would be regardless of tunnel offloading.  We don't
store the outer checksum offsets.  If we need outer checksum we
restore them after the fact since the inner checksum offsets are
needed as part of the inner header TCP checksum computation.

>>>>> +       segs = skb_segment(skb, features);
>>>>> +       if (IS_ERR(segs))
>>>>> +               goto out;
>>>>> +
>>>>> +       /* All that is left is update SCTP CRC if necessary */
>>>>> +       for (skb = segs; skb; skb = skb->next) {
>>>>> +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
>>>>> +                       sh = sctp_hdr(skb);
>>>>> +                       sh->checksum = sctp_gso_make_checksum(skb);
>>>>> +               }
>>>>> +       }
>>>>> +
>>>>
>>>>
>>>> Okay, so it looks like you are doing the right thing here and leaving
>>>> this as CHECKSUM_PARTIAL.
>>>
>>>
>>> Actually no then. sctp_gso_make_checksum() replaces it:
>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>>> +{
>>> +       skb->ip_summed = CHECKSUM_NONE;
>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>>>
>>> Why again would have to leave it as CHECKSUM_PARTIAL? IP header?
>>
>>
>> My earlier comment is actually incorrect.  This section is pretty much
>> broken since CHECKSUM_PARTIAL only reflects a 1's complement checksum
>> in the case of skb_segment, so whatever the value is, it is worthless.
>> CHECKSUM_PARTIAL is used to indicate if a given frame needs to be
>> offloaded.  It is meant to let the device know that it still needs to
>> compute a checksum or CRC beginning at csum_start and then storing the
>> new value at csum_offset.  However for skb_segment it is actually
>> referring to a 1's complement checksum and if it returns CHECKSUM_NONE
>> it means it is stored in skb->csum which would really wreck things for
>> you since that was your skb->csum_start and skb->csum_offset values.
>> I have a patch to change this so that we update a checksum in the
>> SKB_GSO_CB, but I wasn't planning on submitting that until net-next
>> opens.
>
>
> sctp currently ignores skb->csum. It doesn't mess with the crc but computing
> it is at least not optimal, yes.

Actually sctp sets csum_start and csum_offset if it sets
CHECKSUM_PARTIAL.  So it does mess with skb->csum since it is
contained in a union with those two fields.

>> In the case of SCTP you probably don't even need to bother checking
>> the value since it is meaningless as skb_segment doesn't know how to
>> do an SCTP checksum anyway.  To that end for now what you could do is
>> just set NETIF_F_HW_CSUM.  This way skb_segment won't go and try to
>> compute a 1's complement checksum on the payload since there is no
>> actual need for it.
>
>
> Nice, ok.
>
>> One other bit you will need to do is to check the value of SCTP_CRC
>> outside of skb_segment.  You might look at how
>> __skb_udp_tunnel_segment does this to populate its own offload_csum
>> boolean value, though you would want to use features, not
>> skb->dev->features as that is a bit of a workaround since features is
>> stripped by hw_enc_features in some paths if I recall correctly.
>
>>
>>
>> Once the frames are segmented and if you don't support the offload you
>> could then call gso_make_crc32c() or whatever you want to name it to
>> perform the CRC calculation and populate the field.  One question by
>
>
> Hmmm.. does it mean that we can use CHECKSUM_PARTIAL even if CRC
> offloading is not possible? Because the packet will not be offloaded in
> the end, yes, but this solves my questions above. Then while doing GSO, it
> re-evaluates if it can offload crc or not?

If you compute the CRC, you set CHECKSUM_NONE; if you want the device
to do it on transmit, you should set CHECKSUM_PARTIAL.

>> the way.  Don't you need to initialize the checksum value to 0 before
>> you compute it?  I think you might have missed that step when you were
>> setting this up.
>
>
> It's fine :) sctp_compute_cksum will replace it with zeroes, calculate, and
> put back the old value, which then we overwrite with the new one at
> sctp_gso_segment.

Right, but there are scenarios where this will be offloaded, aren't
there?  You would probably be better off setting the CRC to 0 before
you start segmentation; that way you can either just set
csum_offset, csum_start, and ip_summed if the lower device supports
SCTP CRC offload, or otherwise compute it without needing
to write the 0 into the header.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
@ 2016-02-01 17:03             ` Alexander Duyck
  0 siblings, 0 replies; 49+ messages in thread
From: Alexander Duyck @ 2016-02-01 17:03 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On Mon, Feb 1, 2016 at 8:22 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> Em 30-01-2016 02:07, Alexander Duyck escreveu:
>>
>> On Fri, Jan 29, 2016 at 11:42 AM, Marcelo Ricardo Leitner
>> <marcelo.leitner@gmail.com> wrote:
>>>
>>> On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
>>>>
>>>> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
>
> ...
>
>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>>> index
>>>>> 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5
>>>>> 100644
>>>>> --- a/net/core/dev.c
>>>>> +++ b/net/core/dev.c
>>>>> @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
>>>>>   static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
>>>>>   {
>>>>>          if (tx_path)
>>>>> -               return skb->ip_summed != CHECKSUM_PARTIAL;
>>>>> +               /* FIXME: Why only packets with checksum offloading are
>>>>> +                * supported for GSO?
>>>>> +                */
>>>>> +               return skb->ip_summed != CHECKSUM_PARTIAL &&
>>>>> +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
>>>>>          else
>>>>>                  return skb->ip_summed = CHECKSUM_NONE;
>>>>>   }
>>>>
>>>>
>>>> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
>>>> transmit path a little while ago.  Please don't reintroduce it.
>>>
>>>
>>> Can you give me some pointers on that? I cannot find such change.
>>> skb_needs_check() seems to be like that since beginning.
>>
>>
>> Maybe you need to update your kernel.  All this stuff was changed in
>> December and has been this way for a little while now.
>>
>> Commits:
>> 7a6ae71b24905 "net: Elaborate on checksum offload interface description"
>> 253aab0597d9e "fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload"
>> 53692b1de419c "sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC"
>>
>> The main reason I even noticed it is because of some of the work I did
>> on the Intel NIC offloads.
>
>
> Ok I have those here, but my need here is different. I want to do GSO with
> packets that won't have CRC offloaded, so I shouldn't use CHECKSUM_PARTIAL
> but something else.

CHECKSUM_NONE if you don't want to have any of the CRC or checksums
offloaded.  However as I mentioned before you will want to fake it
then since skb_segment assumes it is doing a 1's compliment checksum
so you will want to pass NET_F_HW_CSUM as a feature flag and then set
CHECKSUM_NONE after the frame has been segmented.

> ...
>
>>>>> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
>>>>> index
>>>>> 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26
>>>>> 100644
>>>>> --- a/net/sctp/offload.c
>>>>> +++ b/net/sctp/offload.c
>>>>> @@ -36,8 +36,61 @@
>>>>>   #include <net/sctp/checksum.h>
>>>>>   #include <net/protocol.h>
>>>>>
>>>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>>>>> +{
>>>>> +       skb->ip_summed = CHECKSUM_NONE;
>>>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>>>>> +}
>>>>> +
>>>>
>>>>
>>>> I really despise the naming of this bit here.  SCTP does not use a
>>>> checksum.  It uses a CRC.  Please don't call this a checksum as it
>>>> will just make the code really confusing.   I think the name should be
>>>> something like gso_make_crc32c.
>>>
>>>
>>> Agreed. SCTP code still references it as 'cksum'. I'll change that in
>>> another patch.
>>>
>>>> I think we need to address the CRC issues before we can really get
>>>> into segmentation.  Specifically we need to be able to offload SCTP
>>>> and FCoE in software since they both use the CHECKSUM_PARTIAL value
>>>> and then we can start cleaning up more of this mess and move onto
>>>> segmentation.
>>>
>>>
>>> Hm? The mess on CRC issues here is caused by this patch alone. It's good
>>> as it is today. And a good part of this mess is caused by trying to GSO
>>> without offloading CRC too.
>>>
>>> Or you mean that SCTP and FCoE should stop using CHECKSUM_* at all?
>>
>>
>> Well after Tom's change both SCTP and FCoE use CHECKSUM_PARTIAL.
>> CHECKSUM_PARTIAL is what is used to indicate to the hardware that a
>> checksum offload has been requested so that is what is looked for at
>> the driver level.
>
>
> SCTP was actually already using CHECKSUM_PARTIAL. That patch was just a
> rename in an attempt to make this crc difference more evident. Yet I'll
> continue the rename within sctp code.

Yeah it was FCoE that was doing something different.

>> My concern with all this is that we should probably be looking at
>> coming up with a means of offloading this in software when
>> skb_checksum_help is called.  Right now validate_xmit_skb doesn't have
>> any understanding of what to do with SCTP or FCoE and will try to just
>> compute a checksum for them.
>
>
> My worry is placed a bit earlier than that, I think. Currently I just
> cannot do GSO with packets that don't have checksum/CRC offloaded,
> because validate_xmit_skb() will complain.

That is probably because you are passing CHECKSUM_PARTIAL instead of
CHECKSUM_NONE.

> As NICs rarely have SCTP CRC offloading capabilities, I'm thinking it
> makes sense to do GSO even without CRC offloaded. After all, it doesn't
> matter much at which stage we compute the CRC, as we are computing it
> anyway.

Agreed.  You will need to support CHECKSUM_PARTIAL being passed to a
device that doesn't support SCTP first.  That way you can start
looking at just always setting CHECKSUM_PARTIAL in the transport layer
which is really needed if you want to do SCO (SCTP Segmentation
Offload) in the first place.  Once you have that you could then start
looking at doing the SCO since from that point on you should already
>> be in good shape to address those types of issues.  You should probably
use the csum_offset value in the skb in order to flag if this is
possibly SCTP.  As far as I know for now there shouldn't be any other
protocols that are using the same offset, and if needed you can
actually parse the headers to verify if the frame is actually SCTP.
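
The csum_offset heuristic works because the SCTP checksum field sits at a
fixed offset in the common header; a userspace sketch with a struct
mirroring the kernel's include/linux/sctp.h layout (the helper name is
illustrative, not an existing kernel function):

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the kernel's struct sctphdr (include/linux/sctp.h).
 * csum_offset is relative to csum_start (the transport header), so
 * csum_offset == 8 is a strong hint that the frame is SCTP: TCP's
 * checksum lives at offset 16 and UDP's at offset 6, so the values
 * don't collide for the common transports. */
struct sctphdr {
	uint16_t source;
	uint16_t dest;
	uint32_t vtag;
	uint32_t checksum;
};

/* Hypothetical helper sketching the heuristic described above. */
static int looks_like_sctp(uint16_t csum_offset)
{
	return csum_offset == offsetof(struct sctphdr, checksum);
}
```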

>>>>> +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
>>>>> +                                       netdev_features_t features)
>>>>> +{
>>>>> +       struct sk_buff *segs = ERR_PTR(-EINVAL);
>>>>> +       struct sctphdr *sh;
>>>>> +
>>>>> +       sh = sctp_hdr(skb);
>>>>> +       if (!pskb_may_pull(skb, sizeof(*sh)))
>>>>> +               goto out;
>>>>> +
>>>>> +       __skb_pull(skb, sizeof(*sh));
>>>>> +
>>>>> +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
>>>>> +               /* Packet is from an untrusted source, reset gso_segs.
>>>>> */
>>>>> +               int type = skb_shinfo(skb)->gso_type;
>>>>> +
>>>>> +               if (unlikely(type &
>>>>> +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
>>>>> +                              0) ||
>>>>> +                            !(type & (SKB_GSO_SCTP))))
>>>>> +                       goto out;
>>>>> +
>>>>> +               /* This should not happen as no NIC has SCTP GSO
>>>>> +                * offloading, it's always via software and thus we
>>>>> +                * won't send a large packet down the stack.
>>>>> +                */
>>>>> +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is
>>>>> not supported.");
>>>>> +               goto out;
>>>>> +       }
>>>>> +
>>>>
>>>>
>>>> So what you are going to end up needing here is some way to tell the
>>>> hardware that you are doing the checksum no matter what.  There is no
>>>> value in you computing a 1's complement checksum for the payload if
>>>> you aren't going to use it.  What you can probably do is just clear
>>>> the standard checksum flags and then OR in NETIF_F_HW_CSUM if
>>>> NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
>>>> offloading the checksum.
>>>
>>>
>>> Interesting, ok
>>>
>>>> One other bit that will make this more complicated is if we ever get
>>>> around to supporting SCTP in tunnels.  Then we will need to sort out
>>>> how things like remote checksum offload should impact SCTP, and how to
>>>> deal with needing to compute both a CRC and 1's complement checksum.
>>>> What we would probably need to do is check for encap_hdr_csum and if
>>>> it is set and we are doing SCTP then we would need to clear the
>>>> NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.
>>>
>>>
>>> Yup. And that includes storing pointers to where to store each of it.
>>
>>
>> Actually the pointers bit is easy.  The csum_start and csum_offset
>> values should be set up after you have segmented the skb and should be
>> updated after the skb has been segmented.  If nothing else you can
>> probably take a look at the TCP code tcp_gso_segment and
>> tcp4_gso_segment for inspiration.  Basically you need to make sure
>> that you set the ip_summed, csum_start, and csum_offset values for
>> your first frame before you start segmenting it into multiple frames.
>
>
> Ah yes, ok, that's for now, when not doing CRC offloading together with
> some checksum offloading (tunnel).

Actually that would be regardless of tunnel offloading.  We don't
store the outer checksum offsets.  If we need outer checksum we
restore them after the fact since the inner checksum offsets are
needed as part of the inner header TCP checksum computation.

>>>>> +       segs = skb_segment(skb, features);
>>>>> +       if (IS_ERR(segs))
>>>>> +               goto out;
>>>>> +
>>>>> +       /* All that is left is update SCTP CRC if necessary */
>>>>> +       for (skb = segs; skb; skb = skb->next) {
>>>>> +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
>>>>> +                       sh = sctp_hdr(skb);
>>>>> +                       sh->checksum = sctp_gso_make_checksum(skb);
>>>>> +               }
>>>>> +       }
>>>>> +
>>>>
>>>>
>>>> Okay, so it looks like you are doing the right thing here and leaving
>>>> this as CHECKSUM_PARTIAL.
>>>
>>>
>>> Actually no then. sctp_gso_make_checksum() replaces it:
>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>>> +{
>>> +       skb->ip_summed = CHECKSUM_NONE;
>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
>>>
>>> Why again would have to leave it as CHECKSUM_PARTIAL? IP header?
>>
>>
>> My earlier comment is actually incorrect.  This section is pretty much
>> broken since CHECKSUM_PARTIAL only reflects a 1's complement checksum
>> in the case of skb_segment, so whatever the value is, it is worthless.
>> CHECKSUM_PARTIAL is used to indicate if a given frame needs to be
>> offloaded.  It is meant to let the device know that it still needs to
>> compute a checksum or CRC beginning at csum_start and then storing the
>> new value at csum_offset.  However for skb_segment it is actually
>> referring to a 1's complement checksum and if it returns CHECKSUM_NONE
>> it means it is stored in skb->csum which would really wreck things for
>> you since that was your skb->csum_start and skb->csum_offset values.
>> I have a patch to change this so that we update a checksum in the
>> SKB_GSO_CB, but I wasn't planning on submitting that until net-next
>> opens.
>
>
> sctp currently ignores skb->csum. It doesn't mess with the crc but computing
> it is at least not optimal, yes.

Actually sctp sets csum_start and csum_offset if it sets
CHECKSUM_PARTIAL.  So it does mess with skb->csum since it is
contained in a union with those two fields.
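
The aliasing Alexander describes comes straight from the sk_buff layout;
a userspace sketch of the same union, assuming the field widths used in
include/linux/skbuff.h:

```c
#include <stdint.h>

/* Mirrors sk_buff's checksum fields: csum shares storage with the
 * (csum_start, csum_offset) pair, so filling in the offsets for
 * CHECKSUM_PARTIAL clobbers any running sum kept in skb->csum, and
 * vice versa.  Requires C11 anonymous struct/union support. */
struct csum_fields {
	union {
		uint32_t csum;
		struct {
			uint16_t csum_start;
			uint16_t csum_offset;
		};
	};
};
```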

>> In the case of SCTP you probably don't even need to bother checking
>> the value since it is meaningless as skb_segment doesn't know how to
>> do an SCTP checksum anyway.  To that end for now what you could do is
>> just set NETIF_F_HW_CSUM.  This way skb_segment won't go and try to
>> compute a 1's complement checksum on the payload since there is no
>> actual need for it.
>
>
> Nice, ok.
>
>> One other bit you will need to do is to check the value of NETIF_F_SCTP_CRC
>> outside of skb_segment.  You might look at how
>> __skb_udp_tunnel_segment does this to populate its own offload_csum
>> boolean value, though you would want to use features, not
>> skb->dev->features as that is a bit of a workaround since features is
>> stripped by hw_enc_features in some paths if I recall correctly.
>
>>
>>
>> Once the frames are segmented and if you don't support the offload you
>> could then call gso_make_crc32c() or whatever you want to name it to
>> perform the CRC calculation and populate the field.  One question by
>
>
> Hmmm.. does it mean that we can use CHECKSUM_PARTIAL even if CRC
> offloading is not possible? The packet will not be offloaded in the end,
> yes, but this solves my questions above. Then while doing GSO, it
> re-evaluates whether it can offload the CRC or not?

If you compute the CRC you set CHECKSUM_NONE, if you want the device
to do it on transmit you should set CHECKSUM_PARTIAL.

>> the way.  Don't you need to initialize the checksum value to 0 before
>> you compute it?  I think you might have missed that step when you were
>> setting this up.
>
>
> It's fine :) sctp_compute_cksum will replace it with zeroes, calculate,
> and put back the old value, which we then overwrite with the new one in
> sctp_gso_segment.

Right, but there are scenarios where this will be offloaded, aren't
there?  You would probably be better off setting the CRC to 0 before
you start segmentation and then that way you can either just set
csum_offset, csum_start and ip_summed if the lower device supports
SCTP CRC offload, otherwise you can just compute it without the need
to write the 0 into the header.


* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
  2016-02-01 17:03             ` Alexander Duyck
@ 2016-02-01 17:41               ` Marcelo Ricardo Leitner
  -1 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-02-01 17:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

On 01-02-2016 15:03, Alexander Duyck wrote:
 > On Mon, Feb 1, 2016 at 8:22 AM, Marcelo Ricardo Leitner
 > <marcelo.leitner@gmail.com> wrote:
 >> On 30-01-2016 02:07, Alexander Duyck wrote:
 >>>
 >>> On Fri, Jan 29, 2016 at 11:42 AM, Marcelo Ricardo Leitner
 >>> <marcelo.leitner@gmail.com> wrote:
 >>>>
 >>>> On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
 >>>>>
 >>>>> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
 >>
 >> ...
 >>
 >>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
 >>>>>> index
 >>>>>> 8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5
 >>>>>> 100644
 >>>>>> --- a/net/core/dev.c
 >>>>>> +++ b/net/core/dev.c
 >>>>>> @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
 >>>>>>    static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
 >>>>>>    {
 >>>>>>           if (tx_path)
 >>>>>> -               return skb->ip_summed != CHECKSUM_PARTIAL;
 >>>>>> +               /* FIXME: Why only packets with checksum offloading are
 >>>>>> +                * supported for GSO?
 >>>>>> +                */
 >>>>>> +               return skb->ip_summed != CHECKSUM_PARTIAL &&
 >>>>>> +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
 >>>>>>           else
 >>>>>>                   return skb->ip_summed == CHECKSUM_NONE;
 >>>>>>    }
 >>>>>
 >>>>>
 >>>>> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
 >>>>> transmit path a little while ago.  Please don't reintroduce it.
 >>>>
 >>>>
 >>>> Can you give me some pointers on that? I cannot find such change.
 >>>> skb_needs_check() seems to be like that since beginning.
 >>>
 >>>
 >>> Maybe you need to update your kernel.  All this stuff was changed in
 >>> December and has been this way for a little while now.
 >>>
 >>> Commits:
 >>> 7a6ae71b24905 "net: Elaborate on checksum offload interface description"
 >>> 253aab0597d9e "fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload"
 >>> 53692b1de419c "sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC"
 >>>
 >>> The main reason I even noticed it is because of some of the work I did
 >>> on the Intel NIC offloads.
 >>
 >>
 >> Ok I have those here, but my need here is different. I want to do GSO
 >> with packets that won't have CRC offloaded, so I shouldn't use
 >> CHECKSUM_PARTIAL but something else.
 >
 > CHECKSUM_NONE if you don't want to have any of the CRC or checksums
 > offloaded.  However as I mentioned before you will want to fake it
 > then since skb_segment assumes it is doing a 1's complement checksum
 > so you will want to pass NETIF_F_HW_CSUM as a feature flag and then set
 > CHECKSUM_NONE after the frame has been segmented.

Ok

 >> ...
 >>
 >>>>>> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
 >>>>>> index
 >>>>>> 7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26
 >>>>>> 100644
 >>>>>> --- a/net/sctp/offload.c
 >>>>>> +++ b/net/sctp/offload.c
 >>>>>> @@ -36,8 +36,61 @@
 >>>>>>    #include <net/sctp/checksum.h>
 >>>>>>    #include <net/protocol.h>
 >>>>>>
 >>>>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
 >>>>>> +{
 >>>>>> +       skb->ip_summed = CHECKSUM_NONE;
 >>>>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
 >>>>>> +}
 >>>>>> +
 >>>>>
 >>>>>
 >>>>> I really despise the naming of this bit here.  SCTP does not use a
 >>>>> checksum.  It uses a CRC.  Please don't call this a checksum as it
 >>>>> will just make the code really confusing.  I think the name should be
 >>>>> something like gso_make_crc32c.
 >>>>
 >>>>
 >>>> Agreed. SCTP code still references it as 'cksum'. I'll change that in
 >>>> another patch.
 >>>>
 >>>>> I think we need to address the CRC issues before we can really get
 >>>>> into segmentation.  Specifically we need to be able to offload SCTP
 >>>>> and FCoE in software since they both use the CHECKSUM_PARTIAL value
 >>>>> and then we can start cleaning up more of this mess and move onto
 >>>>> segmentation.
 >>>>
 >>>>
 >>>> Hm? The mess on CRC issues here is caused by this patch alone. It's
 >>>> good as it is today. And a good part of this mess is caused by trying
 >>>> to GSO without offloading CRC too.
 >>>>
 >>>> Or you mean that SCTP and FCoE should stop using CHECKSUM_* at all?
 >>>
 >>>
 >>> Well after Tom's change both SCTP and FCoE use CHECKSUM_PARTIAL.
 >>> CHECKSUM_PARTIAL is what is used to indicate to the hardware that a
 >>> checksum offload has been requested so that is what is looked for at
 >>> the driver level.
 >>
 >>
 >> SCTP was actually already using CHECKSUM_PARTIAL. That patch was just a
 >> rename in an attempt to make this crc difference more evident. Yet I'll
 >> continue the rename within sctp code.
 >
 > Yeah it was FCoE that was doing something different.
 >
 >>> My concern with all this is that we should probably be looking at
 >>> coming up with a means of offloading this in software when
 >>> skb_checksum_help is called.  Right now validate_xmit_skb doesn't have
 >>> any understanding of what to do with SCTP or FCoE and will try to just
 >>> compute a checksum for them.
 >>
 >>
 >> My worry is placed a bit earlier than that, I think. Currently I just
 >> cannot do GSO with packets that don't have checksum/CRC offloaded,
 >> because validate_xmit_skb() will complain.
 >
 > That is probably because you are passing CHECKSUM_PARTIAL instead of
 > CHECKSUM_NONE.

Other way around, but it's cool. We are pretty much on the same page 
now, I think.

 >> As NICs rarely have SCTP CRC offloading capabilities, I'm thinking it
 >> makes sense to do GSO even without CRC offloaded. After all, it doesn't
 >> matter much at which stage we compute the CRC, as we are computing it
 >> anyway.
 >
 > Agreed.  You will need to support CHECKSUM_PARTIAL being passed to a
 > device that doesn't support SCTP first.  That way you can start
 > looking at just always setting CHECKSUM_PARTIAL in the transport layer
 > which is really needed if you want to do SCO (SCTP Segmentation
 > Offload) in the first place.  Once you have that you could then start
 > looking at doing the SCO since from that point on you should already
 > be in good shape to address those types of issues.  You should probably
 > use the csum_offset value in the skb in order to flag if this is
 > possibly SCTP.  As far as I know for now there shouldn't be any other
 > protocols that are using the same offset, and if needed you can
 > actually parse the headers to verify if the frame is actually SCTP.

Cool, yes.
We just cannot always set CHECKSUM_PARTIAL, because if the frame is not
GSO we will not have another chance to fill in the SCTP CRC if it's not
offloaded. A check still has to account for that, but np.

 >>>>>> +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
 >>>>>> +                                       netdev_features_t features)
 >>>>>> +{
 >>>>>> +       struct sk_buff *segs = ERR_PTR(-EINVAL);
 >>>>>> +       struct sctphdr *sh;
 >>>>>> +
 >>>>>> +       sh = sctp_hdr(skb);
 >>>>>> +       if (!pskb_may_pull(skb, sizeof(*sh)))
 >>>>>> +               goto out;
 >>>>>> +
 >>>>>> +       __skb_pull(skb, sizeof(*sh));
 >>>>>> +
 >>>>>> +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
 >>>>>> +               /* Packet is from an untrusted source, reset gso_segs.
 >>>>>> */
 >>>>>> +               int type = skb_shinfo(skb)->gso_type;
 >>>>>> +
 >>>>>> +               if (unlikely(type &
 >>>>>> +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
 >>>>>> +                              0) ||
 >>>>>> +                            !(type & (SKB_GSO_SCTP))))
 >>>>>> +                       goto out;
 >>>>>> +
 >>>>>> +               /* This should not happen as no NIC has SCTP GSO
 >>>>>> +                * offloading, it's always via software and thus we
 >>>>>> +                * won't send a large packet down the stack.
 >>>>>> +                */
 >>>>>> +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is
 >>>>>> not supported.");
 >>>>>> +               goto out;
 >>>>>> +       }
 >>>>>> +
 >>>>>
 >>>>>
 >>>>> So what you are going to end up needing here is some way to tell the
 >>>>> hardware that you are doing the checksum no matter what.  There is no
 >>>>> value in you computing a 1's complement checksum for the payload if
 >>>>> you aren't going to use it.  What you can probably do is just clear
 >>>>> the standard checksum flags and then OR in NETIF_F_HW_CSUM if
 >>>>> NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
 >>>>> offloading the checksum.
 >>>>
 >>>>
 >>>> Interesting, ok
 >>>>
 >>>>> One other bit that will make this more complicated is if we ever get
 >>>>> around to supporting SCTP in tunnels.  Then we will need to sort out
 >>>>> how things like remote checksum offload should impact SCTP, and how
 >>>>> to deal with needing to compute both a CRC and 1's complement
 >>>>> checksum.
 >>>>> What we would probably need to do is check for encap_hdr_csum and if
 >>>>> it is set and we are doing SCTP then we would need to clear the
 >>>>> NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.
 >>>>
 >>>>
 >>>> Yup. And that includes storing pointers to where to store each of it.
 >>>
 >>>
 >>> Actually the pointers bit is easy.  The csum_start and csum_offset
 >>> values should be set up after you have segmented the skb and should be
 >>> updated after the skb has been segmented.  If nothing else you can
 >>> probably take a look at the TCP code tcp_gso_segment and
 >>> tcp4_gso_segment for inspiration.  Basically you need to make sure
 >>> that you set the ip_summed, csum_start, and csum_offset values for
 >>> your first frame before you start segmenting it into multiple frames.
 >>
 >>
 >> Ah yes, ok, that's for now, when not doing CRC offloading together
 >> with some checksum offloading (tunnel).
 >
 > Actually that would be regardless of tunnel offloading.  We don't
 > store the outer checksum offsets.  If we need outer checksum we
 > restore them after the fact since the inner checksum offsets are
 > needed as part of the inner header TCP checksum computation.

Hm okay

 >>>>>> +       segs = skb_segment(skb, features);
 >>>>>> +       if (IS_ERR(segs))
 >>>>>> +               goto out;
 >>>>>> +
 >>>>>> +       /* All that is left is update SCTP CRC if necessary */
 >>>>>> +       for (skb = segs; skb; skb = skb->next) {
 >>>>>> +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
 >>>>>> +                       sh = sctp_hdr(skb);
 >>>>>> +                       sh->checksum = sctp_gso_make_checksum(skb);
 >>>>>> +               }
 >>>>>> +       }
 >>>>>> +
 >>>>>
 >>>>>
 >>>>> Okay, so it looks like you are doing the right thing here and leaving
 >>>>> this as CHECKSUM_PARTIAL.
 >>>>
 >>>>
 >>>> Actually no then. sctp_gso_make_checksum() replaces it:
 >>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
 >>>> +{
 >>>> +       skb->ip_summed = CHECKSUM_NONE;
 >>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
 >>>>
 >>>> Why again would have to leave it as CHECKSUM_PARTIAL? IP header?
 >>>
 >>>
 >>> My earlier comment is actually incorrect.  This section is pretty much
 >>> broken since CHECKSUM_PARTIAL only reflects a 1's complement checksum
 >>> in the case of skb_segment, so whatever the value is, it is worthless.
 >>> CHECKSUM_PARTIAL is used to indicate if a given frame needs to be
 >>> offloaded.  It is meant to let the device know that it still needs to
 >>> compute a checksum or CRC beginning at csum_start and then storing the
 >>> new value at csum_offset.  However for skb_segment it is actually
 >>> referring to a 1's complement checksum and if it returns CHECKSUM_NONE
 >>> it means it is stored in skb->csum which would really wreck things for
 >>> you since that was your skb->csum_start and skb->csum_offset values.
 >>> I have a patch to change this so that we update a checksum in the
 >>> SKB_GSO_CB, but I wasn't planning on submitting that until net-next
 >>> opens.
 >>
 >>
 >> sctp currently ignores skb->csum. It doesn't mess with the CRC, but
 >> computing it is at least not optimal, yes.
 >
 > Actually sctp sets csum_start and csum_offset if it sets
 > CHECKSUM_PARTIAL.  So it does mess with skb->csum since it is
 > contained in a union with those two fields.

Well, yes, but the point was that the messed-up value is not used for
anything useful later on.
I'll implement the NETIF_F_HW_CSUM trick.

 >>> In the case of SCTP you probably don't even need to bother checking
 >>> the value since it is meaningless as skb_segment doesn't know how to
 >>> do an SCTP checksum anyway.  To that end for now what you could do is
 >>> just set NETIF_F_HW_CSUM.  This way skb_segment won't go and try to
 >>> compute a 1's complement checksum on the payload since there is no
 >>> actual need for it.
 >>
 >>
 >> Nice, ok.
 >>
 >>> One other bit you will need to do is to check the value of NETIF_F_SCTP_CRC
 >>> outside of skb_segment.  You might look at how
 >>> __skb_udp_tunnel_segment does this to populate its own offload_csum
 >>> boolean value, though you would want to use features, not
 >>> skb->dev->features as that is a bit of a workaround since features is
 >>> stripped by hw_enc_features in some paths if I recall correctly.
 >>
 >>>
 >>>
 >>> Once the frames are segmented and if you don't support the offload you
 >>> could then call gso_make_crc32c() or whatever you want to name it to
 >>> perform the CRC calculation and populate the field.  One question by
 >>
 >>
 >> Hmmm.. does it mean that we can use CHECKSUM_PARTIAL even if CRC
 >> offloading is not possible? The packet will not be offloaded in the
 >> end, yes, but this solves my questions above. Then while doing GSO, it
 >> re-evaluates whether it can offload the CRC or not?
 >
 > If you compute the CRC you set CHECKSUM_NONE, if you want the device
 > to do it on transmit you should set CHECKSUM_PARTIAL.

Okay

 >>> the way.  Don't you need to initialize the checksum value to 0 before
 >>> you compute it?  I think you might have missed that step when you were
 >>> setting this up.
 >>
 >>
 >> It's fine :) sctp_compute_cksum will replace it with zeroes, calculate,
 >> and put back the old value, which we then overwrite with the new one in
 >> sctp_gso_segment.
 >
 > Right, but there are scenarios where this will be offloaded, aren't
 > there?  You would probably be better off setting the CRC to 0 before
 > you start segmentation and then that way you can either just set
 > csum_offset, csum_start and ip_summed if the lower device supports
 > SCTP CRC offload, otherwise you can just compute it without the need
 > to write the 0 into the header.

Ahh, it's also zeroed when the header is constructed. There is 
'sh->checksum = 0;' in sctp_packet_transmit for this.

I'll look into moving the decision on whether to offload the CRC into the
segmentation moment. I think it will have to be done twice, actually, for
SCTP reasons: for example, if a packet will be fragmented by IP, SCTP
currently doesn't allow offloading the CRC computation. I'll check, then
post a v2. I think at least the CRC offloading is now clarified. Thanks
Alex.

Marcelo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH net-next 3/3] sctp: Add GSO support
@ 2016-02-01 17:41               ` Marcelo Ricardo Leitner
  0 siblings, 0 replies; 49+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-02-01 17:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, Neil Horman, Vlad Yasevich, David Miller,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal, pabeni,
	John Fastabend, linux-sctp, Tom Herbert

Em 01-02-2016 15:03, Alexander Duyck escreveu:
 > On Mon, Feb 1, 2016 at 8:22 AM, Marcelo Ricardo Leitner
 > <marcelo.leitner@gmail.com> wrote:
 >> Em 30-01-2016 02:07, Alexander Duyck escreveu:
 >>>
 >>> On Fri, Jan 29, 2016 at 11:42 AM, Marcelo Ricardo Leitner
 >>> <marcelo.leitner@gmail.com> wrote:
 >>>>
 >>>> On Fri, Jan 29, 2016 at 11:15:54AM -0800, Alexander Duyck wrote:
 >>>>>
 >>>>> On Wed, Jan 27, 2016 at 9:06 AM, Marcelo Ricardo Leitner
 >>
 >> ...
 >>
 >>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
 >>>>>> index
 >>>>>> 
8cba3d852f251c503b193823b71b27aaef3fb3ae..9583284086967c0746de5f553535e25e125714a5
 >>>>>> 100644
 >>>>>> --- a/net/core/dev.c
 >>>>>> +++ b/net/core/dev.c
 >>>>>> @@ -2680,7 +2680,11 @@ EXPORT_SYMBOL(skb_mac_gso_segment);
 >>>>>>    static inline bool skb_needs_check(struct sk_buff *skb, bool 
tx_path)
 >>>>>>    {
 >>>>>>           if (tx_path)
 >>>>>> -               return skb->ip_summed != CHECKSUM_PARTIAL;
 >>>>>> +               /* FIXME: Why only packets with checksum 
offloading are
 >>>>>> +                * supported for GSO?
 >>>>>> +                */
 >>>>>> +               return skb->ip_summed != CHECKSUM_PARTIAL &&
 >>>>>> +                      skb->ip_summed != CHECKSUM_UNNECESSARY;
 >>>>>>           else
 >>>>>>                   return skb->ip_summed = CHECKSUM_NONE;
 >>>>>>    }
 >>>>>
 >>>>>
 >>>>> Tom Herbert just got rid of the use of CHECKSUM_UNNECESSARY in the
 >>>>> transmit path a little while ago.  Please don't reintroduce it.
 >>>>
 >>>>
 >>>> Can you give me some pointers on that? I cannot find such change.
 >>>> skb_needs_check() seems to be like that since beginning.
 >>>
 >>>
 >>> Maybe you need to update your kernel.  All this stuff was changed in
 >>> December and has been this way for a little while now.
 >>>
 >>> Commits:
 >>> 7a6ae71b24905 "net: Elaborate on checksum offload interface 
description"
 >>> 253aab0597d9e "fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload"
 >>> 53692b1de419c "sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC"
 >>>
 >>> The main reason I even noticed it is because of some of the work I did
 >>> on the Intel NIC offloads.
 >>
 >>
 >> Ok I have those here, but my need here is different. I want to do 
GSO with
 >> packets that won't have CRC offloaded, so I shouldn't use 
CHECKSUM_PARTIAL
 >> but something else.
 >
 > CHECKSUM_NONE if you don't want to have any of the CRC or checksums
 > offloaded.  However as I mentioned before you will want to fake it
 > then since skb_segment assumes it is doing a 1's compliment checksum
 > so you will want to pass NET_F_HW_CSUM as a feature flag and then set
 > CHECKSUM_NONE after the frame has been segmented.

Ok

 >> ...
 >>
 >>>>>> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
 >>>>>> index
 >>>>>> 
7080a6318da7110c1688dd0c5bb240356dbd0cd3..3b96035fa180a4e7195f7b6e7a8be7b97c8f8b26
 >>>>>> 100644
 >>>>>> --- a/net/sctp/offload.c
 >>>>>> +++ b/net/sctp/offload.c
 >>>>>> @@ -36,8 +36,61 @@
 >>>>>>    #include <net/sctp/checksum.h>
 >>>>>>    #include <net/protocol.h>
 >>>>>>
 >>>>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
 >>>>>> +{
 >>>>>> +       skb->ip_summed = CHECKSUM_NONE;
 >>>>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
 >>>>>> +}
 >>>>>> +
 >>>>>
 >>>>>
 >>>>> I really despise the naming of this bit here.  SCTP does not use a
 >>>>> checksum.  It uses a CRC.  Please don't call this a checksum as it
 >>>>> will just make the code really confusing.   I think the name 
should be
 >>>>> something like gso_make_crc32c.
 >>>>
 >>>>
 >>>> Agreed. SCTP code still references it as 'cksum'. I'll change that in
 >>>> another patch.
 >>>>
 >>>>> I think we need to address the CRC issues before we can really get
 >>>>> into segmentation.  Specifically we need to be able to offload SCTP
 >>>>> and FCoE in software since they both use the CHECKSUM_PARTIAL value
 >>>>> and then we can start cleaning up more of this mess and move onto
 >>>>> segmentation.
 >>>>
 >>>>
 >>>> Hm? The mess on CRC issues here is caused by this patch alone. 
It's good
 >>>> as it is today. And a good part of this mess is caused by trying 
to GSO
 >>>> without offloading CRC too.
 >>>>
 >>>> Or you mean that SCTP and FCoE should stop using CHECKSUM_* at all?
 >>>
 >>>
 >>> Well after Tom's change both SCTP and FCoE use CHECKSUM_PARTIAL.
 >>> CHECKSUM_PARTIAL is what is used to indicate to the hardware that a
 >>> checksum offload has been requested so that is what is looked for at
 >>> the driver level.
 >>
 >>
 >> SCTP was actually already using CHECKSUM_PARTIAL. That patch was just a
 >> rename in an attempt to make this crc difference more evident. Yet I'll
 >> continue the rename within sctp code.
 >
 > Yeah it was FCoE that was doing something different.
 >
 >>> My concern with all this is that we should probably be looking at
 >>> coming up with a means of offloading this in software when
 >>> skb_checksum_help is called.  Right now validate_xmit_skb doesn't have
 >>> any understanding of what to do with SCTP or FCoE and will try to just
 >>> compute a checksum for them.
 >>
 >>
 >> My worry is placed a bit earlier than that, I think. Currently I 
just cannot
 >> do GSO with packets that doesn't have checksum/crc offloaded too because
 >> validate_xmit_skb() will complain.
 >
 > That is probably because you are passing CHECKSUM_PARTIAL instead of
 > CHECKSUM_NONE.

Other way around, but it's cool. We are pretty much on the same page 
now, I think.

 >> As NICs hardly have sctp crc offloading capabilities, I'm thinking 
it makes
 >> sense to do GSO even without crc offloaded. After all, it doesn't matter
 >> much in which stage we are computing the crc as we are computing it 
anyway.
 >
 > Agreed.  You will need to support CHECKSUM_PARTIAL being passed to a
 > device that doesn't support SCTP first.  That way you can start
 > looking at just always setting CHECKSUM_PARTIAL in the transport layer
 > which is really needed if you want to do SCO (SCTP Segmentation
 > Offload) in the first place.  Once you have that you could then start
 > looking at doing the SCO since from that point on you should already
 > be in good shape to address those type of issues.  You should probably
 > use the csum_offset value in the skb in order to flag if this is
 > possibly SCTP.  As far as I know for now there shouldn't be any other
 > protocols that are using the same offset, and if needed you can
 > actually parse the headers to verify if the frame is actually SCTP.

Cool, yes.
We just cannot always set CHECKSUM_PARTIAL, because if the frame is not GSO 
we will not get another chance to fill in the SCTP CRC when it's not 
offloaded. A check still has to account for that, but np.

 >>>>>> +static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
 >>>>>> +                                       netdev_features_t features)
 >>>>>> +{
 >>>>>> +       struct sk_buff *segs = ERR_PTR(-EINVAL);
 >>>>>> +       struct sctphdr *sh;
 >>>>>> +
 >>>>>> +       sh = sctp_hdr(skb);
 >>>>>> +       if (!pskb_may_pull(skb, sizeof(*sh)))
 >>>>>> +               goto out;
 >>>>>> +
 >>>>>> +       __skb_pull(skb, sizeof(*sh));
 >>>>>> +
 >>>>>> +       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
 >>>>>> +               /* Packet is from an untrusted source, reset gso_segs. */
 >>>>>> +               int type = skb_shinfo(skb)->gso_type;
 >>>>>> +
 >>>>>> +               if (unlikely(type &
 >>>>>> +                            ~(SKB_GSO_SCTP | SKB_GSO_DODGY |
 >>>>>> +                              0) ||
 >>>>>> +                            !(type & (SKB_GSO_SCTP))))
 >>>>>> +                       goto out;
 >>>>>> +
 >>>>>> +               /* This should not happen as no NIC has SCTP GSO
 >>>>>> +                * offloading, it's always via software and thus we
 >>>>>> +                * won't send a large packet down the stack.
 >>>>>> +                */
 >>>>>> +               WARN_ONCE(1, "SCTP segmentation offloading to NICs is not supported.");
 >>>>>> +               goto out;
 >>>>>> +       }
 >>>>>> +
 >>>>>
 >>>>>
 >>>>> So what you are going to end up needing here is some way to tell the
 >>>>> hardware that you are doing the checksum no matter what.  There is no
 >>>>> value in you computing a 1's complement checksum for the payload if
 >>>>> you aren't going to use it.  What you can probably do is just clear
 >>>>> the standard checksum flags and then OR in NETIF_F_HW_CSUM if
 >>>>> NETIF_F_SCTP_CRC is set and that should get skb_segment to skip
 >>>>> offloading the checksum.
 >>>>
 >>>>
 >>>> Interesting, ok
 >>>>
 >>>>> One other bit that will make this more complicated is if we ever get
 >>>>> around to supporting SCTP in tunnels.  Then we will need to sort out
 >>>>> how things like remote checksum offload should impact SCTP, and how to
 >>>>> deal with needing to compute both a CRC and 1's complement checksum.
 >>>>> What we would probably need to do is check for encap_hdr_csum and if
 >>>>> it is set and we are doing SCTP then we would need to clear the
 >>>>> NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM flags.
 >>>>
 >>>>
 >>>> Yup. And that includes storing pointers to where each of them should go.
 >>>
 >>>
 >>> Actually the pointers bit is easy.  The csum_start and csum_offset
 >>> values should be set up after you have segmented the skb and should be
 >>> updated after the skb has been segmented.  If nothing else you can
 >>> probably take a look at the TCP code tcp_gso_segment and
 >>> tcp4_gso_segment for inspiration.  Basically you need to make sure
 >>> that you set the ip_summed, csum_start, and csum_offset values for
 >>> your first frame before you start segmenting it into multiple frames.
 >>
 >>
 >> Ah yes, ok, that's for now, when not doing crc offloading with some chksum
 >> offloading (tunnel) too.
 >
 > Actually that would be regardless of tunnel offloading.  We don't
 > store the outer checksum offsets.  If we need outer checksum we
 > restore them after the fact since the inner checksum offsets are
 > needed as part of the inner header TCP checksum computation.

Hm okay

 >>>>>> +       segs = skb_segment(skb, features);
 >>>>>> +       if (IS_ERR(segs))
 >>>>>> +               goto out;
 >>>>>> +
 >>>>>> +       /* All that is left is update SCTP CRC if necessary */
 >>>>>> +       for (skb = segs; skb; skb = skb->next) {
 >>>>>> +               if (skb->ip_summed != CHECKSUM_PARTIAL) {
 >>>>>> +                       sh = sctp_hdr(skb);
 >>>>>> +                       sh->checksum = sctp_gso_make_checksum(skb);
 >>>>>> +               }
 >>>>>> +       }
 >>>>>> +
 >>>>>
 >>>>>
 >>>>> Okay, so it looks like you are doing the right thing here and leaving
 >>>>> this as CHECKSUM_PARTIAL.
 >>>>
 >>>>
 >>>> Actually no then. sctp_gso_make_checksum() replaces it:
 >>>> +static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
 >>>> +{
 >>>> +       skb->ip_summed = CHECKSUM_NONE;
 >>>> +       return sctp_compute_cksum(skb, skb_transport_offset(skb));
 >>>>
 >>>> Why again would have to leave it as CHECKSUM_PARTIAL? IP header?
 >>>
 >>>
 >>> My earlier comment is actually incorrect.  This section is pretty much
 >>> broken since CHECKSUM_PARTIAL only reflects a 1's complement checksum
 >>> in the case of skb_segment, so whatever the value, it is worthless.
 >>> CHECKSUM_PARTIAL is used to indicate if a given frame needs to be
 >>> offloaded.  It is meant to let the device know that it still needs to
 >>> compute a checksum or CRC beginning at csum_start and then storing the
 >>> new value at csum_offset.  However for skb_segment it is actually
 >>> referring to a 1's complement checksum, and if it returns CHECKSUM_NONE
 >>> it means it is stored in skb->csum, which would really wreck things for
 >>> you since it overlaps your skb->csum_start and skb->csum_offset values.
 >>> I have a patch to change this so that we update a checksum in the
 >>> SKB_GSO_CB, but I wasn't planning on submitting that until net-next
 >>> opens.
 >>
 >>
 >> sctp currently ignores skb->csum. It doesn't mess with the crc but computing
 >> it is at least not optimal, yes.
 >
 > Actually sctp sets csum_start and csum_offset if it sets
 > CHECKSUM_PARTIAL.  So it does mess with skb->csum since it is
 > contained in a union with those two fields.

Well, yes, but the point was that the messed-up value is not used for 
anything useful later on.
I'll implement the NETIF_F_HW_CSUM trick.

 >>> In the case of SCTP you probably don't even need to bother checking
 >>> the value since it is meaningless as skb_segment doesn't know how to
 >>> do an SCTP checksum anyway.  To that end for now what you could do is
 >>> just set NETIF_F_HW_CSUM.  This way skb_segment won't go and try to
 >>> compute a 1's complement checksum on the payload since there is no
 >>> actual need for it.
 >>
 >>
 >> Nice, ok.
 >>
 >>> One other bit you will need to do is to check the value of SCTP_CRC
 >>> outside of skb_segment.  You might look at how
 >>> __skb_udp_tunnel_segment does this to populate its own offload_csum
 >>> boolean value, though you would want to use features, not
 >>> skb->dev->features as that is a bit of a workaround since features is
 >>> stripped by hw_enc_features in some paths if I recall correctly.
 >>
 >>>
 >>>
 >>> Once the frames are segmented and if you don't support the offload you
 >>> could then call gso_make_crc32c() or whatever you want to name it to
 >>> perform the CRC calculation and populate the field.  One question by
 >>
 >>
 >> Hmmm.. does it mean that we can use CHECKSUM_PARTIAL then even if CRC
 >> offloading is not possible then? Because the packet will not be 
offloaded in
 >> the end, yes, but this solves my questions above. Then while doing 
GSO, it
 >> re-evaluates if it can offload crc or not?
 >
 > If you compute the CRC you set CHECKSUM_NONE, if you want the device
 > to do it on transmit you should set CHECKSUM_PARTIAL.

Okay

 >>> the way.  Don't you need to initialize the checksum value to 0 before
 >>> you compute it?  I think you might have missed that step when you were
 >>> setting this up.
 >>
 >>
 >> It's fine :) sctp_compute_cksum will replace it with zeroes, calculate, and
 >> put back the old value, which then we overwrite with the new one at
 >> sctp_gso_segment.
 >
 > Right but there are scenarios where this will be offloaded isn't
 > there?  You would probably be better off setting the CRC to 0 before
 > you start segmentation and then that way you can either just set
 > csum_offset, csum_start and ip_summed if the lower device supports
 > SCTP CRC offload, otherwise you can just compute it without the need
 > to write the 0 into the header.

Ahh, it's also zeroed when the header is constructed. There is 
'sh->checksum = 0;' in sctp_packet_transmit for this.

I'll look into moving the decision on whether to offload the CRC into the 
segmentation step. I think it will actually have to be made twice, for 
sctp-specific reasons: for instance, if the packet will be fragmented by 
IP, we currently don't allow offloading the CRC computation. I'll check, 
then post a v2. I think the crc offloading part is at least clarified 
now. Thanks Alex.

Marcelo



Thread overview: 49+ messages
-- links below jump to the message on this page --
2016-01-27 17:06 [RFC PATCH net-next 0/3] sctp: add GSO support Marcelo Ricardo Leitner
2016-01-27 17:06 ` [RFC PATCH net-next 1/3] skbuff: export skb_gro_receive Marcelo Ricardo Leitner
2016-01-27 18:35   ` Eric Dumazet
2016-01-27 18:46     ` Marcelo Ricardo Leitner
2016-01-27 17:06 ` [RFC PATCH net-next 2/3] sctp: offloading support structure Marcelo Ricardo Leitner
2016-01-27 17:06 ` [RFC PATCH net-next 3/3] sctp: Add GSO support Marcelo Ricardo Leitner
2016-01-29 19:15   ` Alexander Duyck
2016-01-29 19:42     ` Marcelo Ricardo Leitner
2016-01-30  4:07       ` Alexander Duyck
2016-02-01 16:22         ` Marcelo Ricardo Leitner
2016-02-01 17:03           ` Alexander Duyck
2016-02-01 17:41             ` Marcelo Ricardo Leitner
2016-01-28 13:51 ` [RFC PATCH net-next 0/3] sctp: add " David Laight
2016-01-28 15:53   ` 'Marcelo Ricardo Leitner'
2016-01-28 17:30     ` David Laight
2016-01-28 20:55       ` 'Marcelo Ricardo Leitner'
2016-01-29 15:51         ` David Laight
2016-01-29 18:53           ` 'Marcelo Ricardo Leitner'
2016-01-29 15:57         ` David Laight
2016-01-29 16:07         ` David Laight
2016-01-28 17:54   ` Michael Tuexen
2016-01-28 21:03     ` Marcelo Ricardo Leitner
2016-01-28 23:36       ` Michael Tuexen
2016-01-29  1:18         ` Marcelo Ricardo Leitner
2016-01-29 10:57           ` Michael Tuexen
2016-01-29 11:26             ` Marcelo Ricardo Leitner
2016-01-29 12:25               ` Michael Tuexen
