* [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
@ 2015-10-23 20:50 ` Bendik Rønning Opstad
  0 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-23 20:50 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet
  Cc: Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad


This is a request for comments.

Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency of applications sending time-dependent data.
Latency-sensitive applications and services, such as online games and
remote desktop, produce traffic with thin-stream characteristics: small
packets and a relatively high inter-transmission time (ITT). By bundling
already sent data in packets with new data, RDB alleviates head-of-line
blocking by reducing the need to retransmit data segments when packets
are lost. RDB is a continuation of the work on latency improvements for
TCP in Linux, which previously resulted in two thin-stream mechanisms in
the Linux kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).
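
As a simplified illustration (the byte ranges are only an example, and the
default setting bundles data from one previous segment only):

  Without RDB:  pkt1 [0..100)   pkt2 [100..200)   pkt3 [200..300)
  With RDB:     pkt1 [0..100)   pkt2 [0..200)     pkt3 [100..300)

If pkt2 is lost, the receiver still obtains bytes 100..200 from the bundled
data in pkt3, so delivery to the application is not stalled waiting for a
retransmission of pkt2.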

The RDB implementation has been thoroughly tested, and shows
significant latency reductions when packet loss occurs[1]. The tests
also show that, by imposing restrictions on the bundling rate, RDB can
be kept from unfairly affecting competing traffic.

Note: The current patch set depends on a recently submitted patch for
tcp_skb_cb ("tcp: refactor struct tcp_skb_cb": http://patchwork.ozlabs.org/patch/510674)

These patches have been tested with a set of packetdrill scripts located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper
"Latency and Fairness Trade-Off for Thin Streams using Redundant Data
Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf


Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23 +++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   9 +-
 include/net/tcp.h                      |  34 ++++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  35 ++++
 net/ipv4/tcp.c                         |  19 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_rdb.c                     | 281 +++++++++++++++++++++++++++++++++
 12 files changed, 415 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

-- 
1.9.1


* [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50 ` Bendik Rønning Opstad
@ 2015-10-23 20:50   ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-23 20:50 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet
  Cc: Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

The existing mechanism for detecting thin streams (tcp_stream_is_thin)
is based on a static limit of fewer than 4 packets in flight. This treats
streams differently depending on the connection's RTT: a stream on a
high-RTT link may never be considered thin, whereas the same application
would produce a stream that is always thin in a low-RTT scenario
(e.g. a data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin stream
detection becomes independent of the RTT and treats streams equally based
on their transmission pattern, i.e. the inter-transmission time (ITT).
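
To sketch the calculation performed by tcp_stream_is_thin_dpifl() below
(the numbers are illustrative only):

  DPIFL = srtt / tcp_thin_dpifl_itt_lower_bound

  e.g. srtt = 100 ms, lower bound = 10 ms (10000 us)
       => DPIFL = 10, so the stream is treated as thin while fewer than
          10 packets are in flight.

A stream that transmits no more often than once per lower-bound interval
will thus roughly stay below this limit, regardless of the RTT.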

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/linux/tcp.h                    |  6 ++++++
 include/net/tcp.h                      | 20 ++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp.c                         |  3 +++
 5 files changed, 46 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 85752c8..b841a76 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -700,6 +700,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound for ITT (inter-transmission time) threshold
+	for when a stream is considered thin. The value is specified in
+	microseconds, and may not be lower than 10000 (10 ms). This threshold
+	is used to calculate a dynamic packets in flight limit (DPIFL) which
+	is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c906f45..fc885db 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -269,6 +269,12 @@ struct tcp_sock {
 	struct sk_buff* lost_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 
+	/* The limit used to identify when a stream is thin, based on a minimum
+	 * allowed inter-transmission time (ITT) in microseconds. This is used
+	 * to dynamically calculate a max packets in flight limit (DPIFL).
+	 */
+	int thin_dpifl_itt_lower_bound;
+
 	/* OOO segments go in this list. Note that socket lock must be held,
 	 * as we do not use sk_buff_head lock.
 	 */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4fc457b..6534836 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -215,6 +215,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6       /* After 6 linear retries, do exp. backoff */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000  /* Minimum lower bound is 10 ms (10000 usec) */
 
 /* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */
 #define TCP_INIT_CWND		10
@@ -274,6 +275,7 @@ extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -1631,6 +1633,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
+ *                              limit
+ * @tp: the tcp_sock struct
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
+{
+	u64 dpif_lim = tp->srtt_us >> 3;
+	/* Divide by thin_dpifl_itt_lower_bound, the minimum allowed ITT
+	 * (inter-transmission time) in usecs.
+	 */
+	do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound);
+	return tcp_packets_in_flight(tp) < dpif_lim;
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 25300c5..917fdde 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -42,6 +42,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -709,6 +710,14 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound",
+		.data		= &sysctl_tcp_thin_dpifl_itt_lower_bound,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0cfa7c0..f712d7c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -287,6 +287,8 @@ int sysctl_tcp_min_tso_segs __read_mostly = 2;
 
 int sysctl_tcp_autocorking __read_mostly = 1;
 
+int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -406,6 +408,7 @@ void tcp_init_sock(struct sock *sk)
 	u64_stats_init(&tp->syncp);
 
 	tp->reordering = sysctl_tcp_reordering;
+	tp->thin_dpifl_itt_lower_bound = sysctl_tcp_thin_dpifl_itt_lower_bound;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
-- 
1.9.1


* [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
@ 2015-10-23 20:50   ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-23 20:50 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet
  Cc: Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

RDB is a mechanism that enables a TCP sender to bundle redundant
(already sent) data with TCP packets containing new data. By bundling
(retransmitting) already sent data with each TCP packet containing new
data, the connection becomes more resistant to sporadic packet loss,
which significantly reduces the application-layer latency in congested
scenarios.

The main functionality added:

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to lack of dupACKs. This is because the loss is "repaired" by the
    redundant data in the packet coming after the lost packet. Based on
    incoming ACKs, such hidden loss events are detected, and CWR state
    is entered.

  o When packets are scheduled for transmission, RDB replaces the SKB to
    be sent with a modified SKB containing the redundant data of
    previously sent data segments from the TCP output queue (see the
    call-flow sketch below).

  o RDB will only be used for streams classified as thin by the function
    tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT
    for streams that may benefit from RDB, controlled by the sysctl
    variable tcp_thin_dpifl_itt_lower_bound.
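
A rough sketch of the call flow introduced by this patch (all functions
are defined in this series):

  tcp_write_xmit()
    -> tcp_transmit_rdb_skb()            when tp->rdb is set
         -> tcp_stream_is_thin_dpifl()   bundle only for thin streams
         -> create_rdb_skb()
              -> rdb_can_bundle_check()  decide how much may be bundled
              -> rdb_build_skb()         copy redundant + new data
         -> tcp_transmit_skb()           send rdb_skb, or xmit_skb on fallback

  tcp_in_ack_event()
    -> rdb_ack_event()                   when tp->rdb is set
         -> rdb_check_rtx_queue_loss()   detect loss hidden by bundling
         -> tcp_enter_cwr()              if such loss was detected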

RDB is enabled on a connection with the socket option TCP_RDB, or on all
new connections by setting the sysctl variable tcp_rdb=1.
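
For illustration, a minimal (hypothetical) user-space snippet enabling RDB
on a single socket could look like this; TCP_RDB is the socket option value
added to include/uapi/linux/tcp.h by this patch:

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  #ifndef TCP_RDB
  #define TCP_RDB 29  /* matches the value added by this patch */
  #endif

  static int enable_rdb(int sockfd)
  {
          int one = 1;

          /* Returns 0 on success, -1 with errno set on failure */
          return setsockopt(sockfd, IPPROTO_TCP, TCP_RDB, &one, sizeof(one));
  }

Note that, as implemented here, enabling TCP_RDB also sets tp->nonagle.
On a kernel with this series applied, RDB can instead be enabled
system-wide with e.g. "sysctl -w net.ipv4.tcp_rdb=1".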

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  15 ++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  14 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  26 +++
 net/ipv4/tcp.c                         |  16 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_rdb.c                     | 281 +++++++++++++++++++++++++++++++++
 12 files changed, 369 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index b841a76..740e6a3 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -708,6 +708,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - BOOLEAN
+	Enable RDB for all new TCP connections.
+	Default: 0
+
+tcp_rdb_max_bytes - INTEGER
+	Restricts the total number of payload bytes an RDB packet may
+	contain, counting both the bundled data and the new unsent data.
+	Default: 0 (no limit beyond the MSS)
+
+tcp_rdb_max_skbs - INTEGER
+	Restricts how many previous SKBs in the output queue RDB may
+	bundle data from. A value of 1 restricts bundling to the data from
+	the last packet that was sent; 0 means no limit on the SKB count.
+	Default: 1
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 24f4dfd..3572d21 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2809,6 +2809,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index fc885db..f38b889 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -202,9 +202,10 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 3,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
+		rdb         : 1,/* Redundant Data Bundling enabled */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6534836..dce46c2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -276,6 +276,9 @@ extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
+extern int sysctl_tcp_rdb;
+extern int sysctl_tcp_rdb_max_bytes;
+extern int sysctl_tcp_rdb_max_skbs;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -548,6 +551,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 bool tcp_may_send_now(struct sock *sk);
 int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
 int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
@@ -573,6 +578,11 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 
+/* tcp_rdb.c */
+void rdb_ack_event(struct sock *sk, u32 flags);
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask);
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
@@ -771,6 +781,7 @@ struct tcp_skb_cb {
 	union {
 		struct {
 			/* There is space for up to 20 bytes */
+			__u32 rdb_start_seq; /* Start seq of rdb data bundled */
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -1494,6 +1505,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)		\
+	skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+
 static inline struct sk_buff *tcp_send_head(const struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 65a77b0..ae0fba3 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -115,6 +115,7 @@ enum {
 #define TCP_CC_INFO		26	/* Get Congestion Control (optional) info */
 #define TCP_SAVE_SYN		27	/* Record SYN headers for new connections */
 #define TCP_SAVED_SYN		28	/* Get SYN headers recorded for connection */
+#define TCP_RDB			29	/* Enable RDB mechanism */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fab4599..544f8cc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -978,7 +978,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -986,6 +986,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(copy_skb_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index c29809f..f2cf496 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tcp_rdb.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 917fdde..703078f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -718,6 +718,32 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
 	},
 	{
+		.procname	= "tcp_rdb",
+		.data		= &sysctl_tcp_rdb,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_max_bytes",
+		.data		= &sysctl_tcp_rdb_max_bytes,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "tcp_rdb_max_skbs",
+		.data		= &sysctl_tcp_rdb_max_skbs,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f712d7c..11d45d4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -289,6 +289,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
 
 int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
+int sysctl_tcp_rdb __read_mostly;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -409,6 +411,7 @@ void tcp_init_sock(struct sock *sk)
 
 	tp->reordering = sysctl_tcp_reordering;
 	tp->thin_dpifl_itt_lower_bound = sysctl_tcp_thin_dpifl_itt_lower_bound;
+	tp->rdb = sysctl_tcp_rdb;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
@@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_RDB:
+		if (val < 0 || val > 1) {
+			err = -EINVAL;
+		} else {
+			tp->rdb = val;
+			tp->nonagle = val;
+		}
+		break;
+
 	case TCP_REPAIR:
 		if (!tcp_can_repair_sock(sk))
 			err = -EPERM;
@@ -2828,7 +2840,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		val = tp->thin_dupack;
 		break;
-
+	case TCP_RDB:
+		val = tp->rdb;
+		break;
 	case TCP_REPAIR:
 		val = tp->repair;
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fdd88c3..a4901b3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3503,6 +3503,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
 
 	if (icsk->icsk_ca_ops->in_ack_event)
 		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+
+	if (unlikely(tcp_sk(sk)->rdb))
+		rdb_ack_event(sk, flags);
 }
 
 /* This routine deals with incoming acks, but not outgoing ones. */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f6f7f9b..6d4ea7d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -900,8 +900,8 @@ out:
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-			    gfp_t gfp_mask)
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet;
@@ -2113,9 +2113,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+		if (unlikely(tcp_sk(sk)->rdb)) {
+			if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
+				break;
+		} else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
 			break;
-
+		}
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..37faf35
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,281 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+int sysctl_tcp_rdb_max_bytes __read_mostly;
+int sysctl_tcp_rdb_max_skbs __read_mostly = 1;
+
+/**
+ * rdb_check_rtx_queue_loss() - Perform loss detection by analysing acks.
+ * @sk: the socket.
+ * @seq_acked: The sequence number that was acked.
+ *
+ * Return: The number of packets that are presumed to be lost.
+ */
+static int rdb_check_rtx_queue_loss(struct sock *sk, u32 seq_acked)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb, *tmp, *prev_skb = NULL;
+	struct sk_buff *send_head = tcp_send_head(sk);
+	struct tcp_skb_cb *scb;
+	bool fully_acked = true;
+	int lost_count = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == send_head)
+			break;
+
+		scb = TCP_SKB_CB(skb);
+
+		/* Determine how many packets and what bytes were acked, no TSO
+		 * support
+		 */
+		if (after(scb->end_seq, tp->snd_una)) {
+			if (tcp_skb_pcount(skb) == 1 ||
+			    !after(tp->snd_una, scb->seq)) {
+				break;
+			}
+
+			/* We do not handle SKBs with gso_segs */
+			if (tcp_skb_pcount(skb))
+				break;
+			fully_acked = false;
+		}
+
+		/* Acks up to this SKB */
+		if (scb->end_seq == seq_acked) {
+			/* This SKB carried bundled RDB data that also
+			 * covered data from previous SKBs
+			 */
+			if (TCP_SKB_CB(skb)->tx.rdb_start_seq != scb->seq &&
+			    prev_skb) {
+				/* Count the previous packets that were only
+				 * acked via the bundled data (presumed lost)
+				 */
+				tcp_for_write_queue(tmp, sk) {
+					/* We have reached the acked SKB */
+					if (tmp == skb)
+						break;
+					lost_count++;
+				}
+			}
+			break;
+		}
+		if (!fully_acked)
+			break;
+		prev_skb = skb;
+	}
+	return lost_count;
+}
+
+/**
+ * rdb_ack_event() - Initiate loss detection
+ * @sk: the socket
+ * @flags: The flags
+ */
+void rdb_ack_event(struct sock *sk, u32 flags)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	if (rdb_check_rtx_queue_loss(sk, tp->snd_una))
+		tcp_enter_cwr(sk);
+}
+
+/**
+ * skb_append_data() - Copy data from an SKB to the end of another
+ * @from_skb: The SKB to copy data from
+ * @to_skb: The SKB to copy data to
+ */
+static int skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	/* Copy the linear data and the data from the frags into the linear page
+	 * buffer of to_skb.
+	 */
+	if (WARN_ON(skb_copy_bits(from_skb, 0,
+				  skb_put(to_skb, from_skb->len),
+				  from_skb->len))) {
+		goto fault;
+	}
+
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+	return 0;
+fault:
+	return -EFAULT;
+}
+
+/**
+ * rdb_build_skb() - Builds the new RDB SKB and copies all the data into the
+ *                   linear page buffer.
+ * @sk: the socket
+ * @xmit_skb: This is the SKB that tcp_write_xmit wants to send
+ * @first_skb: The first SKB in the output queue we will bundle
+ * @bytes_in_rdb_skb: The total number of data bytes for the new rdb_skb
+ *                    (new + redundant)
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory allocation
+ *         failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, true);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	tmp_skb = first_skb;
+
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		/* Copy data from tmp_skb to rdb_skb */
+		if (skb_append_data(tmp_skb, rdb_skb))
+			return NULL;
+		/* We are at the last skb that should be included (The unsent
+		 * one)
+		 */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_check() - check if redundant data can be bundled
+ * @sk: the socket
+ * @xmit_skb: The SKB processed for transmission by the output engine
+ * @mss_now: The current mss value
+ * @bytes_in_rdb_skb: Will contain the resulting number of bytes to bundle
+ *                         at exit.
+ * @skbs_to_bundle_count: The total number of SKBs to be in the bundle
+ *
+ * Traverses the entire write queue and checks if any un-acked data
+ * may be bundled.
+ *
+ * Return: The first SKB to be in the bundle, or NULL if no bundling
+ */
+static struct sk_buff *rdb_can_bundle_check(const struct sock *sk,
+					    struct sk_buff *xmit_skb,
+					    unsigned int mss_now,
+					    u32 *bytes_in_rdb_skb,
+					    u32 *skbs_to_bundle_count)
+{
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* 1 to account for current skb */
+	u32 byte_count = xmit_skb->len;
+
+	/* We start at the skb before xmit_skb, and go backwards in the list.*/
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		/* Not enough room to bundle data from this SKB */
+		if ((byte_count + skb->len) > mss_now)
+			break;
+
+		if (sysctl_tcp_rdb_max_bytes &&
+		    ((byte_count + skb->len) > sysctl_tcp_rdb_max_bytes))
+			break;
+
+		if (sysctl_tcp_rdb_max_skbs &&
+		    (skbs_in_bundle_count > sysctl_tcp_rdb_max_skbs))
+			break;
+
+		byte_count += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = byte_count;
+	*skbs_to_bundle_count = skbs_in_bundle_count;
+	return first_to_bundle;
+}
+
+/**
+ * create_rdb_skb() - Try to create RDB SKB
+ * @sk: the socket
+ * @xmit_skb: The SKB that should be sent
+ * @mss_now: Current MSS
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if no bundling could be
+ *         performed
+ */
+struct sk_buff *create_rdb_skb(const struct sock *sk, struct sk_buff *xmit_skb,
+			       unsigned int mss_now, u32 *bytes_in_rdb_skb,
+			       gfp_t gfp_mask)
+{
+	u32 skb_in_bundle_count;
+	struct sk_buff *first_to_bundle;
+
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb))
+		return NULL;
+
+	/* No bundling on FIN packet */
+	if (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN)
+		return NULL;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_check(sk, xmit_skb, mss_now,
+					       bytes_in_rdb_skb,
+					       &skb_in_bundle_count);
+	if (!first_to_bundle)
+		return NULL;
+
+	/* Create an SKB that contains the data from 'skb_in_bundle_count'
+	 * SKBs.
+	 */
+	return rdb_build_skb(sk, xmit_skb, first_to_bundle,
+			     *bytes_in_rdb_skb, gfp_mask);
+}
+
+/**
+ * tcp_transmit_rdb_skb() - Try to create and send an RDB packet
+ * @sk: the socket
+ * @xmit_skb: The SKB processed for transmission by the output engine
+ * @mss_now: Current MSS
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: 0 if successfully sent packet, else != 0
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *rdb_skb = NULL;
+	u32 bytes_in_rdb_skb = 0; /* May be used for statistical purposes */
+
+	/* How we detect that RDB was used. When equal, no RDB data was sent */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	if (tcp_stream_is_thin_dpifl(tp)) {
+		rdb_skb = create_rdb_skb(sk, xmit_skb, mss_now,
+					 &bytes_in_rdb_skb, gfp_mask);
+		if (!rdb_skb)
+			goto xmit_default;
+
+		/* Set tstamp for SKB in output queue, because tcp_transmit_skb
+		 * will do this for the rdb_skb and not the SKB in the output
+		 * queue (xmit_skb).
+		 */
+		skb_mstamp_get(&xmit_skb->skb_mstamp);
+		rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+		return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+	}
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
@ 2015-10-23 20:50   ` Bendik Rønning Opstad
  0 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-23 20:50 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet
  Cc: Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen

RDB is a mechanism that enables a TCP sender to bundle redundant
(already sent) data with TCP packets containing new data. By bundling
(retransmitting) already sent data with each TCP packet containing new
data, the connection will be more resistant to sporadic packet loss
which reduces the application layer latency significantly in congested
scenarios.

The main functionality added:

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to lack of dupACKs. This is because the loss is "repaired" by the
    redundant data in the packet coming after the lost packet. Based on
    incoming ACKs, such hidden loss events are detected, and CWR state
    is entered.

  o When packets are scheduled for transmission, RDB replaces the SKB to
    be sent with a modified SKB containing the redundant data of
    previously sent data segments from the TCP output queue.

  o RDB will only be used for streams classified as thin by the function
    tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT
    for streams that may benefit from RDB, controlled by the sysctl
    variable tcp_thin_dpifl_itt_lower_bound.

RDB is enabled on a connection with the socket option TCP_RDB, or on all
new connections by setting the sysctl variable tcp_rdb=1.

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  15 ++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  14 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  26 +++
 net/ipv4/tcp.c                         |  16 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_rdb.c                     | 281 +++++++++++++++++++++++++++++++++
 12 files changed, 369 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index b841a76..740e6a3 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -708,6 +708,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - BOOLEAN
+	Enable RDB for all new TCP connections.
+	Default: 0
+
+tcp_rdb_max_bytes - INTEGER
+	Enable restriction on how many bytes an RDB packet can contain.
+	This is the total amount of payload including the new unsent data.
+	Default: 0
+
+tcp_rdb_max_skbs - INTEGER
+	Enable restriction on how many previous SKBs in the output queue
+	RDB may include data from. A value of 1 will restrict bundling to
+	only the data from the last packet that was sent.
+	Default: 1
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 24f4dfd..3572d21 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2809,6 +2809,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index fc885db..f38b889 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -202,9 +202,10 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 3,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
+		rdb         : 1,/* Redundant Data Bundling enabled */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6534836..dce46c2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -276,6 +276,9 @@ extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
+extern int sysctl_tcp_rdb;
+extern int sysctl_tcp_rdb_max_bytes;
+extern int sysctl_tcp_rdb_max_skbs;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -548,6 +551,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 bool tcp_may_send_now(struct sock *sk);
 int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
 int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
@@ -573,6 +578,11 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 
+/* tcp_rdb.c */
+void rdb_ack_event(struct sock *sk, u32 flags);
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask);
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
@@ -771,6 +781,7 @@ struct tcp_skb_cb {
 	union {
 		struct {
 			/* There is space for up to 20 bytes */
+			__u32 rdb_start_seq; /* Start seq of rdb data bundled */
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -1494,6 +1505,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)		\
+	skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+
 static inline struct sk_buff *tcp_send_head(const struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 65a77b0..ae0fba3 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -115,6 +115,7 @@ enum {
 #define TCP_CC_INFO		26	/* Get Congestion Control (optional) info */
 #define TCP_SAVE_SYN		27	/* Record SYN headers for new connections */
 #define TCP_SAVED_SYN		28	/* Get SYN headers recorded for connection */
+#define TCP_RDB			29	/* Enable RDB mechanism */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fab4599..544f8cc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -978,7 +978,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -986,6 +986,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(copy_skb_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index c29809f..f2cf496 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tcp_rdb.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 917fdde..703078f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -718,6 +718,32 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
 	},
 	{
+		.procname	= "tcp_rdb",
+		.data		= &sysctl_tcp_rdb,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_max_bytes",
+		.data		= &sysctl_tcp_rdb_max_bytes,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= &proc_dointvec,
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "tcp_rdb_max_skbs",
+		.data		= &sysctl_tcp_rdb_max_skbs,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f712d7c..11d45d4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -289,6 +289,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
 
 int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
+int sysctl_tcp_rdb __read_mostly;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -409,6 +411,7 @@ void tcp_init_sock(struct sock *sk)
 
 	tp->reordering = sysctl_tcp_reordering;
 	tp->thin_dpifl_itt_lower_bound = sysctl_tcp_thin_dpifl_itt_lower_bound;
+	tp->rdb = sysctl_tcp_rdb;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
@@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_RDB:
+		if (val < 0 || val > 1) {
+			err = -EINVAL;
+		} else {
+			tp->rdb = val;
+			tp->nonagle = val;
+		}
+		break;
+
 	case TCP_REPAIR:
 		if (!tcp_can_repair_sock(sk))
 			err = -EPERM;
@@ -2828,7 +2840,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		val = tp->thin_dupack;
 		break;
-
+	case TCP_RDB:
+		val = tp->rdb;
+		break;
 	case TCP_REPAIR:
 		val = tp->repair;
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fdd88c3..a4901b3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3503,6 +3503,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
 
 	if (icsk->icsk_ca_ops->in_ack_event)
 		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+
+	if (unlikely(tcp_sk(sk)->rdb))
+		rdb_ack_event(sk, flags);
 }
 
 /* This routine deals with incoming acks, but not outgoing ones. */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f6f7f9b..6d4ea7d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -900,8 +900,8 @@ out:
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-			    gfp_t gfp_mask)
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet;
@@ -2113,9 +2113,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+		if (unlikely(tcp_sk(sk)->rdb)) {
+			if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
+				break;
+		} else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
 			break;
-
+		}
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..37faf35
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,281 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+int sysctl_tcp_rdb_max_bytes __read_mostly;
+int sysctl_tcp_rdb_max_skbs __read_mostly = 1;
+
+/**
+ * rdb_check_rtx_queue_loss() - Perform loss detection by analysing acks.
+ * @sk: the socket.
+ * @seq_acked: The sequence number that was acked.
+ *
+ * Return: The number of packets that are presumed to be lost.
+ */
+static int rdb_check_rtx_queue_loss(struct sock *sk, u32 seq_acked)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb, *tmp, *prev_skb = NULL;
+	struct sk_buff *send_head = tcp_send_head(sk);
+	struct tcp_skb_cb *scb;
+	bool fully_acked = true;
+	int lost_count = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == send_head)
+			break;
+
+		scb = TCP_SKB_CB(skb);
+
+		/* Determine how many packets and which bytes were acked; SKBs
+		 * split into multiple segments (TSO/GSO) are not supported.
+		 */
+		if (after(scb->end_seq, tp->snd_una)) {
+			if (tcp_skb_pcount(skb) == 1 ||
+			    !after(tp->snd_una, scb->seq)) {
+				break;
+			}
+
+			/* We do not handle SKBs with gso_segs */
+			if (tcp_skb_pcount(skb))
+				break;
+			fully_acked = false;
+		}
+
+		/* Acks up to this SKB */
+		if (scb->end_seq == seq_acked) {
+			/* This SKB was sent with bundled (RDB) data, and the
+			 * ACK also covers data from previous SKBs.
+			 */
+			if (TCP_SKB_CB(skb)->tx.rdb_start_seq != scb->seq &&
+			    prev_skb) {
+				/* Count the previous SKBs that were only
+				 * acked via the bundled data (presumed lost)
+				 */
+				tcp_for_write_queue(tmp, sk) {
+					/* We have reached the acked SKB */
+					if (tmp == skb)
+						break;
+					lost_count++;
+				}
+			}
+			break;
+		}
+		if (!fully_acked)
+			break;
+		prev_skb = skb;
+	}
+	return lost_count;
+}
+
+/**
+ * rdb_ack_event() - Initiate RDB loss detection on an incoming ACK
+ * @sk: the socket
+ * @flags: ACK event flags (currently unused)
+ */
+void rdb_ack_event(struct sock *sk, u32 flags)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	if (rdb_check_rtx_queue_loss(sk, tp->snd_una))
+		tcp_enter_cwr(sk);
+}
+
+/**
+ * skb_append_data() - Copy data from an SKB to the end of another
+ * @from_skb: The SKB to copy data from
+ * @to_skb: The SKB to copy data to
+ *
+ * Return: 0 on success, -EFAULT if copying the data failed
+ */
+static int skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	/* Copy the linear data and the data from the frags into the linear page
+	 * buffer of to_skb.
+	 */
+	if (WARN_ON(skb_copy_bits(from_skb, 0,
+				  skb_put(to_skb, from_skb->len),
+				  from_skb->len))) {
+		goto fault;
+	}
+
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+	return 0;
+fault:
+	return -EFAULT;
+}
+
+/**
+ * rdb_build_skb() - Builds the new RDB SKB and copies all the data into the
+ *                   linear page buffer.
+ * @sk: the socket
+ * @xmit_skb: This is the SKB that tcp_write_xmit wants to send
+ * @first_skb: The first SKB in the output queue we will bundle
+ * @bytes_in_rdb_skb: The total number of data bytes for the new rdb_skb
+ *                    (new + redundant)
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory allocation
+ *         failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, true);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	tmp_skb = first_skb;
+
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		/* Copy data from tmp_skb to rdb_skb */
+		if (skb_append_data(tmp_skb, rdb_skb))
+			return NULL;
+		/* We are at the last skb that should be included (The unsent
+		 * one)
+		 */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_check() - check if redundant data can be bundled
+ * @sk: the socket
+ * @xmit_skb: The SKB processed for transmission by the output engine
+ * @mss_now: The current mss value
+ * @bytes_in_rdb_skb: Will contain the resulting number of bytes to bundle
+ *                         at exit.
+ * @skbs_to_bundle_count: The total number of SKBs to be in the bundle
+ *
+ * Traverses the output queue backwards from xmit_skb and checks how much
+ * un-acked data can be bundled.
+ *
+ * Return: The first SKB to include in the bundle, or NULL if no bundling is
+ *         possible
+ */
+static struct sk_buff *rdb_can_bundle_check(const struct sock *sk,
+					    struct sk_buff *xmit_skb,
+					    unsigned int mss_now,
+					    u32 *bytes_in_rdb_skb,
+					    u32 *skbs_to_bundle_count)
+{
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* 1 to account for current skb */
+	u32 byte_count = xmit_skb->len;
+
+	/* We start at the SKB before xmit_skb, and go backwards in the list. */
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		/* Not enough room to bundle data from this SKB */
+		if ((byte_count + skb->len) > mss_now)
+			break;
+
+		if (sysctl_tcp_rdb_max_bytes &&
+		    ((byte_count + skb->len) > sysctl_tcp_rdb_max_bytes))
+			break;
+
+		if (sysctl_tcp_rdb_max_skbs &&
+		    (skbs_in_bundle_count > sysctl_tcp_rdb_max_skbs))
+			break;
+
+		byte_count += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = byte_count;
+	*skbs_to_bundle_count = skbs_in_bundle_count;
+	return first_to_bundle;
+}
+
+/**
+ * create_rdb_skb() - Try to create an RDB SKB
+ * @sk: the socket
+ * @xmit_skb: The SKB that should be sent
+ * @mss_now: Current MSS
+ * @bytes_in_rdb_skb: Will contain the number of bytes in the new RDB SKB
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if no bundling could be
+ *         performed
+ */
+struct sk_buff *create_rdb_skb(const struct sock *sk, struct sk_buff *xmit_skb,
+			       unsigned int mss_now, u32 *bytes_in_rdb_skb,
+			       gfp_t gfp_mask)
+{
+	u32 skb_in_bundle_count;
+	struct sk_buff *first_to_bundle;
+
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb))
+		return NULL;
+
+	/* No bundling on FIN packet */
+	if (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN)
+		return NULL;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_check(sk, xmit_skb, mss_now,
+					       bytes_in_rdb_skb,
+					       &skb_in_bundle_count);
+	if (!first_to_bundle)
+		return NULL;
+
+	/* Create an SKB that contains the data from 'skb_in_bundle_count'
+	 * SKBs.
+	 */
+	return rdb_build_skb(sk, xmit_skb, first_to_bundle,
+			     *bytes_in_rdb_skb, gfp_mask);
+}
+
+/**
+ * tcp_transmit_rdb_skb() - Try to create and send an RDB packet
+ * @sk: the socket
+ * @xmit_skb: The SKB processed for transmission by the output engine
+ * @mss_now: Current MSS
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: 0 if the packet was sent successfully, otherwise non-zero
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *rdb_skb = NULL;
+	u32 bytes_in_rdb_skb = 0; /* May be used for statistical purposes */
+
+	/* Detects whether RDB was used: if rdb_start_seq == seq, no
+	 * redundant data was bundled in this packet.
+	 */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	if (tcp_stream_is_thin_dpifl(tp)) {
+		rdb_skb = create_rdb_skb(sk, xmit_skb, mss_now,
+					 &bytes_in_rdb_skb, gfp_mask);
+		if (!rdb_skb)
+			goto xmit_default;
+
+		/* Set tstamp for SKB in output queue, because tcp_transmit_skb
+		 * will do this for the rdb_skb and not the SKB in the output
+		 * queue (xmit_skb).
+		 */
+		skb_mstamp_get(&xmit_skb->skb_mstamp);
+		rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+		return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+	}
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50   ` Bendik Rønning Opstad
@ 2015-10-23 21:44     ` Eric Dumazet
  -1 siblings, 0 replies; 81+ messages in thread
From: Eric Dumazet @ 2015-10-23 21:44 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Fri, 2015-10-23 at 22:50 +0200, Bendik Rønning Opstad wrote:

>  
> +/**
> + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
> + *                              limit
> + * @tp: the tcp_sock struct
> + *
> + * Return: true if current packets in flight (PIF) count is lower than
> + *         the dynamic PIF limit, else false
> + */
> +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
> +{
> +	u64 dpif_lim = tp->srtt_us >> 3;
> +	/* Div by is_thin_min_itt_lim, the minimum allowed ITT
> +	 * (Inter-transmission time) in usecs.
> +	 */
> +	do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound);
> +	return tcp_packets_in_flight(tp) < dpif_lim;
> +}
> +
This is very strange :

You are using a do_div() while both operands are 32bits.  A regular
divide would be ok :

u32 dpif_lim = (tp->srtt_us >> 3) / tp->thin_dpifl_itt_lower_bound;

But then, you can avoid the divide by using a multiply, less expensive :

return	(u64)tcp_packets_in_flight(tp) * tp->thin_dpifl_itt_lower_bound <
	(tp->srtt_us >> 3);
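
For reference, the whole helper rewritten along those lines would look
roughly like this (an untested sketch, assuming the same tcp_sock fields
as in the quoted patch; the (u64) cast keeps the multiply from
overflowing 32 bits):

static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
{
	/* Thin iff packets in flight stay below srtt / itt_lower_bound,
	 * expressed as a multiply to avoid the divide entirely.
	 */
	return (u64)tcp_packets_in_flight(tp) * tp->thin_dpifl_itt_lower_bound <
	       (tp->srtt_us >> 3);
}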



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
  (?)
@ 2015-10-24  6:11   ` Yuchung Cheng
  -1 siblings, 0 replies; 81+ messages in thread
From: Yuchung Cheng @ 2015-10-24  6:11 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Neal Cardwell, Tom Herbert, Paolo Abeni,
	Erik Kline, Hannes Frederic Sowa, Al Viro, Jiri Pirko,
	Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Fri, Oct 23, 2015 at 1:50 PM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>
> This is a request for comments.
>
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
> Latency-sensitive applications or services, such as online games and
> remote desktop, produce traffic with thin-stream characteristics,
> characterized by small packets and a relatively high ITT. By bundling
> already sent data in packets with new data, RDB alleviates head-of-line
> blocking by reducing the need to retransmit data segments when packets
> are lost. RDB is a continuation on the work on latency improvements for
> TCP in Linux, previously resulting in two thin-stream mechanisms in the
> Linux kernel
> (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).
>
> The RDB implementation has been thoroughly tested, and shows
> significant latency reductions when packet loss occurs[1]. The tests
> show that, by imposing restrictions on the bundling rate, it can be made
> not to negatively affect competing traffic in an unfair manner.
>
> Note: Current patch set depends on a recently submitted patch for
> tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674)
>
> These patches have been tested with as set of packetdrill scripts located at
> https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
> (The tests require patching packetdrill with a new socket option:
> https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)
>
> Detailed info about the RDB mechanism can be found at
> http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper

What's the difference between RDB and TCP repacketization
(http://flylib.com/books/en/3.223.1.226/1/)?

Reading the blog page, I am concerned about the amount of change
(especially on the fast path) just to bundle new writes during timeout &
retransmit, for a specific type of application. Why not just send X
packets with total bytes < MSS on timeout?

> "Latency and Fairness Trade-Off for Thin Streams using Redundant Data
> Bundling in TCP"[2].
>
> [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
> [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf
>
>
> Bendik Rønning Opstad (2):
>   tcp: Add DPIFL thin stream detection mechanism
>   tcp: Add Redundant Data Bundling (RDB)
>
>  Documentation/networking/ip-sysctl.txt |  23 +++
>  include/linux/skbuff.h                 |   1 +
>  include/linux/tcp.h                    |   9 +-
>  include/net/tcp.h                      |  34 ++++
>  include/uapi/linux/tcp.h               |   1 +
>  net/core/skbuff.c                      |   3 +-
>  net/ipv4/Makefile                      |   3 +-
>  net/ipv4/sysctl_net_ipv4.c             |  35 ++++
>  net/ipv4/tcp.c                         |  19 ++-
>  net/ipv4/tcp_input.c                   |   3 +
>  net/ipv4/tcp_output.c                  |  11 +-
>  net/ipv4/tcp_rdb.c                     | 281 +++++++++++++++++++++++++++++++++
>  12 files changed, 415 insertions(+), 8 deletions(-)
>  create mode 100644 net/ipv4/tcp_rdb.c
>
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-24  6:11   ` Yuchung Cheng
@ 2015-10-24  8:00     ` Jonas Markussen
  -1 siblings, 0 replies; 81+ messages in thread
From: Jonas Markussen @ 2015-10-24  8:00 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Bendik Rønning Opstad, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Jonathan Corbet, Eric Dumazet, Neal Cardwell, Tom Herbert,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad




> On 24 Oct 2015, at 08:11, Yuchung Cheng <ycheng@google.com> wrote:
> 
> On Fri, Oct 23, 2015 at 1:50 PM, Bendik Rønning Opstad
> <bro.devel@gmail.com> wrote:
>> 
>> This is a request for comments.
>> 
>> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
>> the latency for applications sending time-dependent data.
>> Latency-sensitive applications or services, such as online games and
>> remote desktop, produce traffic with thin-stream characteristics,
>> characterized by small packets and a relatively high ITT. By bundling
>> already sent data in packets with new data, RDB alleviates head-of-line
>> blocking by reducing the need to retransmit data segments when packets
>> are lost. RDB is a continuation on the work on latency improvements for
>> TCP in Linux, previously resulting in two thin-stream mechanisms in the
>> Linux kernel
>> (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).
>> 
>> The RDB implementation has been thoroughly tested, and shows
>> significant latency reductions when packet loss occurs[1]. The tests
>> show that, by imposing restrictions on the bundling rate, it can be made
>> not to negatively affect competing traffic in an unfair manner.
>> 
>> Note: Current patch set depends on a recently submitted patch for
>> tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674)
>> 
>> These patches have been tested with as set of packetdrill scripts located at
>> https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
>> (The tests require patching packetdrill with a new socket option:
>> https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)
>> 
>> Detailed info about the RDB mechanism can be found at
>> http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper
> 
> What's the difference between RDB and TCP repacketization
> (http://flylib.com/books/en/3.223.1.226/1/) ?
> 
> Reading the blog page, I am concerned the amount of
> change (esp on fast path) just to bundle new writes during timeout &
> retransmit, for a specific type of application? why not just send X
> packets with total bytes < MSS on timeout..

Repacketization only applies to retransmissions; RDB instead bundles previously sent segments with the next “normal” transmission.

This lets the flow recover a lost segment before a retransmission is triggered by an RTO or fast retransmit.
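
A simplified example (not from the patch itself): suppose an application
writes a 100-byte message every 30 ms on a path with a 100 ms RTT, and the
packet carrying bytes 0-99 is lost. With plain TCP (with or without
repacketization) the receiver only gets those bytes after a fast retransmit
or an RTO. With RDB, the next regular transmission roughly 30 ms later
carries bytes 0-199 in a single segment, so the receiver recovers the lost
data without any retransmission being triggered.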

>> "Latency and Fairness Trade-Off for Thin Streams using Redundant Data
>> Bundling in TCP"[2].
>> 
>> [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
>> [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf
>> 
>> 
>> Bendik Rønning Opstad (2):
>>  tcp: Add DPIFL thin stream detection mechanism
>>  tcp: Add Redundant Data Bundling (RDB)
>> 
>> Documentation/networking/ip-sysctl.txt |  23 +++
>> include/linux/skbuff.h                 |   1 +
>> include/linux/tcp.h                    |   9 +-
>> include/net/tcp.h                      |  34 ++++
>> include/uapi/linux/tcp.h               |   1 +
>> net/core/skbuff.c                      |   3 +-
>> net/ipv4/Makefile                      |   3 +-
>> net/ipv4/sysctl_net_ipv4.c             |  35 ++++
>> net/ipv4/tcp.c                         |  19 ++-
>> net/ipv4/tcp_input.c                   |   3 +
>> net/ipv4/tcp_output.c                  |  11 +-
>> net/ipv4/tcp_rdb.c                     | 281 +++++++++++++++++++++++++++++++++
>> 12 files changed, 415 insertions(+), 8 deletions(-)
>> create mode 100644 net/ipv4/tcp_rdb.c
>> 
>> --
>> 1.9.1


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
@ 2015-10-24 12:57       ` Eric Dumazet
  0 siblings, 0 replies; 81+ messages in thread
From: Eric Dumazet @ 2015-10-24 12:57 UTC (permalink / raw)
  To: Jonas Markussen
  Cc: Yuchung Cheng, Bendik Rønning Opstad, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Jonathan Corbet, Eric Dumazet, Neal Cardwell,
	Tom Herbert, Paolo Abeni, Erik Kline, Hannes Frederic Sowa,
	Al Viro, Jiri Pirko, Alexander Duyck, Florian Westphal,
	Daniel Lee, Marcelo Ricardo Leitner, Daniel Borkmann,
	Willem de Bruijn, Linus Lüssing, linux-doc, linux-kernel,
	netdev, linux-api, Andreas Petlund, Carsten Griwodz,
	Pål Halvorsen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Sat, 2015-10-24 at 08:00 +0000, Jonas Markussen wrote:

> Repacketization is only on retransmissions; RDB bundles previously sent segments with the next “normal” transmission instead. 
> 
> This makes the flow recover the lost segment  before a retransmission is triggered by an RTO or fast retransmit.

Thank you for this very high quality patch submission.

Please give us a few days for proper evaluation.

Thanks !



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
@ 2015-10-25  5:56       ` Bendik Rønning Opstad
  0 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-25  5:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On Friday, October 23, 2015 02:44:14 PM Eric Dumazet wrote:
> On Fri, 2015-10-23 at 22:50 +0200, Bendik Rønning Opstad wrote:
> 
> >  
> > +/**
> > + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
> > + *                              limit
> > + * @tp: the tcp_sock struct
> > + *
> > + * Return: true if current packets in flight (PIF) count is lower than
> > + *         the dynamic PIF limit, else false
> > + */
> > +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
> > +{
> > +	u64 dpif_lim = tp->srtt_us >> 3;
> > +	/* Div by is_thin_min_itt_lim, the minimum allowed ITT
> > +	 * (Inter-transmission time) in usecs.
> > +	 */
> > +	do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound);
> > +	return tcp_packets_in_flight(tp) < dpif_lim;
> > +}
> > +
> This is very strange :
> 
> You are using a do_div() while both operands are 32bits.  A regular
> divide would be ok :
> 
> u32 dpif_lim = (tp->srtt_us >> 3) / tp->thin_dpifl_itt_lower_bound;
> 
> But then, you can avoid the divide by using a multiply, less expensive :
> 
> return	(u64)tcp_packets_in_flight(tp) * tp->thin_dpifl_itt_lower_bound <
> 	(tp->srtt_us >> 3);
> 

You are of course correct. Will fix this and use multiply. Thanks.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50   ` Bendik Rønning Opstad
@ 2015-10-26 14:50     ` Neal Cardwell
  -1 siblings, 0 replies; 81+ messages in thread
From: Neal Cardwell @ 2015-10-26 14:50 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Tom Herbert, Yuchung Cheng, Paolo Abeni,
	Erik Kline, Hannes Frederic Sowa, Al Viro, Jiri Pirko,
	Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>@@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
...
> +       case TCP_RDB:
> +               if (val < 0 || val > 1) {
> +                       err = -EINVAL;
> +               } else {
> +                       tp->rdb = val;
> +                       tp->nonagle = val;

The semantics of the tp->nonagle bits are already a bit complex. My
sense is that having a setsockopt of TCP_RDB transparently modify the
nagle behavior is going to add more complexity and unanticipated
behavior than is warranted given the slight possible gain in
convenience to the app writer. What about a model where the
application user just needs to remember to call
setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
sensible? I see your nice tests at

   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b

are already doing that. And my sense is that likewise most
well-engineered "thin stream" apps will already be using
setsockopt(TCP_NODELAY). Is that workable?
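
For completeness, a userspace sketch of that model (assuming the TCP_RDB
constant from the patched include/uapi/linux/tcp.h is visible to the
application; the helper name is just for illustration and error handling
is trimmed):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int enable_thin_stream_opts(int fd)
{
	int one = 1;

	/* The app opts in to both knobs explicitly; RDB would no longer
	 * touch tp->nonagle behind its back.
	 */
	if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
		return -1;
	return setsockopt(fd, IPPROTO_TCP, TCP_RDB, &one, sizeof(one));
}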

neal

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-26 14:50     ` Neal Cardwell
@ 2015-10-26 21:35       ` Andreas Petlund
  -1 siblings, 0 replies; 81+ messages in thread
From: Andreas Petlund @ 2015-10-26 21:35 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Bendik Rønning Opstad, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Jonathan Corbet, Eric Dumazet, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad


> On 26 Oct 2015, at 15:50, Neal Cardwell <ncardwell@google.com> wrote:
> 
> On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
> <bro.devel@gmail.com> wrote:
>> @@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> ...
>> +       case TCP_RDB:
>> +               if (val < 0 || val > 1) {
>> +                       err = -EINVAL;
>> +               } else {
>> +                       tp->rdb = val;
>> +                       tp->nonagle = val;
> 
> The semantics of the tp->nonagle bits are already a bit complex. My
> sense is that having a setsockopt of TCP_RDB transparently modify the
> nagle behavior is going to add more extra complexity and unanticipated
> behavior than is warranted given the slight possible gain in
> convenience to the app writer. What about a model where the
> application user just needs to remember to call
> setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> sensible? I see your nice tests at
> 
>   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b
> 
> are already doing that. And my sense is that likewise most
> well-engineered "thin stream" apps will already be using
> setsockopt(TCP_NODELAY). Is that workable?

We have been discussing this a bit back and forth. Your suggestion would be the right thing to keep the nagle semantics less complex and to educate developers in the intrinsics of the transport.

We ended up choosing to implicitly disable nagle since it
1) is incompatible with the logic of RDB.
2) leaving it up to the developer to read the documentation and notice the line saying that "failing to set TCP_NODELAY will void the RDB latency gain" would increase the chance of misconfigurations leading to deployments with no effect.

The hope was to help both the well-engineered thin-stream apps and the ones deployed by developers with less detailed knowledge of the transport.

-Andreas


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-26 21:35       ` Andreas Petlund
@ 2015-10-26 21:58         ` Yuchung Cheng
  -1 siblings, 0 replies; 81+ messages in thread
From: Yuchung Cheng @ 2015-10-26 21:58 UTC (permalink / raw)
  To: Andreas Petlund
  Cc: Neal Cardwell, Bendik Rønning Opstad, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Jonathan Corbet, Eric Dumazet, Tom Herbert,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Mon, Oct 26, 2015 at 2:35 PM, Andreas Petlund <apetlund@simula.no> wrote:
>
>
> > On 26 Oct 2015, at 15:50, Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
> > <bro.devel@gmail.com> wrote:
> >> @@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> > ...
> >> +       case TCP_RDB:
> >> +               if (val < 0 || val > 1) {
> >> +                       err = -EINVAL;
> >> +               } else {
> >> +                       tp->rdb = val;
> >> +                       tp->nonagle = val;
> >
> > The semantics of the tp->nonagle bits are already a bit complex. My
> > sense is that having a setsockopt of TCP_RDB transparently modify the
> > nagle behavior is going to add more extra complexity and unanticipated
> > behavior than is warranted given the slight possible gain in
> > convenience to the app writer. What about a model where the
> > application user just needs to remember to call
> > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> > sensible? I see your nice tests at
> >
> >   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b
> >
> > are already doing that. And my sense is that likewise most
> > well-engineered "thin stream" apps will already be using
> > setsockopt(TCP_NODELAY). Is that workable?
>
> We have been discussing this a bit back and forth. Your suggestion would be the right thing to keep the nagle semantics less complex and to educate developers in the intrinsics of the transport.
>
> We ended up choosing to implicitly disable nagle since it
> 1) is incompatible with the logic of RDB.
> 2) leaving it up to the developer to read the documentation and register the line saying that "failing to set TCP_NODELAY will void the RDB latency gain" will increase the chance of misconfigurations leading to deployment with no effect.
>
> The hope was to help both the well-engineered thin-stream apps and the ones deployed by developers with less detailed knowledge of the transport.
but would RDB be voided if this developer turns on RDB then turns on
Nagle later?

>
> -Andreas
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-26 21:58         ` Yuchung Cheng
@ 2015-10-27 19:15           ` Jonas Markussen
  -1 siblings, 0 replies; 81+ messages in thread
From: Jonas Markussen @ 2015-10-27 19:15 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Andreas Petlund, Neal Cardwell, Bendik Rønning Opstad,
	David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Tom Herbert, Paolo Abeni, Erik Kline,
	Hannes Frederic Sowa, Al Viro, Jiri Pirko, Alexander Duyck,
	Florian Westphal, Daniel Lee, Marcelo Ricardo Leitner,
	Daniel Borkmann, Willem de Bruijn, Linus Lüssing, linux-doc,
	LKML, Netdev, linux-api, Carsten Griwodz, Pål Halvorsen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On 26 Oct 2015, at 22:58, Yuchung Cheng <ycheng@google.com> wrote:
> but would RDB be voided if this developer turns on RDB then turns on
> Nagle later?

The short answer is "kind of".

My understanding is that Nagle will delay segments until they're
either MSS-sized or until segments "down the pipe" are acknowledged.

As RDB isn't able to bundle if the payload is more than MSS/2, only
an application that sends data less frequently than once per RTT would
still theoretically benefit from RDB even if Nagle is on.

However, in my opinion this is a scenario where Nagle itself is void:

If you transmit less often than once per RTT, enabling Nagle makes no
difference.

If you transmit more often than once per RTT, enabling Nagle makes
RDB void.
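
A rough numeric illustration of the MSS/2 point (assuming a typical
1448-byte MSS and ignoring the tcp_rdb_max_bytes/tcp_rdb_max_skbs limits):
with 700-byte writes the previous unacked segment can still be bundled
(700 + 700 = 1400 <= 1448), while with 800-byte writes nothing can be
bundled, since 800 + 800 = 1600 already exceeds the MSS.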

-Jonas

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-26 21:58         ` Yuchung Cheng
@ 2015-10-29 22:53           ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-29 22:53 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Andreas Petlund, Neal Cardwell, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Jonathan Corbet, Eric Dumazet, Tom Herbert,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

On Monday, October 26, 2015 02:58:03 PM Yuchung Cheng wrote:
> On Mon, Oct 26, 2015 at 2:35 PM, Andreas Petlund <apetlund@simula.no> wrote:
> > > On 26 Oct 2015, at 15:50, Neal Cardwell <ncardwell@google.com> wrote:
> > > 
> > > On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
> > > 
> > > <bro.devel@gmail.com> wrote:
> > >> @@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk,
> > >> int level,> > 
> > > ...
> > > 
> > >> +       case TCP_RDB:
> > >> +               if (val < 0 || val > 1) {
> > >> +                       err = -EINVAL;
> > >> +               } else {
> > >> +                       tp->rdb = val;
> > >> +                       tp->nonagle = val;
> > > 
> > > The semantics of the tp->nonagle bits are already a bit complex. My
> > > sense is that having a setsockopt of TCP_RDB transparently modify the
> > > nagle behavior is going to add more extra complexity and unanticipated
> > > behavior than is warranted given the slight possible gain in
> > > convenience to the app writer. What about a model where the
> > > application user just needs to remember to call
> > > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> > > sensible? I see your nice tests at
> > > 
> > >   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b
> > >   7d8baf703b2c2ac1b> > 
> > > are already doing that. And my sense is that likewise most
> > > well-engineered "thin stream" apps will already be using
> > > setsockopt(TCP_NODELAY). Is that workable?

This is definitely workable. I agree that it may not be an ideal solution to
have TCP_RDB disable Nagle; however, an easy way to enable RDB and disable
Nagle at the same time would be useful.

> > We have been discussing this a bit back and forth. Your suggestion would
> > be the right thing to keep the nagle semantics less complex and to
> > educate developers in the intrinsics of the transport.
> > 
> > We ended up choosing to implicitly disable nagle since it
> > 1) is incompatible with the logic of RDB.
> > 2) leaving it up to the developer to read the documentation and register
> > the line saying that "failing to set TCP_NODELAY will void the RDB
> > latency gain" will increase the chance of misconfigurations leading to
> > deployment with no effect.
> > 
> > The hope was to help both the well-engineered thin-stream apps and the
> > ones deployed by developers with less detailed knowledge of the
> > transport.
> but would RDB be voided if this developer turns on RDB then turns on
> Nagle later?

It would (to a large degree), but I believe that's ok? The intention with also
disabling Nagle is not to remove control from the application writer, so if
TCP_RDB disables Nagle, they should not be prevented from explicitly enabling
Nagle after enabling RDB.

The idea is to make it as easy as possible for the application writer, and
since Nagle is on by default, it makes sense to change this behavior when the
application has indicated that it values low latencies.

Would a solution with multiple option values to TCP_RDB be acceptable? E.g.
0 = Disable
1 = Enable RDB
2 = Enable RDB and disable Nagle

If the sysctl tcp_rdb accepts the same values, setting the sysctl to 2 would
allow using and testing RDB (with Nagle off) on applications that haven't
explicitly disabled Nagle, which would make the sysctl tcp_rdb even more useful.
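
A minimal sketch of how such a multi-value TCP_RDB option could be
applied (hypothetical, following the value semantics proposed above;
not what the current patch does):

	/* Hypothetical: 0 = disable, 1 = enable RDB, 2 = enable RDB and
	 * disable Nagle. */
	static int tcp_set_rdb(struct tcp_sock *tp, int val)
	{
		if (val < 0 || val > 2)
			return -EINVAL;
		tp->rdb = (val > 0);
		if (val == 2)
			tp->nonagle |= TCP_NAGLE_OFF;
		return 0;
	}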

Instead of having TCP_RDB modify Nagle, would it be better/acceptable to have a
separate socket option (e.g. TCP_THIN/TCP_THIN_LOW_LATENCY) that enables RDB and
disables Nagle? e.g.
0 = Use default system options?
1 = Enable RDB and disable Nagle

This would separate the modification of Nagle from the TCP_RDB socket option and
make it cleaner?

Such an option could also enable other latency-reducing options like
TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK:
2 = Enable RDB, TCP_THIN_LINEAR_TIMEOUTS, TCP_THIN_DUPACK, and disable Nagle

Bendik


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-29 22:53           ` Bendik Rønning Opstad
@ 2015-11-02  9:18             ` David Laight
  -1 siblings, 0 replies; 81+ messages in thread
From: David Laight @ 2015-11-02  9:18 UTC (permalink / raw)
  To: 'bro.devel+kernel@gmail.com', Yuchung Cheng
  Cc: Andreas Petlund, Neal Cardwell, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Jonathan Corbet, Eric Dumazet, Tom Herbert,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

From: Bendik Rønning Opstad
> Sent: 29 October 2015 22:54
...
> > > > The semantics of the tp->nonagle bits are already a bit complex. My
> > > > sense is that having a setsockopt of TCP_RDB transparently modify the
> > > > nagle behavior is going to add more extra complexity and unanticipated
> > > > behavior than is warranted given the slight possible gain in
> > > > convenience to the app writer. What about a model where the
> > > > application user just needs to remember to call
> > > > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> > > > sensible? I see your nice tests at
> > > >
> > > >   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b
> > > >   7d8baf703b2c2ac1b> >
> > > > are already doing that. And my sense is that likewise most
> > > > well-engineered "thin stream" apps will already be using
> > > > setsockopt(TCP_NODELAY). Is that workable?
> 
> This is definitely workable. I agree that it may not be an ideal solution to
> have TCP_RDB disable Nagle, however, it would be useful with a way to easily
> enable RDB and disable Nagle.

If enabling RDB disables Nagle, then what happens when you turn RDB back off?

	David


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50   ` Bendik Rønning Opstad
@ 2015-11-02  9:37     ` David Laight
  -1 siblings, 0 replies; 81+ messages in thread
From: David Laight @ 2015-11-02  9:37 UTC (permalink / raw)
  To: 'Bendik Rønning Opstad',
	David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet
  Cc: Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

From: Bendik Rønning Opstad
> Sent: 23 October 2015 21:50
> RDB is a mechanism that enables a TCP sender to bundle redundant
> (already sent) data with TCP packets containing new data. By bundling
> (retransmitting) already sent data with each TCP packet containing new
> data, the connection will be more resistant to sporadic packet loss
> which reduces the application layer latency significantly in congested
> scenarios.

What sort of traffic flows do you expect this to help?

An ssh (or similar) connection will get additional data to send,
but that sort of data flow needs Nagle in order to reduce the
number of packets sent.
OTOH it might benefit from including unacked data if the Nagle
timer expires.
Being able to set the Nagle timer on a per-connection basis
(or maybe using something based on the RTT instead of 2 secs)
might make packet loss less problematic.

Data flows that already have Nagle disabled (probably anything that
isn't command-response and isn't unidirectional bulk data) are
likely to generate a lot of packets within the RTT.
Resending unacked data will just eat into available network bandwidth
and could easily make any congestion worse.

I think that means you shouldn't resend data more than once, and/or
should make sure that the resent data isn't a significant overhead
on the packet being sent.

	David


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-11-02  9:37     ` David Laight
  (?)
@ 2015-11-05  2:06       ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-11-05  2:06 UTC (permalink / raw)
  To: David Laight
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Monday, November 02, 2015 09:37:54 AM David Laight wrote:
> From: Bendik Rønning Opstad
> > Sent: 23 October 2015 21:50
> > RDB is a mechanism that enables a TCP sender to bundle redundant
> > (already sent) data with TCP packets containing new data. By bundling
> > (retransmitting) already sent data with each TCP packet containing new
> > data, the connection will be more resistant to sporadic packet loss
> > which reduces the application layer latency significantly in congested
> > scenarios.
> 
> What sort of traffic flows do you expect this to help?

As mentioned in the cover letter, RDB is aimed at reducing the
latencies for "thin-stream" traffic often produced by
latency-sensitive applications. This blog post describes RDB and the
underlying motivation:
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp

Further information is available in the links referred to in the blog
post.

> An ssh (or similar) connection will get additional data to send,
> but that sort of data flow needs Nagle in order to reduce the
> number of packets sent.

Whether an application needs to reduce the number of packets sent
depends on the perspective of who you ask. If low latency is of high
priority for the application, it may need to increase the number of
packets sent by disabling Nagle to reduce the segments' sojourn times
on the sender side.

As for SSH clients, it seems OpenSSH disables Nagle for interactive
sessions.

> OTOH it might benefit from including unacked data if the Nagle
> timer expires.
> Being able to set the Nagle timer on a per-connection basis
> (or maybe using something based on the RTT instead of 2 secs)
> might make packet loss less problematic.

There is no timer for Nagle? The current (Minshall variant)
implementation restricts sending a small segment as long as the
previously transmitted packet was small and is not yet ACKed.
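
For reference, the Minshall test in net/ipv4/tcp_output.c is roughly
(paraphrased from the kernel source):

	static bool tcp_minshall_check(const struct tcp_sock *tp)
	{
		/* A previously sent small segment (ending at snd_sml) is
		 * still unacknowledged and within the data sent so far. */
		return after(tp->snd_sml, tp->snd_una) &&
		       !after(tp->snd_sml, tp->snd_nxt);
	}

No timer is involved; the check is purely ACK-driven.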

> Data flows that already have Nagle disabled (probably anything that
> isn't command-response and isn't unidirectional bulk data) are
> likely to generate a lot of packets within the RTT.

How many packets such applications need to transmit for optimal
latency varies to a great extent. Packets per RTT is not a very useful
metric in this regard, considering the strict dependency on the RTT.

This is why we propose a dynamic packets in flight limit (DPIFL) that
indirectly relies on the application write frequency, i.e. how often
the application performs write system calls. This limit is used to
ensure that only applications that write data less frequently than a
certain limit may utilize RDB.
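
As a concrete illustration of the DPIFL test used in the patch
(PIF * itt_lower_bound < sRTT): with an sRTT of 100 ms and the default
10 ms lower bound, a stream is classified as thin only while it has
fewer than 100 / 10 = 10 packets in flight; on a 20 ms RTT path the
limit drops to 2 packets in flight.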

> Resending unacked data will just eat into available network bandwidth
> and could easily make any congestion worse.
>
> I think that means you shouldn't resend data more than once, and/or
> should make sure that the resent data isn't a significant overhead
> on the packet being sent.

It is important to remember what type of traffic flows we are
discussing. The applications that RDB aims to help typically produce
application-limited flows that transmit small amounts of data, both in
terms of payload per packet and packets per second.

Analysis of traces from latency-sensitive applications producing
traffic with thin-stream characteristics show inter-transmission times
ranging from a few ms (typically 20-30 ms on average) to many hundred
ms.
(http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp/#thin_streams)

Increasing the amount of transmitted data will certainly contribute to
congestion to some degree, but it is not (necessarily) an unreasonable
trade-off considering the relatively small amounts of data such
applications transmit compared to greedy flows.

RDB does not cause more packets to be sent through the network, as it
uses available "free" space in packets already scheduled for
transmission. With a bundling limitation of only one previous segment,
the payload bandwidth requirement is at most doubled; accounting for
headers, the total increase is less.
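
To put rough numbers on it (illustrative, assuming 120-byte application
writes over IPv4 and ignoring TCP options and link-layer overhead): a
regular thin stream sends 40 + 120 = 160 bytes per packet on the wire,
while an RDB packet bundling one previous segment sends 40 + 240 = 280
bytes, i.e. roughly a 1.75x increase rather than a full doubling.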

By increasing the BW requirement for an application that produces
relatively little data, we still end up with a low BW requirement.
The suggested minimum lower bound inter-transmission time is 10 ms,
meaning that when an application writes data more frequently than
every 10 ms (on average) it will not be allowed to utilize RDB.

To what degree RDB affects competing traffic will of course depend on
the link capacity and the number of simultaneous flows utilizing RDB.
We have performed tests to assess how RDB affects competing traffic. In
one of the test scenarios, 10 RDB-enabled thin streams and 10 regular
TCP thin streams compete against 5 greedy TCP flows over a shared
bottleneck limited to 5 Mbit/s. The results from this test show that by
only bundling one previous segment with each packet (segment size: 120
bytes), the effect on the competing thin-stream traffic is modest.
(http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp/#latency_test_with_cross_traffic).

Also relevant to the discussion is the paper "Reducing web latency:
the virtue of gentle aggression" (2013), and one of the presented
mechanisms (called Proactive) which applies redundancy by transmitting
every packet twice. While doubling the bandwidth requirements when
using Proactive, their measurements show negligible effect on the
baseline traffic because, as they explain, the traffic utilizing the
mechanism (Web service traffic in their case) is only a small amount
of the total traffic passing through their servers.

While RDB and the Proactive mechanism have slightly different
approaches, they aim at solving the same basic problem: the increased
latencies caused by the need for normal retransmissions. By
proactively (re)transmitting redundant data they are able to avoid the
need for normal retransmissions to a great extent, which reduces
application layer latency by alleviating head-of-line blocking on the
receiver.

An important property of RDB is that by only using packets already
scheduled for transmission, a limit is naturally imposed when severe
congestion occurs. As soon as loss is detected, resulting in a
reduction of the CWND (i.e. becomes network limited), new data from
the application will be appended to the SKB in the output queue
containing the newest (unsent) data. Depending on the rate at which the
application produces data and the level of congestion (the size of the
CWND), the new data from the application will eventually fill up the
SKBs such that skb->len >= MSS. The result is that there is no "free"
space available to bundle redundant data, effectively disabling RDB
and enforcing a behavior equal to regular TCP.
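
A minimal sketch of that natural limit (illustrative, not taken
verbatim from the patch):

	/* Illustrative only: redundant data can only be bundled into the
	 * free space left in the outgoing skb. Once new application data
	 * fills the skb up to the MSS, nothing can be bundled and the
	 * packet is sent exactly as regular TCP would send it. */
	static unsigned int rdb_free_space(const struct sk_buff *xmit_skb,
					   unsigned int mss_now)
	{
		return xmit_skb->len >= mss_now ? 0 : mss_now - xmit_skb->len;
	}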

Bendik


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-24 12:57       ` Eric Dumazet
  (?)
@ 2015-11-09 19:40       ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-11-09 19:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonas Markussen, Yuchung Cheng, Bendik Rønning Opstad,
	David S. Miller, Eric Dumazet, Neal Cardwell, netdev,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Kristian Evensen, Kenneth Klette Jonassen

On 24/10/15 14:57, Eric Dumazet wrote:
> Thank you for this very high quality patch submission.
>
> Please give us a few days for proper evaluation.
>
> Thanks !

Guys, thank you very much for taking the time to evaluate this.

Since there hasn't been any more feedback or comments, I'll submit an
RFCv2 with a few changes, which include removing the Nagle
modification.

After discussing the Nagle change on setsockopt we realize that it
should be evaluated more thoroughly, and is better left for a later
patch submission.

Bendik

P.S.
Trimming the CC list to only those who have responded as gmail says
I'm spamming :-)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH RFC v2 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (3 preceding siblings ...)
  (?)
@ 2015-11-23 16:26 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-11-23 16:26 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad


This is a request for comments.

Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.
Latency-sensitive applications or services, such as online games and
remote desktop, produce traffic with thin-stream characteristics,
characterized by small packets and a relatively high ITT. By bundling
already sent data in packets with new data, RDB alleviates head-of-line
blocking by reducing the need to retransmit data segments when packets
are lost. RDB is a continuation of the work on latency improvements for
TCP in Linux, previously resulting in two thin-stream mechanisms in the
Linux kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows
significant latency reductions when packet loss occurs[1]. The tests
show that, by imposing restrictions on the bundling rate, it can be made
not to negatively affect competing traffic in an unfair manner.

Note: Current patch set depends on a recently submitted patch for
tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674)

These patches have been tested with a set of packetdrill scripts located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper
"Latency and Fairness Trade-Off for Thin Streams using Redundant Data
Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v2:
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on
     feedback from Eric Dumazet.

 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB)
     to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss



Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23 +++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  35 +++++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  35 +++++
 net/ipv4/tcp.c                         |  16 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_rdb.c                     | 271 +++++++++++++++++++++++++++++++++
 12 files changed, 397 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH RFC v2 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (4 preceding siblings ...)
  (?)
@ 2015-11-23 16:26 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-11-23 16:26 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

The existing mechanism for detecting thin streams (tcp_stream_is_thin)
is based on a static limit of less than 4 packets in flight. This treats
streams differently depending on the connections RTT, such that a stream
on a high RTT link may never be considered thin, whereas the same
application would produce a stream that would always be thin in a low RTT
scenario (e.g. data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin stream
detection will be independent of the RTT and treat streams equally based
on the transmission pattern, i.e. the inter-transmission time (ITT).
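
As a rough illustration (approximating packets in flight as RTT/ITT):
with the static limit of 4, a stream on a 200 ms RTT link is only
considered thin if the application writes less often than every 50 ms,
while on a 10 ms RTT path the same limit is met as long as the
application writes less often than every 2.5 ms.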

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/net/tcp.h                      | 21 +++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp.c                         |  2 ++
 4 files changed, 40 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 2ea4c45..938ae73 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -700,6 +700,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound inter-transmission time (ITT) threshold
+	for when a stream is considered thin. The value is specified in
+	microseconds, and may not be lower than 10000 (10 ms). Based on
+	this threshold, a dynamic packets in flight limit (DPIFL) is
+	calculated, which is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4fc457b..deac96f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6       /* After 6 linear retries, do exp. backoff */
+/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000
 
 /* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */
 #define TCP_INIT_CWND		10
@@ -274,6 +276,7 @@ extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -1631,6 +1634,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
+ *                              limit
+ * @tp: the tcp_sock struct
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
+{
+	/* Calculate the maximum allowed PIF limit by dividing the RTT by
+	 * the minimum allowed inter-transmission time (ITT).
+	 * Tests if PIF < RTT / ITT-lower-bound
+	 */
+	return (u64) tcp_packets_in_flight(tp) *
+		sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3);
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a0bd7a5..5b12446 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -42,6 +42,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -709,6 +710,14 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound",
+		.data		= &sysctl_tcp_thin_dpifl_itt_lower_bound,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c172877..cb3354d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -287,6 +287,8 @@ int sysctl_tcp_min_tso_segs __read_mostly = 2;
 
 int sysctl_tcp_autocorking __read_mostly = 1;
 
+int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH RFC v2 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (5 preceding siblings ...)
  (?)
@ 2015-11-23 16:26 ` Bendik Rønning Opstad
  2015-11-23 17:43   ` Eric Dumazet
  -1 siblings, 1 reply; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-11-23 16:26 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

RDB is a mechanism that enables a TCP sender to bundle redundant
(already sent) data with TCP packets containing new data. By bundling
(retransmitting) already sent data with each TCP packet containing new
data, the connection will be more resistant to sporadic packet loss
which reduces the application layer latency significantly in congested
scenarios.

The main functionality added:

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to lack of dupACKs. This is because the loss is "repaired" by the
    redundant data in the packet coming after the lost packet. Based on
    incoming ACKs, such hidden loss events are detected, and CWR state
    is entered.

  o When packets are scheduled for transmission, RDB replaces the SKB to
    be sent with a modified SKB containing the redundant data of
    previously sent data segments from the TCP output queue.

  o RDB will only be used for streams classified as thin by the function
    tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT
    for streams that may benefit from RDB, controlled by the sysctl
    variable tcp_thin_dpifl_itt_lower_bound.

RDB is enabled on a connection with the socket option TCP_RDB, or on all
new connections by setting the sysctl variable tcp_rdb=1.
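
A hypothetical userspace sketch of enabling RDB on a connected socket
(TCP_RDB is introduced by this patch and is not in released uapi
headers, so it is defined locally here for illustration):

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	#ifndef TCP_RDB
	#define TCP_RDB 29
	#endif

	static int enable_rdb(int fd)
	{
		int one = 1;

		/* RDB is only useful together with TCP_NODELAY, which this
		 * version of the patch no longer sets implicitly. */
		if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
			return -1;
		return setsockopt(fd, IPPROTO_TCP, TCP_RDB, &one, sizeof(one));
	}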

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  15 ++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  14 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  26 ++++
 net/ipv4/tcp.c                         |  14 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_rdb.c                     | 271 +++++++++++++++++++++++++++++++++
 12 files changed, 357 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 938ae73..1077de1 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -708,6 +708,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	calculated, which is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - BOOLEAN
+	Enable RDB for all new TCP connections.
+	Default: 0
+
+tcp_rdb_max_bytes - INTEGER
+	Enable restriction on how many bytes an RDB packet can contain.
+	This is the total amount of payload including the new unsent data.
+	Default: 0
+
+tcp_rdb_max_skbs - INTEGER
+	Enable restriction on how many previous SKBs in the output queue
+	RDB may include data from. A value of 1 will restrict bundling to
+	only the data from the last packet that was sent.
+	Default: 1
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c9c394b..1639cc0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2806,6 +2806,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b386361..da6dae8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -202,9 +202,10 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 3,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
+		rdb         : 1,/* Redundant Data Bundling enabled      */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index deac96f..d636026 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -277,6 +277,9 @@ extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
+extern int sysctl_tcp_rdb;
+extern int sysctl_tcp_rdb_max_bytes;
+extern int sysctl_tcp_rdb_max_skbs;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -549,6 +552,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 bool tcp_may_send_now(struct sock *sk);
 int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
 int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
@@ -574,6 +579,11 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 
+/* tcp_rdb.c */
+void rdb_ack_event(struct sock *sk, u32 flags);
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask);
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
@@ -772,6 +782,7 @@ struct tcp_skb_cb {
 	union {
 		struct {
 			/* There is space for up to 20 bytes */
+			__u32 rdb_start_seq; /* Start seq of rdb data bundled */
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -1495,6 +1506,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)		\
+	skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+
 static inline struct sk_buff *tcp_send_head(const struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 65a77b0..ae0fba3 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -115,6 +115,7 @@ enum {
 #define TCP_CC_INFO		26	/* Get Congestion Control (optional) info */
 #define TCP_SAVE_SYN		27	/* Record SYN headers for new connections */
 #define TCP_SAVED_SYN		28	/* Get SYN headers recorded for connection */
+#define TCP_RDB			29	/* Enable RDB mechanism */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 152b9c7..7b556c8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -978,7 +978,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -986,6 +986,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(copy_skb_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index c29809f..f2cf496 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tcp_rdb.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 5b12446..9217200 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -718,6 +718,32 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
 	},
 	{
+		.procname	= "tcp_rdb",
+		.data		= &sysctl_tcp_rdb,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_max_bytes",
+		.data		= &sysctl_tcp_rdb_max_bytes,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		/* 0 (the default) disables this limit */
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "tcp_rdb_max_skbs",
+		.data		= &sysctl_tcp_rdb_max_skbs,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cb3354d..298ec4b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -289,6 +289,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
 
 int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
+int sysctl_tcp_rdb __read_mostly;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -408,6 +410,7 @@ void tcp_init_sock(struct sock *sk)
 	u64_stats_init(&tp->syncp);
 
 	tp->reordering = sysctl_tcp_reordering;
+	tp->rdb = sysctl_tcp_rdb;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
@@ -2408,6 +2411,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_RDB:
+		if (val < 0 || val > 1)
+			err = -EINVAL;
+		else
+			tp->rdb = val;
+		break;
+
 	case TCP_REPAIR:
 		if (!tcp_can_repair_sock(sk))
 			err = -EPERM;
@@ -2828,7 +2838,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		val = tp->thin_dupack;
 		break;
-
+	case TCP_RDB:
+		val = tp->rdb;
+		break;
 	case TCP_REPAIR:
 		val = tp->repair;
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fdd88c3..a4901b3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3503,6 +3503,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
 
 	if (icsk->icsk_ca_ops->in_ack_event)
 		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+
+	if (unlikely(tcp_sk(sk)->rdb))
+		rdb_ack_event(sk, flags);
 }
 
 /* This routine deals with incoming acks, but not outgoing ones. */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cb7ca56..bd7d8c5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -900,8 +900,8 @@ out:
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-			    gfp_t gfp_mask)
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet;
@@ -2113,9 +2113,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+		if (unlikely(tcp_sk(sk)->rdb)) {
+			if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
+				break;
+		} else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
 			break;
-
+		}
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..708fefa
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,271 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+int sysctl_tcp_rdb_max_bytes __read_mostly;
+int sysctl_tcp_rdb_max_skbs __read_mostly = 1;
+
+/**
+ * rdb_check_rtx_queue_loss() - Perform loss detection by analysing acks.
+ * @sk: the socket.
+ *
+ * Return: The number of packets that are presumed to be lost.
+ */
+static unsigned int rdb_check_rtx_queue_loss(struct sock *sk)
+{
+	struct sk_buff *skb, *tmp;
+	struct tcp_skb_cb *scb;
+	u32 seq_acked = tcp_sk(sk)->snd_una;
+	unsigned int packets_lost = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == tcp_send_head(sk))
+			break;
+
+		scb = TCP_SKB_CB(skb);
+		/* The ACK acknowledges parts of the data in this SKB.
+		 * Can be caused by:
+		 * - TSO: We abort as RDB is not used on SKBs split across
+		 *        multiple packets on lower layers as these are greater
+		 *        than one MSS.
+		 * - Retrans collapse: We've had a retrans, so loss has already
+		 *                     been detected.
+		 */
+		if (after(scb->end_seq, seq_acked)) {
+			break;
+		/* The ACKed packet */
+		} else if (scb->end_seq == seq_acked) {
+			/* This SKB was sent with no RDB data, or no prior
+			 * unacked SKBs in output queue, so break here.
+			 */
+			if (scb->tx.rdb_start_seq == scb->seq ||
+			    skb_queue_is_first(&sk->sk_write_queue, skb))
+				break;
+			/* Find the number of prior SKBs whose data was bundled
+			 * in this (ACKed) SKB. We presume any redundant data
+			 * covering previous SKBs is due to loss (an exception
+			 * would be reordering).
+			 */
+			skb = skb->prev;
+			tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+				if (!before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
+					packets_lost++;
+				else
+					break;
+			}
+			break;
+		}
+	}
+	return packets_lost;
+}
+
+/**
+ * rdb_ack_event() - Initiate loss detection
+ * @sk: the socket
+ * @flags: The flags
+ */
+void rdb_ack_event(struct sock *sk, u32 flags)
+{
+	if (rdb_check_rtx_queue_loss(sk))
+		tcp_enter_cwr(sk);
+}
+
+/**
+ * skb_append_data() - Copy data from an SKB to the end of another
+ * @from_skb: The SKB to copy data from
+ * @to_skb: The SKB to copy data to
+ *
+ * Return: 0 on success, else error
+ */
+static int skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	/* Copy the linear data and the data from the frags into the linear page
+	 * buffer of to_skb.
+	 */
+	if (WARN_ON(skb_copy_bits(from_skb, 0,
+				  skb_put(to_skb, from_skb->len),
+				  from_skb->len))) {
+		goto fault;
+	}
+
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+	return 0;
+fault:
+	return -EFAULT;
+}
+
+/**
+ * rdb_build_skb() - Builds the new RDB SKB and copies all the data into the
+ *                   linear page buffer.
+ * @sk: the socket
+ * @xmit_skb: This is the SKB that tcp_write_xmit wants to send
+ * @first_skb: The first SKB in the output queue we will bundle
+ * @gfp_mask: The gfp_t allocation
+ * @bytes_in_rdb_skb: The total number of data bytes for the new rdb_skb
+ *                         (NEW + Redundant)
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory allocation
+ *         failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, true);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	tmp_skb = first_skb;
+
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		/* Copy data from tmp_skb to rdb_skb */
+		if (skb_append_data(tmp_skb, rdb_skb))
+			return NULL;
+		/* We are at the last skb that should be included (The unsent
+		 * one)
+		 */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_test() - test if redundant data can be bundled
+ * @sk: the socket
+ * @xmit_skb: The SKB processed for transmission by the output engine
+ * @mss_now: The current mss value
+ * @bytes_in_rdb_skb: Will contain the resulting number of bytes to bundle
+ *                         at exit.
+ * @skbs_to_bundle_count: The total number of SKBs to be in the bundle
+ *
+ * Traverses the entire write queue and checks if any un-acked data
+ * may be bundled.
+ *
+ * Return: The first SKB to be in the bundle, or NULL if no bundling
+ */
+static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
+					    struct sk_buff *xmit_skb,
+					    unsigned int mss_now,
+					    u32 *bytes_in_rdb_skb,
+					    u32 *skbs_to_bundle_count)
+{
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* 1 to account for current skb */
+	u32 byte_count = xmit_skb->len;
+
+	/* We start at the skb before xmit_skb, and go backwards in the list. */
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		/* Not enough room to bundle data from this SKB */
+		if ((byte_count + skb->len) > mss_now)
+			break;
+
+		if (sysctl_tcp_rdb_max_bytes &&
+		    ((byte_count + skb->len) > sysctl_tcp_rdb_max_bytes))
+			break;
+
+		if (sysctl_tcp_rdb_max_skbs &&
+		    (skbs_in_bundle_count > sysctl_tcp_rdb_max_skbs))
+			break;
+
+		byte_count += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = byte_count;
+	*skbs_to_bundle_count = skbs_in_bundle_count;
+	return first_to_bundle;
+}
+
+/**
+ * create_rdb_skb() - Try to create an RDB SKB
+ * @sk: the socket
+ * @xmit_skb: The SKB from the output queue to be sent
+ * @mss_now: Current MSS
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if no bundling could be
+ *         performed
+ */
+struct sk_buff *create_rdb_skb(const struct sock *sk, struct sk_buff *xmit_skb,
+			       unsigned int mss_now, u32 *bytes_in_rdb_skb,
+			       gfp_t gfp_mask)
+{
+	u32 skb_in_bundle_count;
+	struct sk_buff *first_to_bundle;
+
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb))
+		return NULL;
+
+	/* No bundling on FIN packet */
+	if (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN)
+		return NULL;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
+					       bytes_in_rdb_skb,
+					       &skb_in_bundle_count);
+	if (!first_to_bundle)
+		return NULL;
+
+	/* Create an SKB that contains the data from 'skb_in_bundle_count'
+	 * SKBs.
+	 */
+	return rdb_build_skb(sk, xmit_skb, first_to_bundle,
+			     *bytes_in_rdb_skb, gfp_mask);
+}
+
+/**
+ * tcp_transmit_rdb_skb() - Try to create and send an RDB packet
+ * @sk: the socket
+ * @xmit_skb: The SKB processed for transmission by the output engine
+ * @mss_now: Current MSS
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: 0 if successfully sent packet, else error
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb = NULL;
+	u32 bytes_in_rdb_skb = 0; /* May be used for statistical purposes */
+
+	/* How we detect that RDB was used. When equal, no RDB data was sent */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	if (tcp_stream_is_thin_dpifl(tcp_sk(sk))) {
+		rdb_skb = create_rdb_skb(sk, xmit_skb, mss_now,
+					 &bytes_in_rdb_skb, gfp_mask);
+		if (!rdb_skb)
+			goto xmit_default;
+
+		/* Set tstamp for SKB in output queue, because tcp_transmit_skb
+		 * will do this for the rdb_skb and not the SKB in the output
+		 * queue (xmit_skb).
+		 */
+		skb_mstamp_get(&xmit_skb->skb_mstamp);
+		rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+		return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+	}
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread
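
To make the loss detection in rdb_check_rtx_queue_loss() above concrete,
here is a made-up walk-through (all sequence numbers are illustrative only):

  o skb A carries bytes 1000-1099 and is sent with nothing bundled, so
    A->tx.rdb_start_seq == A->seq == 1000.
  o skb B carries new bytes 1100-1199 and bundles A's data, so the packet
    on the wire covers 1000-1199 and B->tx.rdb_start_seq == 1000.
  o The packet carrying A is lost, but the packet carrying B arrives and
    "repairs" the loss, so the receiver ACKs 1200 without ever sending a
    dupACK.
  o When that ACK is processed, B is the skb whose end_seq matches
    snd_una. Since B->tx.rdb_start_seq (1000) differs from B->seq (1100),
    the backwards walk finds that A's data was covered by the bundle,
    packets_lost becomes 1, and rdb_ack_event() enters CWR via
    tcp_enter_cwr().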

* Re: [PATCH RFC v2 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-11-23 16:26 ` [PATCH RFC v2 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
@ 2015-11-23 17:43   ` Eric Dumazet
  2015-11-23 20:05     ` Bendik Rønning Opstad
  0 siblings, 1 reply; 81+ messages in thread
From: Eric Dumazet @ 2015-11-23 17:43 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Mon, 2015-11-23 at 17:26 +0100, Bendik Rønning Opstad wrote:

> +
> +tcp_rdb_max_skbs - INTEGER
> +	Enable restriction on how many previous SKBs in the output queue
> +	RDB may include data from. A value of 1 will restrict bundling to
> +	only the data from the last packet that was sent.
> +	Default: 1
> +

skb is an internal thing. I would rather not expose a sysctl with such
name.

Can be multi segment or not (if GSO/TSO is enabled)

So even '1' skb can have very different content, from 1 byte to ~64 KB

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH RFC v2 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-11-23 17:43   ` Eric Dumazet
@ 2015-11-23 20:05     ` Bendik Rønning Opstad
  0 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2015-11-23 20:05 UTC (permalink / raw)
  To: Eric Dumazet, Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On 23/11/15 18:43, Eric Dumazet wrote:
> On Mon, 2015-11-23 at 17:26 +0100, Bendik Rønning Opstad wrote:
> 
>> > +
>> > +tcp_rdb_max_skbs - INTEGER
>> > +	Enable restriction on how many previous SKBs in the output queue
>> > +	RDB may include data from. A value of 1 will restrict bundling to
>> > +	only the data from the last packet that was sent.
>> > +	Default: 1
>> > +
> skb is an internal thing. I would rather not expose a sysctl with such
> name.
> 
> Can be multi segment or not (if GSO/TSO is enabled)
> 
> So even '1' skb can have very different content, from 1 byte to ~64 KB

I see your point about not exposing the internal naming. What about
tcp_rdb_max_packets?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v3 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (6 preceding siblings ...)
  (?)
@ 2016-02-02 19:23 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-02 19:23 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad


Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.
Latency-sensitive applications or services, such as online games and
remote desktop, produce traffic with thin-stream characteristics,
characterized by small packets and a relatively high ITT. By bundling
already sent data in packets with new data, RDB alleviates head-of-line
blocking by reducing the need to retransmit data segments when packets
are lost. RDB is a continuation on the work on latency improvements for
TCP in Linux, previously resulting in two thin-stream mechanisms in the
Linux kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows
significant latency reductions when packet loss occurs[1]. The tests
show that, by imposing restrictions on the bundling rate, it can be
made not to negatively affect competing traffic in an unfair manner.

Note: Current patch set depends on the patch "tcp: refactor struct tcp_skb_cb"
(http://patchwork.ozlabs.org/patch/510674)

These patches have also been tested with a set of packetdrill scripts
located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as
in the paper "Latency and Fairness Trade-Off for Thin Streams using
Redundant Data Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to
     tcp_rdb_max_packets after comment from Eric Dumazet about
     not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on
     feedback from Eric Dumazet.

 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB)
     to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)


Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23 +++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  35 +++++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  35 +++++
 net/ipv4/tcp.c                         |  16 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_rdb.c                     | 273 +++++++++++++++++++++++++++++++++
 12 files changed, 399 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v3 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (7 preceding siblings ...)
  (?)
@ 2016-02-02 19:23 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-02 19:23 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

The existing mechanism for detecting thin streams (tcp_stream_is_thin)
is based on a static limit of less than 4 packets in flight. This treats
streams differently depending on the connection's RTT, such that a stream
on a high RTT link may never be considered thin, whereas the same
application would produce a stream that would always be thin in a low RTT
scenario (e.g. data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin stream
detection will be independent of the RTT and treat streams equally based
on the transmission pattern, i.e. the inter-transmission time (ITT).
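
As a rough, illustrative calculation: with the default
tcp_thin_dpifl_itt_lower_bound of 10000 (10 ms), a connection with an
RTT of 100 ms is classified as thin as long as it has fewer than
100000 / 10000 = 10 packets in flight, while a connection with an RTT
of 20 ms must have fewer than 2. tcp_stream_is_thin_dpifl() expresses
this as PIF * itt_lower_bound < srtt, which avoids a division.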

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/net/tcp.h                      | 21 +++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp.c                         |  2 ++
 4 files changed, 40 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 73b36d7..eb42853 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound inter-transmission time (ITT) threshold
+	for when a stream is considered thin. The value is specified in
+	microseconds, and may not be lower than 10000 (10 ms). Based on
+	this threshold, a dynamic packets in flight limit (DPIFL) is
+	calculated, which is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3dd20fe..2d86bd7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6       /* After 6 linear retries, do exp. backoff */
+/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000
 
 /* TCP initial congestion window as per rfc6928 */
 #define TCP_INIT_CWND		10
@@ -271,6 +273,7 @@ extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -1649,6 +1652,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
+ *                              limit
+ * @tp: the tcp_sock struct
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
+{
+	/* Calculate the maximum allowed PIF limit by dividing the RTT by
+	 * the minimum allowed inter-transmission time (ITT).
+	 * Tests if PIF < RTT / ITT-lower-bound
+	 */
+	return (u64) tcp_packets_in_flight(tp) *
+		sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3);
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 4d367b4..6014bc4 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -687,6 +688,14 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound",
+		.data		= &sysctl_tcp_thin_dpifl_itt_lower_bound,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 19746b3..16087fe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -288,6 +288,8 @@ int sysctl_tcp_min_tso_segs __read_mostly = 2;
 
 int sysctl_tcp_autocorking __read_mostly = 1;
 
+int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (8 preceding siblings ...)
  (?)
@ 2016-02-02 19:23 ` Bendik Rønning Opstad
  2016-02-02 20:35   ` Eric Dumazet
  -1 siblings, 1 reply; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-02 19:23 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

RDB is a mechanism that enables a TCP sender to bundle redundant
(already sent) data with TCP packets containing new data. By bundling
(retransmitting) already sent data with each TCP packet containing new
data, the connection will be more resistant to sporadic packet loss,
which reduces the application layer latency significantly in congested
scenarios.
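
As a made-up illustration of the effect: an application writing a small
message every 30 ms, with bundling limited to one previous packet,
transmits P1=[A], P2=[A,B], P3=[B,C], and so on. If the packet carrying
P2 is lost, P3 still delivers B one ITT later, so the receiver never has
to wait for a retransmission, which for such a thin flow typically means
waiting for an RTO (200 ms minimum) since there are rarely enough
dupACKs to trigger fast retransmit.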

The main functionality added:

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to lack of dupACKs. This is because the loss is "repaired" by the
    redundant data in the packet coming after the lost packet. Based on
    incoming ACKs, such hidden loss events are detected, and CWR state
    is entered.

  o When packets are scheduled for transmission, RDB replaces the SKB to
    be sent with a modified SKB containing the redundant data of
    previously sent data segments from the TCP output queue.

  o RDB will only be used for streams classified as thin by the function
    tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT
    for streams that may benefit from RDB, controlled by the sysctl
    variable tcp_thin_dpifl_itt_lower_bound.

RDB is enabled on a connection with the socket option TCP_RDB, or on all
new connections by setting the sysctl variable tcp_rdb=1.
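
As a minimal usage sketch (not part of the patch), an application could
request RDB on a single connection roughly like this; the TCP_RDB value
of 29 is the one introduced by this patch:

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  #ifndef TCP_RDB
  #define TCP_RDB 29	/* value introduced by this patch */
  #endif

  static int enable_rdb(int fd)
  {
  	int one = 1;

  	/* Returns 0 on success; fails with ENOPROTOOPT on kernels
  	 * without this patch.
  	 */
  	return setsockopt(fd, IPPROTO_TCP, TCP_RDB, &one, sizeof(one));
  }

Alternatively, sysctl net.ipv4.tcp_rdb=1 enables RDB for all new
connections, as described in ip-sysctl.txt above.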

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  15 ++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  14 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  26 ++++
 net/ipv4/tcp.c                         |  14 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_rdb.c                     | 273 +++++++++++++++++++++++++++++++++
 12 files changed, 359 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index eb42853..14f960d 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	calculated, which is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - BOOLEAN
+	Enable RDB for all new TCP connections.
+	Default: 0
+
+tcp_rdb_max_bytes - INTEGER
+	Enable restriction on how many bytes an RDB packet can contain.
+	This is the total amount of payload including the new unsent data.
+	Default: 0
+
+tcp_rdb_max_packets - INTEGER
+	Enable restriction on how many previous packets in the output queue
+	RDB may include data from. A value of 1 will restrict bundling to
+	only the data from the last packet that was sent.
+	Default: 1
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 11f935c..eb81877 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2914,6 +2914,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b386361..da6dae8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -202,9 +202,10 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 3,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
+		rdb         : 1,/* Redundant Data Bundling enabled      */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2d86bd7..bc6e81b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -274,6 +274,9 @@ extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
+extern int sysctl_tcp_rdb;
+extern int sysctl_tcp_rdb_max_bytes;
+extern int sysctl_tcp_rdb_max_packets;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -547,6 +550,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 bool tcp_may_send_now(struct sock *sk);
 int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
 int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
@@ -572,6 +577,11 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 
+/* tcp_rdb.c */
+void rdb_ack_event(struct sock *sk, u32 flags);
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask);
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
@@ -770,6 +780,7 @@ struct tcp_skb_cb {
 	union {
 		struct {
 			/* There is space for up to 20 bytes */
+			__u32 rdb_start_seq; /* Start seq of rdb data bundled */
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -1501,6 +1512,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)		\
+	skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+
 static inline struct sk_buff *tcp_send_head(const struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 65a77b0..ae0fba3 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -115,6 +115,7 @@ enum {
 #define TCP_CC_INFO		26	/* Get Congestion Control (optional) info */
 #define TCP_SAVE_SYN		27	/* Record SYN headers for new connections */
 #define TCP_SAVED_SYN		28	/* Get SYN headers recorded for connection */
+#define TCP_RDB			29	/* Enable RDB mechanism */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b2df375..d23058b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -978,7 +978,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -986,6 +986,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(copy_skb_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 62c049b..3c55ba9 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tcp_rdb.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 6014bc4..f7c9b1e 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -696,6 +696,32 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
 	},
 	{
+		.procname	= "tcp_rdb",
+		.data		= &sysctl_tcp_rdb,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_max_bytes",
+		.data		= &sysctl_tcp_rdb_max_bytes,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		/* 0 (the default) disables this limit */
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "tcp_rdb_max_packets",
+		.data		= &sysctl_tcp_rdb_max_packets,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 16087fe..ea9d6ef 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -290,6 +290,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
 
 int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
+int sysctl_tcp_rdb __read_mostly;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -409,6 +411,7 @@ void tcp_init_sock(struct sock *sk)
 	u64_stats_init(&tp->syncp);
 
 	tp->reordering = sysctl_tcp_reordering;
+	tp->rdb = sysctl_tcp_rdb;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
@@ -2409,6 +2412,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_RDB:
+		if (val < 0 || val > 1)
+			err = -EINVAL;
+		else
+			tp->rdb = val;
+		break;
+
 	case TCP_REPAIR:
 		if (!tcp_can_repair_sock(sk))
 			err = -EPERM;
@@ -2832,7 +2842,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		val = tp->thin_dupack;
 		break;
-
+	case TCP_RDB:
+		val = tp->rdb;
+		break;
 	case TCP_REPAIR:
 		val = tp->repair;
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1c2a734..1fda930 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3504,6 +3504,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
 
 	if (icsk->icsk_ca_ops->in_ack_event)
 		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+
+	if (unlikely(tcp_sk(sk)->rdb))
+		rdb_ack_event(sk, flags);
 }
 
 /* This routine deals with incoming acks, but not outgoing ones. */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fda379c..e24ab6a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -900,8 +900,8 @@ out:
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-			    gfp_t gfp_mask)
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet;
@@ -2113,9 +2113,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+		if (unlikely(tcp_sk(sk)->rdb)) {
+			if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
+				break;
+		} else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
 			break;
-
+		}
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..50439eb
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,273 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+int sysctl_tcp_rdb_max_bytes __read_mostly;
+int sysctl_tcp_rdb_max_packets __read_mostly = 1;
+
+/**
+ * rdb_check_rtx_queue_loss() - perform loss detection by analysing acks.
+ * @sk: socket.
+ *
+ * Return: The number of packets that are presumed to be lost.
+ */
+static unsigned int rdb_check_rtx_queue_loss(struct sock *sk)
+{
+	struct sk_buff *skb, *tmp;
+	struct tcp_skb_cb *scb;
+	u32 seq_acked = tcp_sk(sk)->snd_una;
+	unsigned int packets_lost = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == tcp_send_head(sk))
+			break;
+
+		scb = TCP_SKB_CB(skb);
+		/* The ACK acknowledges parts of the data in this SKB.
+		 * Can be caused by:
+		 * - TSO: We abort as RDB is not used on SKBs split across
+		 *        multiple packets on lower layers as these are greater
+		 *        than one MSS.
+		 * - Retrans collapse: We've had a retrans, so loss has already
+		 *                     been detected.
+		 */
+		if (after(scb->end_seq, seq_acked)) {
+			break;
+		/* The ACKed packet */
+		} else if (scb->end_seq == seq_acked) {
+			/* This SKB was sent with no RDB data, or no prior
+			 * unacked SKBs in output queue, so break here.
+			 */
+			if (scb->tx.rdb_start_seq == scb->seq ||
+			    skb_queue_is_first(&sk->sk_write_queue, skb))
+				break;
+			/* Find the number of prior SKBs whose data was bundled
+			 * in this (ACKed) SKB. We presume any redundant data
+			 * covering previous SKBs is due to loss (an exception
+			 * would be reordering).
+			 */
+			skb = skb->prev;
+			tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+				if (!before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
+					packets_lost++;
+				else
+					break;
+			}
+			break;
+		}
+	}
+	return packets_lost;
+}
+
+/**
+ * rdb_ack_event() - initiate loss detection
+ * @sk: socket
+ * @flags: flags
+ */
+void rdb_ack_event(struct sock *sk, u32 flags)
+{
+	if (rdb_check_rtx_queue_loss(sk))
+		tcp_enter_cwr(sk);
+}
+
+/**
+ * skb_append_data() - copy data from an SKB to the end of another
+ * @from_skb: the SKB to copy data from
+ * @to_skb: the SKB to copy data to
+ *
+ * Return: 0 on success, else error
+ */
+static int skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	/* Copy the linear data and the data from the frags into the linear page
+	 * buffer of to_skb.
+	 */
+	if (WARN_ON(skb_copy_bits(from_skb, 0,
+				  skb_put(to_skb, from_skb->len),
+				  from_skb->len))) {
+		goto fault;
+	}
+
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+	return 0;
+fault:
+	return -EFAULT;
+}
+
+/**
+ * rdb_build_skb() - build the new RDB SKB and copy all the data into the
+ *                   linear page buffer.
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @first_skb: the first SKB in the output queue to be bundled
+ * @bytes_in_rdb_skb: the total number of data bytes for the new rdb_skb
+ *                    (NEW + Redundant)
+ * @gfp_mask: gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory allocation
+ *         failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, true);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	tmp_skb = first_skb;
+
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		/* Copy data from tmp_skb to rdb_skb */
+		if (skb_append_data(tmp_skb, rdb_skb))
+			return NULL;
+		/* We are at the last skb that should be included (The unsent
+		 * one)
+		 */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_test() - test if redundant data can be bundled
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @mss_now: current mss value
+ * @bytes_in_rdb_skb: store the total number of payload bytes in the
+ *                    RDB SKB if bundling can be performed.
+ * @skbs_to_bundle_count: the total number of SKBs to be in the bundle
+ *
+ * Traverse the output queue and check if any un-acked data may be
+ * bundled.
+ *
+ * Return: The first SKB to be in the bundle, or NULL if no bundling
+ */
+static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
+					   struct sk_buff *xmit_skb,
+					   unsigned int mss_now,
+					   u32 *bytes_in_rdb_skb,
+					   u32 *skbs_to_bundle_count)
+{
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* 1 to account for current skb */
+	u32 byte_count = xmit_skb->len;
+
+	/* We start at the skb before xmit_skb, and go backwards in the list. */
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		/* Not enough room to bundle data from this SKB */
+		if ((byte_count + skb->len) > mss_now)
+			break;
+
+		if (sysctl_tcp_rdb_max_bytes &&
+		    ((byte_count + skb->len) > sysctl_tcp_rdb_max_bytes))
+			break;
+
+		if (sysctl_tcp_rdb_max_packets &&
+		    (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets))
+			break;
+
+		byte_count += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = byte_count;
+	*skbs_to_bundle_count = skbs_in_bundle_count;
+	return first_to_bundle;
+}
+
+/**
+ * create_rdb_skb() - try to create an RDB SKB
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @mss_now: current mss value
+ * @bytes_in_rdb_skb: store the total number of payload bytes in the
+ *                    RDB SKB if bundling can be performed.
+ * @gfp_mask: gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if no bundling could be
+ *         performed
+ */
+struct sk_buff *create_rdb_skb(const struct sock *sk, struct sk_buff *xmit_skb,
+			       unsigned int mss_now, u32 *bytes_in_rdb_skb,
+			       gfp_t gfp_mask)
+{
+	u32 skb_in_bundle_count;
+	struct sk_buff *first_to_bundle;
+
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb))
+		return NULL;
+
+	/* No bundling on FIN packet */
+	if (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN)
+		return NULL;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
+					       bytes_in_rdb_skb,
+					       &skb_in_bundle_count);
+	if (!first_to_bundle)
+		return NULL;
+
+	/* Create an SKB that contains the data from 'skb_in_bundle_count'
+	 * SKBs.
+	 */
+	return rdb_build_skb(sk, xmit_skb, first_to_bundle,
+			     *bytes_in_rdb_skb, gfp_mask);
+}
+
+/**
+ * tcp_transmit_rdb_skb() - try to create and send an RDB packet
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @mss_now: current mss value
+ * @gfp_mask: gfp_t allocation
+ *
+ * Return: 0 if successfully sent packet, else error
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb = NULL;
+	u32 bytes_in_rdb_skb = 0; /* May be used for statistical purposes */
+
+	/* How we detect that RDB was used. When equal, no RDB data was sent */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	if (tcp_stream_is_thin_dpifl(tcp_sk(sk))) {
+		rdb_skb = create_rdb_skb(sk, xmit_skb, mss_now,
+					 &bytes_in_rdb_skb, gfp_mask);
+		if (!rdb_skb)
+			goto xmit_default;
+
+		/* Set tstamp for SKB in output queue, because tcp_transmit_skb
+		 * will do this for the rdb_skb and not the SKB in the output
+		 * queue (xmit_skb).
+		 */
+		skb_mstamp_get(&xmit_skb->skb_mstamp);
+		rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+		return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+	}
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-02-02 19:23 ` [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
@ 2016-02-02 20:35   ` Eric Dumazet
  2016-02-03 18:17     ` Bendik Rønning Opstad
  0 siblings, 1 reply; 81+ messages in thread
From: Eric Dumazet @ 2016-02-02 20:35 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote:
> RDB is a mechanism that enables a TCP sender to bundle redundant
> (already sent) data with TCP packets containing new data. By bundling
> (retransmitting) already sent data with each TCP packet containing new
> data, the connection will be more resistant to sporadic packet loss
> which reduces the application layer latency significantly in congested
> scenarios.
> 
> The main functionality added:
> 
>   o Loss detection of hidden loss events: When bundling redundant data
>     with each packet, packet loss can be hidden from the TCP engine due
>     to lack of dupACKs. This is because the loss is "repaired" by the
>     redundant data in the packet coming after the lost packet. Based on
>     incoming ACKs, such hidden loss events are detected, and CWR state
>     is entered.
> 
>   o When packets are scheduled for transmission, RDB replaces the SKB to
>     be sent with a modified SKB containing the redundant data of
>     previously sent data segments from the TCP output queue.

Really this looks very complicated.

Why not simply append the new skb content to prior one ?

skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is
really available (ie its clone not sitting/waiting in a qdisc on the
host)

Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
MAX_TCP_HEADER) available bytes in skb->data.

Also note that tcp_collapse_retrans() is very similar to your needs. You
might simply expand it.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-02-02 20:35   ` Eric Dumazet
@ 2016-02-03 18:17     ` Bendik Rønning Opstad
  2016-02-03 19:34       ` Eric Dumazet
  0 siblings, 1 reply; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-03 18:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote:
>>
>>   o When packets are scheduled for transmission, RDB replaces the SKB to
>>     be sent with a modified SKB containing the redundant data of
>>     previously sent data segments from the TCP output queue.
>
> Really this looks very complicated.

Can you be more specific?

> Why not simply append the new skb content to prior one ?

It's not clear to me what you mean. At what stage in the output engine
do you refer to?

We want to avoid modifying the data of the SKBs in the output queue,
therefore we allocate a new SKB (This SKB is named rdb_skb in the code).
The header and payload of the first SKB containing data we want to
redundantly transmit is then copied. Then the payload of the SKBs following
next in the output queue is appended onto the rdb_skb. The last payload
that is appended is from the first SKB with unsent data, i.e. the
sk_send_head.
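
In other words, with S1 and S2 un-acked in the output queue and S3 as the
send head, the rdb_skb starts at S1's sequence number, its payload is the
payloads of S1, S2 and S3 copied back to back, and it is handed to
tcp_transmit_skb() in place of S3, while S1-S3 stay unmodified in the
output queue.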

Would you suggest a different approach?

> skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is
> really available (ie its clone not sitting/waiting in a qdisc on the
> host)

Where do you suggest this should be used?

> Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
> MAX_TCP_HEADER) available bytes in skb->data.

Sure, rdb_build_skb() could use this instead of the calculated
bytes_in_rdb_skb.

> Also note that tcp_collapse_retrans() is very similar to your needs. You
> might simply expand it.

The functionality shared is the copying of data from one SKB to another, as
well as adjusting sequence numbers and checksum. Unlinking SKBs from the
output queue, modifying the data of SKBs in the output queue, and changing
retrans hints are not shared.

To reduce code duplication, the function skb_append_data in tcp_rdb.c could
be moved to tcp_output.c, and then be called from tcp_collapse_retrans.

Is it something like this you had in mind?


Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-02-03 18:17     ` Bendik Rønning Opstad
@ 2016-02-03 19:34       ` Eric Dumazet
       [not found]         ` <CAF8eE=VOuoNLQHtkRwM9ZG+vJ-uH2ufVW5y_pS24rGqWh4Qa2g@mail.gmail.com>
  2016-02-08 17:38         ` Bendik Rønning Opstad
  0 siblings, 2 replies; 81+ messages in thread
From: Eric Dumazet @ 2016-02-03 19:34 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, Netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote:
> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote:
> >>
> >>   o When packets are scheduled for transmission, RDB replaces the SKB to
> >>     be sent with a modified SKB containing the redundant data of
> >>     previously sent data segments from the TCP output queue.
> >
> > Really this looks very complicated.
> 
> Can you be more specific?

A lot of code added, needing maintenance cost for years to come.

> 
> > Why not simply append the new skb content to prior one ?
> 
> It's not clear to me what you mean. At what stage in the output engine
> do you refer to?
> 
> We want to avoid modifying the data of the SKBs in the output queue,

Why ? We already do that, as I pointed out.

> therefore we allocate a new SKB (This SKB is named rdb_skb in the code).
> The header and payload of the first SKB containing data we want to
> redundantly transmit is then copied. Then the payload of the SKBs following
> next in the output queue is appended onto the rdb_skb. The last payload
> that is appended is from the first SKB with unsent data, i.e. the
> sk_send_head.
> 
> Would you suggest a different approach?
> 
> > skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is
> > really available (ie its clone not sitting/waiting in a qdisc on the
> > host)
> 
> Where do you suggest this should be used?

To detect if appending data to prior skb is possible.

If the prior packet is still in qdisc, no change is allowed,
and it is fine : RDB should not trigger anyway.

> 
> > Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
> > MAX_TCP_HEADER) available bytes in skb->data.
> 
> Sure, rdb_build_skb() could use this instead of the calculated
> bytes_in_rdb_skb.

Point is : small packets already have tail room in skb->head

When RDB decides a packet should be merged into the prior one, you can
simply copy payload into the tailroom, then free the skb.

No skb allocations are needed, only freeing.

RDB could be implemented in a more concise way.
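
A rough sketch of that direction, for illustration only (checksum update,
pcount and memory accounting, and advancing the send head are all left
out), could look something like:

  static bool rdb_merge_into_prev(struct sock *sk, struct sk_buff *xmit_skb)
  {
  	/* Assumes xmit_skb is not first in the write queue. */
  	struct sk_buff *prev = tcp_write_queue_prev(sk, xmit_skb);

  	/* skb_still_in_host_queue() is currently static in tcp_output.c */
  	if (skb_still_in_host_queue(sk, prev))
  		return false;	/* clone still sitting in qdisc/driver queue */

  	if (skb_is_nonlinear(prev) || skb_tailroom(prev) < xmit_skb->len)
  		return false;	/* need linear tailroom to append into */

  	if (skb_copy_bits(xmit_skb, 0,
  			  skb_put(prev, xmit_skb->len), xmit_skb->len))
  		return false;

  	TCP_SKB_CB(prev)->end_seq = TCP_SKB_CB(xmit_skb)->end_seq;
  	/* ... then unlink and free xmit_skb ... */
  	return true;
  }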

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
       [not found]         ` <CAF8eE=VOuoNLQHtkRwM9ZG+vJ-uH2ufVW5y_pS24rGqWh4Qa2g@mail.gmail.com>
@ 2016-02-08 17:30           ` Bendik Rønning Opstad
  0 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-08 17:30 UTC (permalink / raw)
  To: Bendik Rønning Opstad, Eric Dumazet
  Cc: David S. Miller, Netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

Sorry guys, I messed up that email by including HTML, and it got
rejected by netdev@vger.kernel.org. I'll resend it properly formatted.

Bendik

On 08/02/16 18:17, Bendik Rønning Opstad wrote:
> Eric, thank you for the feedback!
> 
> On Wed, Feb 3, 2016 at 8:34 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote:
>>> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> Really this looks very complicated.
>>>
>>> Can you be more specific?
>>
>> A lot of code added, needing maintenance cost for years to come.
> 
> Yes, that is understandable.
> 
>>>> Why not simply append the new skb content to prior one ?
>>>
>>> It's not clear to me what you mean. At what stage in the output engine
>>> do you refer to?
>>>
>>> We want to avoid modifying the data of the SKBs in the output queue,
>>
>> Why ? We already do that, as I pointed out.
> 
> I suspect that we might be talking past each other. It wasn't clear to
> me that we were discussing how to implement this in a different way.
> 
> The current retrans collapse functionality only merges SKBs that
> contain data that has already been sent and is about to be
> retransmitted.
> 
> This differs significantly from RDB, which combines both already
> transmitted data and unsent data in the same packet without changing
> how the data is stored (and the state tracked) in the output queue.
> Another difference is that RDB includes un-ACKed data that is not
> considered lost.
> 
>>> therefore we allocate a new SKB (This SKB is named rdb_skb in the code).
>>> The header and payload of the first SKB containing data we want to
>>> redundantly transmit is then copied. Then the payload of the SKBs following
>>> next in the output queue is appended onto the rdb_skb. The last payload
>>> that is appended is from the first SKB with unsent data, i.e. the
>>> sk_send_head.
>>>
>>> Would you suggest a different approach?
>>>
>>>> skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is
>>>> really available (ie its clone not sitting/waiting in a qdisc on the
>>>> host)
>>>
>>> Where do you suggest this should be used?
>>
>> To detect if appending data to prior skb is possible.
> 
> I see. As the implementation intentionally avoids modifying SKBs in
> the output queue, this was not obvious.
> 
>> If the prior packet is still in qdisc, no change is allowed,
>> and it is fine : RDB should not trigger anyway.
> 
> Actually, whether the data in the prior SKB is on the wire or is still
> on the host (in qdisc/driver queue) is not relevant. RDB always wants
> to redundantly resend the data if there is room in the packet, because
> the previous packet may become lost.
> 
>>>> Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
>>>> MAX_TCP_HEADER) available bytes in skb->data.
>>>
>>> Sure, rdb_build_skb() could use this instead of the calculated
>>> bytes_in_rdb_skb.
>>
>> Point is : small packets already have tail room in skb->head
> 
> Yes, I'm aware of that. But we do not allocate new SKBs because we
> think the existing SKBs do not have enough space available. We do it
> to avoid modifications to the SKBs in the output queue.
> 
>> When RDB decides a packet should be merged into the prior one, you can
>> simply copy payload into the tailroom, then free the skb.
>>
>> No skb allocations are needed, only freeing.
> 
> It wasn't clear to me that you suggest a completely different
> implementation approach altogether.
> 
> As I understand you, the approach you suggest is as follows:
> 
> 1. An SKB containing unsent data is processed for transmission (lets
>    call it T_SKB)
> 2. Check if the previous SKB (lets call it P_SKB) (containing sent but
>    un-ACKed data) has available (tail) room for the payload contained
>    in T_SKB.
> 3. If room in P_SKB:
>   * Copy the unsent data from T_SKB to P_SKB by appending it to the
>     linear data and update sequence numbers.
>   * Remove T_SKB (which contains only the new and unsent data) from
>     the output queue.
>   * Transmit P_SKB, which now contains some already sent data and some
>     unsent data.
> 
> 
> If I have misunderstood, can you please elaborate in detail what you
> mean?
> 
> If this is the approach you suggest, I can think of some potential
> downsides that require further consideration:
> 
> 
> 1) ACK-accounting will work differently
> 
> When the previous SKB (P_SKB) is modified by appending the data of the
> next SKB (T_SKB), what should happen when an incoming ACK
> acknowledges the data that was sent in the original transmission
> (before the SKB was modified), but not the data that was appended
> later? tcp_clean_rtx_queue currently handles partially ACKed SKBs due
> to TSO, in which case the tcp_skb_pcount(skb) > 1. So this function
> would need to be modified to handle this for RDB modified SKBs in the
> queue, where all the data is located in the linear data buffer (no GSO
> segs).
> 
> How should SACK and retrans flags be handled when one SKB in the
> output queue can represent multiple transmitted packets?
> 
> 
> 2) Timestamps and RTT measurements
> 
> How should RTT measurements work when you don't have a timestamp for
> the data that was newly appended to the existing SKB containing sent
> but un-ACKed data? Or should the skb->skb_mstamp be updated when the
> SKB with newly appended data is sent again? That would make any RTT
> measurements based on ACKs on the originally sent packet unusable.
> 
> 
> 3) Retransmit and lost SKB hints
> 
> Appending unsent data to SKBs with sent data will affect the usage of
> tp->retransmit_skb_hint and tp->lost_skb_hint. As these variables
> contain pointers to SKBs in the output queue, it is implied that all
> the data in an SKB has the same state, such as retransmitted or lost.
> 
> 
> 4) RDB's loss accounting
> 
> RDB detects loss by looking at how many segments are ACKed. If an
> incoming ACK acknowledges data in multiple SKBs, we can infer that
> loss has occurred (ignoring the possibility of reordering). With the
> approach you suggest, we lose the information about how many packets
> we originally had, and how much of the payload was redundant
> (considering SKBs are updated with new data and sent out again). We
> would need additional variables in order to keep track of this.
> 
> 
> 5) Forced bundling on retransmissions
> 
> Since the SKBs in the output queue are modified to contain redundant
> data, retransmissions of the SKBs will necessarily only contain the
> redundant data unless the SKBs are modified before the retransmission.
> 
> 
> 6) Configuring how much is bundled becomes complex
> 
> When previous SKBs are to be used by appending the new data to be
> sent, it is no longer possible to configure the amount of data to
> bundle. We are forced to bundle all the data in the previous SKB.
> 
> Say we have 3 SKBs in the queue, with unsent segments 1, 2, 3:
> [1] [2] [3]
> 
> Send 1:
> [1] ->
> Try to send 2, but first merge 2 with 1:
> [1,2] [3]
> Send merged SKB:
> [1,2] ->
> 
> When we want to send segment 3, we are forced to bundle both 1 and 2.
> Try to send 3, but first merge 3 with 1,2.
> [1,2,3]
> Send merged SKB:
> [1,2,3] ->
> 
> Transmitting only 2,3 in a packet then becomes difficult without
> additional logic for RDB record keeping.
> 
> 
>> RDB could be implemented in a more concise way.
> 
> I'm open to suggestions for improvements. However, I can't see how the
> suggested approach (as I've understood it) can be implemented without
> making extensive modifications to the current TCP engine. Having one
> SKB represent multiple packets, where each packet contains different data
> and may be in a different state (retransmitted/lost), seems very complex.
> 
> By avoiding any modifications to the output queue we ensure the
> default code branch is completely unaffected, avoiding any special
> handling in multiple locations in the codebase.
> 
> 
> Regards,
> 
> Bendik
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-02-03 19:34       ` Eric Dumazet
       [not found]         ` <CAF8eE=VOuoNLQHtkRwM9ZG+vJ-uH2ufVW5y_pS24rGqWh4Qa2g@mail.gmail.com>
@ 2016-02-08 17:38         ` Bendik Rønning Opstad
  1 sibling, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-08 17:38 UTC (permalink / raw)
  To: Eric Dumazet, Bendik Rønning Opstad
  Cc: David S. Miller, Netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

Eric, thank you for the feedback!

On Wed, Feb 3, 2016 at 8:34 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote:
>> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet <eric.dumazet@gmail.com>
wrote:
>>> Really this looks very complicated.
>>
>> Can you be more specific?
>
> A lot of code added, needing maintenance cost for years to come.

Yes, that is understandable.

>>> Why not simply append the new skb content to prior one ?
>>
>> It's not clear to me what you mean. At what stage in the output engine
>> do you refer to?
>>
>> We want to avoid modifying the data of the SKBs in the output queue,
>
> Why ? We already do that, as I pointed out.

I suspect that we might be talking past each other. It wasn't clear to
me that we were discussing how to implement this in a different way.

The current retrans collapse functionality only merges SKBs that
contain data that has already been sent and is about to be
retransmitted.

This differs significantly from RDB, which combines both already
transmitted data and unsent data in the same packet without changing
how the data is stored (and the state tracked) in the output queue.
Another difference is that RDB includes un-ACKed data that is not
considered lost.


>> therefore we allocate a new SKB (This SKB is named rdb_skb in the code).
>> The header and payload of the first SKB containing data we want to
>> redundantly transmit is then copied. Then the payload of the SKBs following
>> next in the output queue is appended onto the rdb_skb. The last payload
>> that is appended is from the first SKB with unsent data, i.e. the
>> sk_send_head.
>>
>> Would you suggest a different approach?
>>
>>> skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is
>>> really available (ie its clone not sitting/waiting in a qdisc on the
>>> host)
>>
>> Where do you suggest this should be used?
>
> To detect if appending data to prior skb is possible.

I see. As the implementation intentionally avoids modifying SKBs in
the output queue, this was not obvious.

> If the prior packet is still in qdisc, no change is allowed,
> and it is fine : RDB should not trigger anyway.

Actually, whether the data in the prior SKB is on the wire or is still
on the host (in qdisc/driver queue) is not relevant. RDB always wants
to redundantly resend the data if there is room in the packet, because
the previous packet may become lost.

>>> Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
>>> MAX_TCP_HEADER) available bytes in skb->data.
>>
>> Sure, rdb_build_skb() could use this instead of the calculated
>> bytes_in_rdb_skb.
>
> Point is : small packets already have tail room in skb->head

Yes, I'm aware of that. But we do not allocate new SKBs because we
think the existing SKBs do not have enough space available. We do it
to avoid modifications to the SKBs in the output queue.

> When RDB decides a packet should be merged into the prior one, you can
> simply copy payload into the tailroom, then free the skb.
>
> No skb allocations are needed, only freeing.

It wasn't clear to me that you suggest a completely different
implementation approach altogether.

As I understand you, the approach you suggest is as follows:

1. An SKB containing unsent data is processed for transmission (lets
   call it T_SKB)
2. Check if the previous SKB (lets call it P_SKB) (containing sent but
   un-ACKed data) has available (tail) room for the payload contained
   in T_SKB.
3. If room in P_SKB:
  * Copy the unsent data from T_SKB to P_SKB by appending it to the
    linear data and update sequence numbers.
  * Remove T_SKB (which contains only the new and unsent data) from
    the output queue.
  * Transmit P_SKB, which now contains some already sent data and some
    unsent data.


If I have misunderstood, can you please elaborate in detail what you
mean?

If this is the approach you suggest, I can think of some potential
downsides that require further consideration:


1) ACK-accounting will work differently

When the previous SKB (P_SKB) is modified by appending the data of the
next SKB (T_SKB), what should happen when an incoming ACK acknowledges
the data that was sent in the original transmission (before the SKB
was modified), but not the data that was appended later?
tcp_clean_rtx_queue currently handles partially ACKed SKBs due to TSO,
in which case the tcp_skb_pcount(skb) > 1. So this function would need
to be modified to handle this for RDB modified SKBs in the queue,
where all the data is located in the linear data buffer (no GSO segs).

How should SACK and retrans flags be handled when one SKB in the
output queue can represent multiple transmitted packets?


2) Timestamps and RTT measurements

How should RTT measurements work when you don't have a timestamp for
the data that was newly appended to the existing SKB containing sent
but un-ACKed data? Or should the skb->skb_mstamp be updated when the
SKB with newly appended data is sent again? That would make any RTT
measurements based on ACKs on the originally sent packet unusable.


3) Retransmit and lost SKB hints

Appending unsent data to SKBs with sent data will affect the usage of
tp->retransmit_skb_hint and tp->lost_skb_hint. As these variables
contain pointers to SKBs in the output queue, it is implied that all
the data in an SKB has the same state, such as retransmitted or lost.


4) RDB's loss accounting

RDB detects loss by looking at how many segments are ACKed. If an
incoming ACK acknowledges data in multiple SKBs, we can infer that
loss has occurred (ignoring the possibility of reordering). With the
approach you suggest, we lose the information about how many packets
we originally had, and how much of the payload was redundant
(considering SKBs are updated with new data and sent out again). We
would need additional variables in order to keep track of this.


5) Forced bundling on retransmissions

Since the SKBs in the output queue are modified to contain redundant
data, retransmissions of the SKBs will necessarily only contain the
redundant data unless the SKBs are modified before the retransmission.


6) Configuring how much is bundled becomes complex

When previous SKBs are to be used by appending the new data to be
sent, it is no longer possible to configure the amount of data to
bundle. We are forced to bundle all the data in the previous SKB.

Say we have 3 SKBs in the queue, with unsent segments 1, 2, 3:
[1] [2] [3]

Send 1:
[1] ->
Try to send 2, but first merge 2 with 1:
[1,2] [3]
Send merged SKB:
[1,2] ->

When we want to send segment 3, we are forced to bundle both 1 and 2.
Try to send 3, but first merge 3 with 1,2.
[1,2,3]
Send merged SKB:
[1,2,3] ->

Transmitting only 2,3 in a packet then becomes difficult without
additional logic for RDB record keeping.


> RDB could be implemented in a more concise way.

I'm open to suggestions for improvements. However, I can't see how the
suggested approach (as I've understood it) can be implemented without
making extensive modifications to the current TCP engine. Having one
SKB represent multiple packets, where each packet contains different
data and may be in a different state (retransmitted/lost), seems very
complex.

By avoiding any modifications to the output queue we ensure the
default code branch is completely unaffected, avoiding any special
handling in multiple locations in the codebase.


Regards,

Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v4 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (9 preceding siblings ...)
  (?)
@ 2016-02-16 13:51 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-16 13:51 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen


Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.
Latency-sensitive applications or services, such as online games and
remote desktop, produce traffic with thin-stream characteristics,
characterized by small packets and a relatively high ITT. By bundling
already sent data in packets with new data, RDB alleviates head-of-line
blocking by reducing the need to retransmit data segments when packets
are lost. RDB is a continuation of the work on latency improvements for
TCP in Linux, previously resulting in two thin-stream mechanisms in the
Linux kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows
significant latency reductions when packet loss occurs[1]. The tests
show that, by imposing restrictions on the bundling rate, it can be
made not to negatively affect competing traffic in an unfair manner.

Note: Current patch set depends on the patch "tcp: refactor struct tcp_skb_cb"
(http://patchwork.ozlabs.org/patch/510674)

These patches have also been tested with a set of packetdrill scripts
located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as
in the paper "Latency and Fairness Trade-Off for Thin Streams using
Redundant Data Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this
     function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into
     tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to
     tcp_rdb_max_packets after comment from Eric Dumazet about
     not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on
     feedback from Eric Dumazet.

 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB)
     to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)


Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23 ++++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  36 ++++++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  35 ++++++
 net/ipv4/tcp.c                         |  16 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  49 +++++---
 net/ipv4/tcp_rdb.c                     | 215 +++++++++++++++++++++++++++++++++
 12 files changed, 365 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v4 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (10 preceding siblings ...)
  (?)
@ 2016-02-16 13:51 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-16 13:51 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

The existing mechanism for detecting thin streams (tcp_stream_is_thin)
is based on a static limit of less than 4 packets in flight. This treats
streams differently depending on the connections RTT, such that a stream
on a high RTT link may never be considered thin, whereas the same
application would produce a stream that would always be thin in a low RTT
scenario (e.g. data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin stream
detection will be independent of the RTT and treat streams equally based
on the transmission pattern, i.e. the inter-transmission time (ITT).
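
For example (illustrative numbers), with the default lower bound ITT of
10000 usec the test PIF < RTT / ITT-lower-bound gives a limit of
100000 / 10000 = 10 packets in flight on a 100 ms RTT path, and
20000 / 10000 = 2 on a 20 ms path. A stream with an ITT of, say, 15 ms
(roughly 7 and 1 packets in flight, respectively) is then classified as
thin on both paths, while the static limit of 4 would classify it as thin
only in the low-RTT case.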

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/net/tcp.h                      | 21 +++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp.c                         |  2 ++
 4 files changed, 40 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 24ce97f..3b23ed8f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound inter-transmission time (ITT) threshold
+	for when a stream is considered thin. The value is specified in
+	microseconds, and may not be lower than 10000 (10 ms). Based on
+	this threshold, a dynamic packets in flight limit (DPIFL) is
+	calculated, which is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/net/tcp.h b/include/net/tcp.h
index bdd5e1c..a23ce24 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6       /* After 6 linear retries, do exp. backoff */
+/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000
 
 /* TCP initial congestion window as per rfc6928 */
 #define TCP_INIT_CWND		10
@@ -264,6 +266,7 @@ extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -1645,6 +1648,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
+ *                              limit
+ * @tp: the tcp_sock struct
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
+{
+	/* Calculate the maximum allowed PIF limit by dividing the RTT by
+	 * the minimum allowed inter-transmission time (ITT).
+	 * Tests if PIF < RTT / ITT-lower-bound
+	 */
+	return (u64) tcp_packets_in_flight(tp) *
+		sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3);
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b537338..e7b1b30 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -595,6 +596,14 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound",
+		.data		= &sysctl_tcp_thin_dpifl_itt_lower_bound,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 014f18e..0a2144a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -286,6 +286,8 @@ int sysctl_tcp_min_tso_segs __read_mostly = 2;
 
 int sysctl_tcp_autocorking __read_mostly = 1;
 
+int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (11 preceding siblings ...)
  (?)
@ 2016-02-16 13:51 ` Bendik Rønning Opstad
  2016-02-18 15:18   ` Eric Dumazet
  -1 siblings, 1 reply; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-16 13:51 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

RDB is a mechanism that enables a TCP sender to bundle redundant
(already sent) data with TCP packets containing new data. By bundling
(retransmitting) already sent data with each TCP packet containing new
data, the connection will be more resistant to sporadic packet loss,
which reduces the application layer latency significantly in congested
scenarios.

The main functionality added:

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to lack of dupACKs. This is because the loss is "repaired" by the
    redundant data in the packet coming after the lost packet. Based on
    incoming ACKs, such hidden loss events are detected, and CWR state
    is entered.

  o When packets are scheduled for transmission, RDB replaces the SKB to
    be sent with a modified SKB containing the redundant data of
    previously sent data segments from the TCP output queue.

  o RDB will only be used for streams classified as thin by the function
    tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT
    for streams that may benefit from RDB, controlled by the sysctl
    variable tcp_thin_dpifl_itt_lower_bound.

RDB is enabled on a connection with the socket option TCP_RDB, or on all
new connections by setting the sysctl net.ipv4.tcp_rdb=1
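
A minimal user-space example of enabling it per connection (sketch; fd is
assumed to be a TCP socket, TCP_RDB comes from the uapi header added by
this patch, error handling omitted):

	int one = 1;

	setsockopt(fd, IPPROTO_TCP, TCP_RDB, &one, sizeof(one));

or, for all new connections:

	sysctl -w net.ipv4.tcp_rdb=1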

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  15 +++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  15 +++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  26 ++++
 net/ipv4/tcp.c                         |  14 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  49 +++++---
 net/ipv4/tcp_rdb.c                     | 215 +++++++++++++++++++++++++++++++++
 12 files changed, 325 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 3b23ed8f..e3869e7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	calculated, which is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - BOOLEAN
+	Enable RDB for all new TCP connections.
+	Default: 0
+
+tcp_rdb_max_bytes - INTEGER
+	Enable restriction on how many bytes an RDB packet can contain.
+	This is the total amount of payload including the new unsent data.
+	Default: 0
+
+tcp_rdb_max_packets - INTEGER
+	Enable restriction on how many previous packets in the output queue
+	RDB may include data from. A value of 1 will restrict bundling to
+	only the data from the last packet that was sent.
+	Default: 1
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3920675..9075041 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2923,6 +2923,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index bcbf51d..c84de15 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -207,9 +207,10 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 3,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
+		rdb         : 1,/* Redundant Data Bundling enabled      */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a23ce24..18ff9ef 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -267,6 +267,9 @@ extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
+extern int sysctl_tcp_rdb;
+extern int sysctl_tcp_rdb_max_bytes;
+extern int sysctl_tcp_rdb_max_packets;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -539,6 +542,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 bool tcp_may_send_now(struct sock *sk);
 int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
 int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
@@ -556,6 +561,7 @@ void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
 bool tcp_schedule_loss_probe(struct sock *sk);
+void skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb);
 
 /* tcp_input.c */
 void tcp_resume_early_retransmit(struct sock *sk);
@@ -565,6 +571,11 @@ void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 void tcp_fin(struct sock *sk);
 
+/* tcp_rdb.c */
+void rdb_ack_event(struct sock *sk, u32 flags);
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask);
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
@@ -763,6 +774,7 @@ struct tcp_skb_cb {
 	union {
 		struct {
 			/* There is space for up to 20 bytes */
+			__u32 rdb_start_seq; /* Start seq of rdb data bundled */
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -1497,6 +1509,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)		\
+	skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+
 static inline struct sk_buff *tcp_send_head(const struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 65a77b0..ae0fba3 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -115,6 +115,7 @@ enum {
 #define TCP_CC_INFO		26	/* Get Congestion Control (optional) info */
 #define TCP_SAVE_SYN		27	/* Record SYN headers for new connections */
 #define TCP_SAVED_SYN		28	/* Get SYN headers recorded for connection */
+#define TCP_RDB			29	/* Enable RDB mechanism */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a5bd067..1f405a3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1053,7 +1053,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -1061,6 +1061,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(copy_skb_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 62c049b..3c55ba9 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tcp_rdb.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e7b1b30..5aa77e9 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -604,6 +604,32 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
 	},
 	{
+		.procname	= "tcp_rdb",
+		.data		= &sysctl_tcp_rdb,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_max_bytes",
+		.data		= &sysctl_tcp_rdb_max_bytes,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= &proc_dointvec,
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "tcp_rdb_max_packets",
+		.data		= &sysctl_tcp_rdb_max_packets,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0a2144a..c9d0424 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -288,6 +288,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
 
 int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
+int sysctl_tcp_rdb __read_mostly;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -407,6 +409,7 @@ void tcp_init_sock(struct sock *sk)
 	u64_stats_init(&tp->syncp);
 
 	tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
+	tp->rdb = sysctl_tcp_rdb;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
@@ -2412,6 +2415,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_RDB:
+		if (val < 0 || val > 1)
+			err = -EINVAL;
+		else
+			tp->rdb = val;
+		break;
+
 	case TCP_REPAIR:
 		if (!tcp_can_repair_sock(sk))
 			err = -EPERM;
@@ -2836,7 +2846,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		val = tp->thin_dupack;
 		break;
-
+	case TCP_RDB:
+		val = tp->rdb;
+		break;
 	case TCP_REPAIR:
 		val = tp->repair;
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5ee6fe0..02247cb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3534,6 +3534,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
 
 	if (icsk->icsk_ca_ops->in_ack_event)
 		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+
+	if (unlikely(tcp_sk(sk)->rdb))
+		rdb_ack_event(sk, flags);
 }
 
 /* Congestion control has updated the cwnd already. So if we're in
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7d2c7a4..5222ed7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -897,8 +897,8 @@ out:
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-			    gfp_t gfp_mask)
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet;
@@ -2110,9 +2110,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+		if (unlikely(tcp_sk(sk)->rdb)) {
+			if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
+				break;
+		} else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
 			break;
-
+		}
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.
@@ -2439,15 +2442,33 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
+/**
+ * skb_append_data() - copy data from an SKB to the end of another
+ *                     update end sequence number and checksum
+ * @from_skb: the SKB to copy data from
+ * @to_skb: the SKB to copy data to
+ */
+void skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
+				  from_skb->len);
+	/* Update sequence range on original skb. */
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+}
+EXPORT_SYMBOL(skb_append_data);
+
 /* Collapses two adjacent SKB's during retransmission. */
 static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
-	int skb_size, next_skb_size;
-
-	skb_size = skb->len;
-	next_skb_size = next_skb->len;
 
 	BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
 
@@ -2455,17 +2476,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	tcp_unlink_write_queue(next_skb, sk);
 
-	skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
-				  next_skb_size);
-
-	if (next_skb->ip_summed == CHECKSUM_PARTIAL)
-		skb->ip_summed = CHECKSUM_PARTIAL;
-
-	if (skb->ip_summed != CHECKSUM_PARTIAL)
-		skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
-
-	/* Update sequence range on original skb. */
-	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
+	skb_append_data(next_skb, skb);
 
 	/* Merge over control information. This moves PSH/FIN etc. over */
 	TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..edc789d
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,215 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+int sysctl_tcp_rdb_max_bytes __read_mostly;
+int sysctl_tcp_rdb_max_packets __read_mostly = 1;
+
+/**
+ * rdb_check_rtx_queue_loss() - perform loss detection by analysing ACKs.
+ * @sk: socket.
+ *
+ * Return: The number of packets that are presumed to be lost.
+ */
+static unsigned int rdb_check_rtx_queue_loss(struct sock *sk)
+{
+	struct sk_buff *skb, *tmp;
+	struct tcp_skb_cb *scb;
+	u32 seq_acked = tcp_sk(sk)->snd_una;
+	unsigned int packets_lost = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == tcp_send_head(sk))
+			break;
+
+		scb = TCP_SKB_CB(skb);
+		/* The ACK acknowledges parts of the data in this SKB.
+		 * Can be caused by:
+		 * - TSO: We abort as RDB is not used on SKBs split across
+		 *        multiple packets on lower layers as these are greater
+		 *        than one MSS.
+		 * - Retrans collapse: We've had a retrans, so loss has already
+		 *                     been detected.
+		 */
+		if (after(scb->end_seq, seq_acked)) {
+			break;
+		/* The ACKed packet */
+		} else if (scb->end_seq == seq_acked) {
+			/* This SKB was sent with no RDB data, or no prior
+			 * unacked SKBs in output queue, so break here.
+			 */
+			if (scb->tx.rdb_start_seq == scb->seq ||
+			    skb_queue_is_first(&sk->sk_write_queue, skb))
+				break;
+			/* Find number of prior SKBs who's data was bundled in
+			 * this (ACKed) SKB. We presume any redundant data
+			 * covering previous SKB's are due to loss. (An
+			 * exception would be reordering).
+			 */
+			skb = skb->prev;
+			tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+				if (!before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
+					packets_lost++;
+				else
+					break;
+			}
+			break;
+		}
+	}
+	return packets_lost;
+}
+
+/**
+ * rdb_ack_event() - initiate loss detection
+ * @sk: socket
+ * @flags: flags
+ */
+void rdb_ack_event(struct sock *sk, u32 flags)
+{
+	if (rdb_check_rtx_queue_loss(sk))
+		tcp_enter_cwr(sk);
+}
+
+/**
+ * rdb_build_skb() - build the new RDB SKB and copy all the data into the
+ *                   linear page buffer.
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission in the output engine
+ * @first_skb: the first SKB in the output queue to be bundled
+ * @bytes_in_rdb_skb: the total number of data bytes for the new rdb_skb
+ *                    (NEW + Redundant)
+ * @gfp_mask: gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory allocation
+ *         failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb = first_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, true);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		/* Copy data from tmp_skb to rdb_skb */
+		skb_append_data(tmp_skb, rdb_skb);
+
+		/* We are at the last skb that should be included (The unsent
+		 * one)
+		 */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_test() - test if redundant data can be bundled
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @mss_now: current mss value
+ * @bytes_in_rdb_skb: store the total number of payload bytes in the
+ *                    RDB SKB if bundling can be performed.
+ *
+ * Traverse the output queue and check if any un-acked data may be
+ * bundled.
+ *
+ * Return: The first SKB to be in the bundle, or NULL if no bundling
+ */
+static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
+					   struct sk_buff *xmit_skb,
+					   unsigned int mss_now,
+					   u32 *bytes_in_rdb_skb)
+{
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */
+	u32 total_payload = xmit_skb->len;
+
+	/* We start at xmit_skb->prev, and go backwards. */
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		if ((total_payload + skb->len) > mss_now)
+			break;
+
+		if (sysctl_tcp_rdb_max_bytes &&
+		    ((total_payload + skb->len) > sysctl_tcp_rdb_max_bytes))
+			break;
+
+		if (sysctl_tcp_rdb_max_packets &&
+		    (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets))
+			break;
+
+		total_payload += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = total_payload;
+	return first_to_bundle;
+}
+
+/**
+ * tcp_transmit_rdb_skb() - try to create and send an RDB packet
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @mss_now: current mss value
+ * @gfp_mask: gfp_t allocation
+ *
+ * If an RDB packet could not be created and sent, transmit the original
+ * xmit_skb.
+ *
+ * Return: 0 if successfully sent packet, else error
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb = NULL;
+	struct sk_buff *first_to_bundle;
+	u32 bytes_in_rdb_skb = 0;
+
+	/* How we detect that RDB was used. When equal, no RDB data was sent */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
+		goto xmit_default;
+
+	/* No bundling if first in queue, or on FIN packet */
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
+		(TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))
+		goto xmit_default;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
+					      &bytes_in_rdb_skb);
+	if (!first_to_bundle)
+		goto xmit_default;
+
+	/* Create an SKB that contains redundant data starting from
+	 * first_to_bundle.
+	 */
+	rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle,
+				bytes_in_rdb_skb, gfp_mask);
+	if (!rdb_skb)
+		goto xmit_default;
+
+	/* Set tstamp for SKB in output queue, because tcp_transmit_skb
+	 * will do this for the rdb_skb and not the SKB in the output
+	 * queue (xmit_skb).
+	 */
+	skb_mstamp_get(&xmit_skb->skb_mstamp);
+	rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+	return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-02-16 13:51 ` [PATCH v4 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
@ 2016-02-18 15:18   ` Eric Dumazet
  2016-02-19 14:12     ` Bendik Rønning Opstad
  0 siblings, 1 reply; 81+ messages in thread
From: Eric Dumazet @ 2016-02-18 15:18 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On mar., 2016-02-16 at 14:51 +0100, Bendik Rønning Opstad wrote:
> RDB is a mechanism that enables a TCP sender to bundle redundant
> (already sent) data with TCP packets containing new data. By bundling
> (retransmitting) already sent data with each TCP packet containing new
> data, the connection will be more resistant to sporadic packet loss
> which reduces the application layer latency significantly in congested
> scenarios.
>  
> -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>  {
>  	__copy_skb_header(new, old);
>  
> @@ -1061,6 +1061,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>  	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
>  	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
>  }
> +EXPORT_SYMBOL(copy_skb_header);

Why are you exporting this ? tcp is statically linked into vmlinux.


>  
> +/**
> + * skb_append_data() - copy data from an SKB to the end of another
> + *                     update end sequence number and checksum
> + * @from_skb: the SKB to copy data from
> + * @to_skb: the SKB to copy data to
> + */
> +void skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
> +{
> +	skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
> +				  from_skb->len);
> +	/* Update sequence range on original skb. */
> +	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
> +
> +	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
> +		to_skb->ip_summed = CHECKSUM_PARTIAL;
> +
> +	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
> +		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
> +					      to_skb->len);
> +}
> +EXPORT_SYMBOL(skb_append_data);

Same remark here.

And this is really a tcp helper, you should add a tcp_ prefix.

About rdb_build_skb() : I do not see where you make sure
@bytes_in_rdb_skb is not too big ?

tcp_rdb_max_bytes & tcp_rdb_max_packets seem to have no .extra2 upper
limit, so a user could do something really stupid and attempt to crash
the kernel.

Presumably I would use SKB_MAX_HEAD(MAX_TCP_HEADER) so that we do not
try high order page allocation.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-02-18 15:18   ` Eric Dumazet
@ 2016-02-19 14:12     ` Bendik Rønning Opstad
  0 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-19 14:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On 02/18/2016 04:18 PM, Eric Dumazet wrote:
>>  
>> -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>>  {
>>  	__copy_skb_header(new, old);
>>  
>> @@ -1061,6 +1061,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>>  	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
>>  	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
>>  }
>> +EXPORT_SYMBOL(copy_skb_header);
> 
> Why are you exporting this ? tcp is statically linked into vmlinux.

Ah, this is actually leftover from the earlier module based
implementation of RDB. Will remove.

>> +EXPORT_SYMBOL(skb_append_data);
> 
> Same remark here.

Will remove.

> And this is really a tcp helper, you should add a tcp_ prefix.

Certainly.

> About rdb_build_skb() : I do not see where you make sure
> @bytes_in_rdb_skb is not too big ?

The number of previous SKBs in the queue to copy data from is given
by rdb_can_bundle_test(), which ensures that the total payload does
not exceed the MSS. Only if there is room (within the MSS) will it
test the sysctl options to further restrict bundling:

+       /* We start at xmit_skb->prev, and go backwards. */
+       tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+               if ((total_payload + skb->len) > mss_now)
+                       break;
+
+               if (sysctl_tcp_rdb_max_bytes &&
+                   ((total_payload + skb->len) > sysctl_tcp_rdb_max_bytes))
+                       break;

I'll combine these two to (total_payload + skb->len) > max_payload
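
I.e. something like this (sketch of rdb_can_bundle_test() with the two
tests merged; the rest of the loop is unchanged from the patch):

	u32 max_payload = mss_now;

	if (sysctl_tcp_rdb_max_bytes)
		max_payload = min_t(u32, max_payload,
				    sysctl_tcp_rdb_max_bytes);

	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
		if (total_payload + skb->len > max_payload)
			break;

		if (sysctl_tcp_rdb_max_packets &&
		    skbs_in_bundle_count > sysctl_tcp_rdb_max_packets)
			break;

		total_payload += skb->len;
		skbs_in_bundle_count++;
		first_to_bundle = skb;
	}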

> tcp_rdb_max_bytes & tcp_rdb_max_packets seem to have no .extra2 upper
> limit, so a user could do something really stupid and attempt to crash
> the kernel.

Those sysctl additions are actually a bit buggy, specifically the
proc_handlers.

Is it not sufficient to ensure that 0 is the lowest possible value?
The max payload limit is really min(mss_now, sysctl_tcp_rdb_max_bytes),
so if sysctl_tcp_rdb_max_bytes or sysctl_tcp_rdb_max_packets are set too
large, bundling will simply be limited by the MSS.
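
If an explicit upper bound is still wanted, I suppose the entries could
simply get an .extra2 limit as well, e.g. (sketch; tcp_rdb_max_packets_max
would be a new static int next to the other limits in sysctl_net_ipv4.c):

	{
		.procname	= "tcp_rdb_max_packets",
		.data		= &sysctl_tcp_rdb_max_packets,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &tcp_rdb_max_packets_max,
	},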

> Presumably I would use SKB_MAX_HEAD(MAX_TCP_HEADER) so that we do not
> try high order page allocation.

Do you suggest something like this?:
bytes_in_rdb_skb = min_t(u32, bytes_in_rdb_skb, SKB_MAX_HEAD(MAX_TCP_HEADER));

Is this necessary when bytes_in_rdb_skb will always contain exactly
the required number of bytes for the payload of the (RDB) packet,
which will never be greater than mss_now?

Or is it aimed at scenarios where the page size is so small that
allocating room for an MSS (e.g. 1460 bytes) would require a
high-order page allocation?


Thanks for looking over the code!

Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v5 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (12 preceding siblings ...)
  (?)
@ 2016-02-24 21:12 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-24 21:12 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen


Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.
Latency-sensitive applications or services, such as online games and
remote desktop, produce traffic with thin-stream characteristics,
characterized by small packets and a relatively high ITT. By bundling
already sent data in packets with new data, RDB alleviates head-of-line
blocking by reducing the need to retransmit data segments when packets
are lost. RDB is a continuation of the work on latency improvements for
TCP in Linux, previously resulting in two thin-stream mechanisms in the
Linux kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows
significant latency reductions when packet loss occurs[1]. The tests
show that, by imposing restrictions on the bundling rate, it can be
made not to negatively affect competing traffic in an unfair manner.

Note: Current patch set depends on the patch "tcp: refactor struct tcp_skb_cb"
(http://patchwork.ozlabs.org/patch/510674)

These patches have also been tested with a set of packetdrill scripts
located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as
in the paper "Latency and Fairness Trade-Off for Thin Streams using
Redundant Data Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v5 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed two unnecessary EXPORT_SYMBOLs (Thanks Eric)
   * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
   * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
   * Merged the two if tests for max payload of RDB packet in
     rdb_can_bundle_test()
   * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss()
     and restructured to reduce indentation.
   * Improved docs
   * Revised commit message to be more detailed.

 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this
     function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into
     tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to
     tcp_rdb_max_packets after comment from Eric Dumazet about
     not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on
     feedback from Eric Dumazet.

 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB)
     to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)


Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23 ++++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  36 ++++++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  34 +++++
 net/ipv4/tcp.c                         |  16 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  47 ++++---
 net/ipv4/tcp_rdb.c                     | 225 +++++++++++++++++++++++++++++++++
 12 files changed, 371 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v5 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (13 preceding siblings ...)
  (?)
@ 2016-02-24 21:12 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-24 21:12 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

The existing mechanism for detecting thin streams,
tcp_stream_is_thin(), is based on a static limit of less than 4
packets in flight. This treats streams differently depending on the
connection's RTT, such that a stream on a high RTT link may never be
considered thin, whereas the same application would produce a stream
that would always be thin in a low RTT scenario (e.g. data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin
stream detection will be independent of the RTT and treat streams
equally based on the transmission pattern, i.e. the inter-transmission
time (ITT).
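
As a worked example of the resulting test (tcp_stream_is_thin_dpifl()
below compares PIF * itt_lower_bound against the smoothed RTT, i.e.
PIF < RTT / ITT-lower-bound): with a hypothetical smoothed RTT of
100 ms and the default tcp_thin_dpifl_itt_lower_bound of 10000 us
(10 ms), the stream is classified as thin while fewer than
100 / 10 = 10 packets are in flight; at an RTT of 200 ms the limit
would be 20. The limit thus scales with the RTT instead of being a
fixed packet count.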

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/net/tcp.h                      | 21 +++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp.c                         |  2 ++
 4 files changed, 40 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 24ce97f..3b23ed8f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound inter-transmission time (ITT) threshold
+	for when a stream is considered thin. The value is specified in
+	microseconds, and may not be lower than 10000 (10 ms). Based on
+	this threshold, a dynamic packets in flight limit (DPIFL) is
+	calculated, which is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 692db63..a29300a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6       /* After 6 linear retries, do exp. backoff */
+/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000
 
 /* TCP initial congestion window as per rfc6928 */
 #define TCP_INIT_CWND		10
@@ -264,6 +266,7 @@ extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -1645,6 +1648,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
+ *                              limit
+ * @tp: the tcp_sock struct
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
+{
+	/* Calculate the maximum allowed PIF limit by dividing the RTT by
+	 * the minimum allowed inter-transmission time (ITT).
+	 * Tests if PIF < RTT / ITT-lower-bound
+	 */
+	return (u64) tcp_packets_in_flight(tp) *
+		sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3);
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1e1fe60..f04320a 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -572,6 +573,14 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound",
+		.data		= &sysctl_tcp_thin_dpifl_itt_lower_bound,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f9faadb..8421f3d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -286,6 +286,8 @@ int sysctl_tcp_min_tso_segs __read_mostly = 2;
 
 int sysctl_tcp_autocorking __read_mostly = 1;
 
+int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v5 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (14 preceding siblings ...)
  (?)
@ 2016-02-24 21:12 ` Bendik Rønning Opstad
  2016-03-02 19:52   ` David Miller
  -1 siblings, 1 reply; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-02-24 21:12 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.

Latency-sensitive applications or services, such as online games,
remote control systems, and VoIP, produce traffic with thin-stream
characteristics, characterized by small packets and relatively high
inter-transmission times (ITT). When experiencing packet loss, such
latency-sensitive applications are heavily penalized by the need to
retransmit lost packets, which increases the latency by a minimum of
one RTT for the lost packet. Packets coming after a lost packet are
held back due to head-of-line blocking, causing increased delays for
all data segments until the lost packet has been retransmitted.

RDB enables a TCP sender to bundle redundant (already sent) data with
TCP packets containing small segments of new data. By resending
un-ACKed data from the output queue in packets with new data, RDB
reduces the need to retransmit data segments on connections
experiencing sporadic packet loss. By avoiding a retransmit, RDB
evades the latency increase of at least one RTT for the lost packet,
as well as alleviating head-of-line blocking for the packets following
the lost packet. This makes the TCP connection more resistant to
latency fluctuations, and reduces the application layer latency
significantly in lossy environments.

Main functionality added:

  o When a packet is scheduled for transmission, RDB builds and
    transmits a new SKB containing both the unsent data as well as
    data of previously sent packets from the TCP output queue.

  o RDB will only be used for streams classified as thin by the
    function tcp_stream_is_thin_dpifl(). This enforces a lower bound
    on the ITT for streams that may benefit from RDB, controlled by
    the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to lack of dupACKs. This is because the loss is "repaired" by the
    redundant data in the packet coming after the lost packet. Based on
    incoming ACKs, such hidden loss events are detected, and CWR state
    is entered.

RDB can be enabled on a connection with the socket option TCP_RDB, or
on all new connections by setting the sysctl variable
net.ipv4.tcp_rdb=1
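
For illustration, a minimal userspace sketch of enabling RDB on a
single socket (TCP_RDB is the socket option value added by this patch;
error handling is reduced to the return value):

	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>

	#ifndef TCP_RDB
	#define TCP_RDB 29	/* from the patched include/uapi/linux/tcp.h */
	#endif

	int enable_rdb(int sockfd)
	{
		int val = 1;

		/* Fails (e.g. ENOPROTOOPT) on kernels without this patch. */
		return setsockopt(sockfd, IPPROTO_TCP, TCP_RDB,
				  &val, sizeof(val));
	}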

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  15 +++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  15 +++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  25 ++++
 net/ipv4/tcp.c                         |  14 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  47 ++++---
 net/ipv4/tcp_rdb.c                     | 225 +++++++++++++++++++++++++++++++++
 12 files changed, 331 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 3b23ed8f..e3869e7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	calculated, which is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - BOOLEAN
+	Enable RDB for all new TCP connections.
+	Default: 0
+
+tcp_rdb_max_bytes - INTEGER
+	Enable restriction on how many bytes an RDB packet can contain.
+	This is the total amount of payload including the new unsent data.
+	Default: 0
+
+tcp_rdb_max_packets - INTEGER
+	Enable restriction on how many previous packets in the output queue
+	RDB may include data from. A value of 1 will restrict bundling to
+	only the data from the last packet that was sent.
+	Default: 1
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eab4f8f..af15a3c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2931,6 +2931,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index bcbf51d..c84de15 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -207,9 +207,10 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 3,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
+		rdb         : 1,/* Redundant Data Bundling enabled      */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a29300a..dadbcee 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -267,6 +267,9 @@ extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
+extern int sysctl_tcp_rdb;
+extern int sysctl_tcp_rdb_max_bytes;
+extern int sysctl_tcp_rdb_max_packets;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -539,6 +542,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 bool tcp_may_send_now(struct sock *sk);
 int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
 int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
@@ -556,6 +561,7 @@ void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
 bool tcp_schedule_loss_probe(struct sock *sk);
+void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb);
 
 /* tcp_input.c */
 void tcp_resume_early_retransmit(struct sock *sk);
@@ -565,6 +571,11 @@ void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 void tcp_fin(struct sock *sk);
 
+/* tcp_rdb.c */
+void rdb_ack_event(struct sock *sk, u32 flags);
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask);
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
@@ -763,6 +774,7 @@ struct tcp_skb_cb {
 	union {
 		struct {
 			/* There is space for up to 20 bytes */
+			__u32 rdb_start_seq; /* Start seq of rdb data bundled */
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -1497,6 +1509,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)		\
+	skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+
 static inline struct sk_buff *tcp_send_head(const struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index fe95446..b8f36d3 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -115,6 +115,7 @@ enum {
 #define TCP_CC_INFO		26	/* Get Congestion Control (optional) info */
 #define TCP_SAVE_SYN		27	/* Record SYN headers for new connections */
 #define TCP_SAVED_SYN		28	/* Get SYN headers recorded for connection */
+#define TCP_RDB			29	/* Enable RDB mechanism */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 488566b..fa66a67 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1055,7 +1055,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index bfa1336..459048c 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tcp_rdb.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index f04320a..43b4390 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -581,6 +581,31 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
 	},
 	{
+		.procname	= "tcp_rdb",
+		.data		= &sysctl_tcp_rdb,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_max_bytes",
+		.data		= &sysctl_tcp_rdb_max_bytes,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "tcp_rdb_max_packets",
+		.data		= &sysctl_tcp_rdb_max_packets,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8421f3d..b53d4cb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -288,6 +288,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
 
 int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
+int sysctl_tcp_rdb __read_mostly;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -407,6 +409,7 @@ void tcp_init_sock(struct sock *sk)
 	u64_stats_init(&tp->syncp);
 
 	tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
+	tp->rdb = sysctl_tcp_rdb;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
@@ -2412,6 +2415,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_RDB:
+		if (val < 0 || val > 1)
+			err = -EINVAL;
+		else
+			tp->rdb = val;
+		break;
+
 	case TCP_REPAIR:
 		if (!tcp_can_repair_sock(sk))
 			err = -EPERM;
@@ -2842,7 +2852,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		val = tp->thin_dupack;
 		break;
-
+	case TCP_RDB:
+		val = tp->rdb;
+		break;
 	case TCP_REPAIR:
 		val = tp->repair;
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f7..d38941c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3537,6 +3537,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
 
 	if (icsk->icsk_ca_ops->in_ack_event)
 		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+
+	if (unlikely(tcp_sk(sk)->rdb))
+		rdb_ack_event(sk, flags);
 }
 
 /* Congestion control has updated the cwnd already. So if we're in
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7d2c7a4..4f0335e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -897,8 +897,8 @@ out:
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-			    gfp_t gfp_mask)
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet;
@@ -2110,9 +2110,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+		if (unlikely(tcp_sk(sk)->rdb)) {
+			if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
+				break;
+		} else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
 			break;
-
+		}
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.
@@ -2439,15 +2442,31 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
+/**
+ * tcp_skb_append_data() - copy linear data from an SKB to the end of another
+ *                         and update end sequence number and checksum
+ * @from_skb: the SKB to copy data from
+ * @to_skb: the SKB to copy data to
+ */
+void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
+				  from_skb->len);
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+}
+
 /* Collapses two adjacent SKB's during retransmission. */
 static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
-	int skb_size, next_skb_size;
-
-	skb_size = skb->len;
-	next_skb_size = next_skb->len;
 
 	BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
 
@@ -2455,17 +2474,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	tcp_unlink_write_queue(next_skb, sk);
 
-	skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
-				  next_skb_size);
-
-	if (next_skb->ip_summed == CHECKSUM_PARTIAL)
-		skb->ip_summed = CHECKSUM_PARTIAL;
-
-	if (skb->ip_summed != CHECKSUM_PARTIAL)
-		skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
-
-	/* Update sequence range on original skb. */
-	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
+	tcp_skb_append_data(next_skb, skb);
 
 	/* Merge over control information. This moves PSH/FIN etc. over */
 	TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..4b240d1
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,225 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+int sysctl_tcp_rdb_max_bytes __read_mostly;
+int sysctl_tcp_rdb_max_packets __read_mostly = 1;
+
+/**
+ * rdb_detect_loss() - perform loss detection by analysing ACKs
+ * @sk: socket
+ *
+ * Traverse the output queue and check if the ACKed packet is an RDB packet and
+ * if the redundant data covers one or more un-ACKed SKBs. If the incoming ACK
+ * acknowledges multiple SKBs, we can presume packet loss has occurred.
+ *
+ * We can infer packet loss this way because we can expect one ACK per
+ * transmitted data packet, as delayed ACKs are disabled when a host receives
+ * packets where the sequence number is not the expected sequence number.
+ *
+ * Return: The number of packets that are presumed to be lost
+ */
+static unsigned int rdb_detect_loss(struct sock *sk)
+{
+	struct sk_buff *skb, *tmp;
+	struct tcp_skb_cb *scb;
+	u32 seq_acked = tcp_sk(sk)->snd_una;
+	unsigned int packets_lost = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == tcp_send_head(sk))
+			break;
+
+		scb = TCP_SKB_CB(skb);
+		/* The ACK acknowledges parts of the data in this SKB.
+		 * Can be caused by:
+		 * - TSO: We abort as RDB is not used on SKBs split across
+		 *        multiple packets on lower layers as these are greater
+		 *        than one MSS.
+		 * - Retrans collapse: We've had a retrans, so loss has already
+		 *                     been detected.
+		 */
+		if (after(scb->end_seq, seq_acked))
+			break;
+		else if (scb->end_seq != seq_acked)
+			continue;
+
+		/* We have found the ACKed packet */
+
+		/* This packet was sent with no redundant data, or no prior
+		 * un-ACKed SKBs is in the output queue, so break here.
+		 */
+		if (scb->tx.rdb_start_seq == scb->seq ||
+		    skb_queue_is_first(&sk->sk_write_queue, skb))
+			break;
+		/* Find number of prior SKBs whose data was bundled in this
+		 * (ACKed) SKB. We presume any redundant data covering previous
+		 * SKB's are due to loss. (An exception would be reordering).
+		 */
+		skb = skb->prev;
+		tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+			if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
+				break;
+			packets_lost++;
+		}
+		break;
+	}
+	return packets_lost;
+}
+
+/**
+ * rdb_ack_event() - initiate loss detection
+ * @sk: socket
+ * @flags: flags
+ */
+void rdb_ack_event(struct sock *sk, u32 flags)
+{
+	if (rdb_detect_loss(sk))
+		tcp_enter_cwr(sk);
+}
+
+/**
+ * rdb_build_skb() - build a new RDB SKB and copy redundant + unsent data to
+ *                   the linear page buffer
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission in the output engine
+ * @first_skb: the first SKB in the output queue to be bundled
+ * @bytes_in_rdb_skb: the total number of data bytes for the new rdb_skb
+ *                    (NEW + Redundant)
+ * @gfp_mask: gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory allocation
+ *         failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb = first_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, false);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	/* Start on first_skb and append payload from each SKB in the output
+	 * queue onto rdb_skb until we reach xmit_skb.
+	 */
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		tcp_skb_append_data(tmp_skb, rdb_skb);
+
+		/* We reached xmit_skb, containing the unsent data */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_test() - test if redundant data can be bundled
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @max_payload: the maximum allowed payload bytes for the RDB SKB
+ * @bytes_in_rdb_skb: store the total number of payload bytes in the
+ *                    RDB SKB if bundling can be performed
+ *
+ * Traverse the output queue and check if any un-acked data may be
+ * bundled.
+ *
+ * Return: The first SKB to be in the bundle, or NULL if no bundling
+ */
+static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
+					   struct sk_buff *xmit_skb,
+					   unsigned int max_payload,
+					   u32 *bytes_in_rdb_skb)
+{
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */
+	u32 total_payload = xmit_skb->len;
+
+	if (sysctl_tcp_rdb_max_bytes)
+		max_payload = min_t(unsigned int, max_payload,
+				    sysctl_tcp_rdb_max_bytes);
+
+	/* We start at xmit_skb->prev, and go backwards */
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		/* Including data from this SKB would exceed payload limit */
+		if ((total_payload + skb->len) > max_payload)
+			break;
+
+		if (sysctl_tcp_rdb_max_packets &&
+		    (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets))
+			break;
+
+		total_payload += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = total_payload;
+	return first_to_bundle;
+}
+
+/**
+ * tcp_transmit_rdb_skb() - try to create and send an RDB packet
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @mss_now: current mss value
+ * @gfp_mask: gfp_t allocation
+ *
+ * If an RDB packet could not be created and sent, transmit the
+ * original unmodified SKB (xmit_skb).
+ *
+ * Return: 0 if successfully sent packet, else error
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb = NULL;
+	struct sk_buff *first_to_bundle;
+	u32 bytes_in_rdb_skb = 0;
+
+	/* How we detect that RDB was used. When equal, no RDB data was sent */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
+		goto xmit_default;
+
+	/* No bundling if first in queue, or on FIN packet */
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
+	    (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))
+		goto xmit_default;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
+					      &bytes_in_rdb_skb);
+	if (!first_to_bundle)
+		goto xmit_default;
+
+	/* Create an SKB that contains redundant data starting from
+	 * first_to_bundle.
+	 */
+	rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle,
+				bytes_in_rdb_skb, gfp_mask);
+	if (!rdb_skb)
+		goto xmit_default;
+
+	/* Set skb_mstamp for the SKB in the output queue (xmit_skb) containing
+	 * the yet unsent data. Normally this would be done by
+	 * tcp_transmit_skb(), but as we pass in rdb_skb instead, xmit_skb's
+	 * timestamp will not be touched.
+	 */
+	skb_mstamp_get(&xmit_skb->skb_mstamp);
+	rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+	return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH v5 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-02-24 21:12 ` [PATCH v5 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
@ 2016-03-02 19:52   ` David Miller
  2016-03-02 22:33     ` Bendik Rønning Opstad
  0 siblings, 1 reply; 81+ messages in thread
From: David Miller @ 2016-03-02 19:52 UTC (permalink / raw)
  To: bro.devel
  Cc: netdev, ycheng, eric.dumazet, ncardwell, apetlund, griff, paalh,
	jonassm, kristian.evensen, kennetkl

From: "Bendik Rønning Opstad" <bro.devel@gmail.com>
Date: Wed, 24 Feb 2016 22:12:55 +0100

> +/* tcp_rdb.c */
> +void rdb_ack_event(struct sock *sk, u32 flags);

Please name globally visible symbols with a prefix of "tcp_*", thanks.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v5 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-02 19:52   ` David Miller
@ 2016-03-02 22:33     ` Bendik Rønning Opstad
  0 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-03-02 22:33 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ycheng, eric.dumazet, ncardwell, apetlund, griff, paalh,
	jonassm, kristian.evensen, kennetkl

On 03/02/2016 08:52 PM, David Miller wrote:
> From: "Bendik Rønning Opstad" <bro.devel@gmail.com>
> Date: Wed, 24 Feb 2016 22:12:55 +0100
> 
>> +/* tcp_rdb.c */
>> +void rdb_ack_event(struct sock *sk, u32 flags);
> 
> Please name globally visible symbols with a prefix of "tcp_*", thanks.

Yes, of course. I will fix that for the next version.

Thanks.

Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (15 preceding siblings ...)
  (?)
@ 2016-03-03 18:06 ` Bendik Rønning Opstad
  2016-03-07 19:36   ` David Miller
  2016-03-10  0:20   ` Yuchung Cheng
  -1 siblings, 2 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-03-03 18:06 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen


Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.
Latency-sensitive applications or services, such as online games and
remote desktop, produce traffic with thin-stream characteristics,
characterized by small packets and a relatively high ITT. By bundling
already sent data in packets with new data, RDB alleviates head-of-line
blocking by reducing the need to retransmit data segments when packets
are lost. RDB is a continuation of the work on latency improvements for
TCP in Linux, previously resulting in two thin-stream mechanisms in the
Linux kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows
significant latency reductions when packet loss occurs[1]. The tests
show that, by imposing restrictions on the bundling rate, it can be
kept from affecting competing traffic unfairly.

Note: Current patch set depends on the patch "tcp: refactor struct tcp_skb_cb"
(http://patchwork.ozlabs.org/patch/510674)

These patches have also been tested with a set of packetdrill scripts
located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as
in the paper "Latency and Fairness Trade-Off for Thin Streams using
Redundant Data Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v6 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Renamed rdb_ack_event() to tcp_rdb_ack_event() (Thanks DaveM)
   * Minor doc changes

 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Minor doc changes

v5 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed two unnecessary EXPORT_SYMBOLs (Thanks Eric)
   * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
   * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
   * Merged the two if tests for max payload of RDB packet in
     rdb_can_bundle_test()
   * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss()
     and restructured to reduce indentation.
   * Improved docs
   * Revised commit message to be more detailed.

 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this
     function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into
     tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to
     tcp_rdb_max_packets after comment from Eric Dumazet about
     not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on
     feedback from Eric Dumazet.

 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB)
     to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)


Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23 ++++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  36 ++++++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  34 +++++
 net/ipv4/tcp.c                         |  16 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  48 ++++---
 net/ipv4/tcp_rdb.c                     | 228 +++++++++++++++++++++++++++++++++
 12 files changed, 375 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v6 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (16 preceding siblings ...)
  (?)
@ 2016-03-03 18:06 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-03-03 18:06 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

The existing mechanism for detecting thin streams,
tcp_stream_is_thin(), is based on a static limit of less than 4
packets in flight. This treats streams differently depending on the
connection's RTT, such that a stream on a high RTT link may never be
considered thin, whereas the same application would produce a stream
that would always be thin in a low RTT scenario (e.g. data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin
stream detection will be independent of the RTT and treat streams
equally based on the transmission pattern, i.e. the inter-transmission
time (ITT).

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/net/tcp.h                      | 21 +++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp.c                         |  2 ++
 4 files changed, 40 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d5df40c..6a92b15 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound inter-transmission time (ITT) threshold
+	for when a stream is considered thin. The value is specified in
+	microseconds, and may not be lower than 10000 (10 ms). Based on
+	this threshold, a dynamic packets in flight limit (DPIFL) is
+	calculated, which is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 692db63..d38eae9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6       /* After 6 linear retries, do exp. backoff */
+/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000
 
 /* TCP initial congestion window as per rfc6928 */
 #define TCP_INIT_CWND		10
@@ -264,6 +266,7 @@ extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -1645,6 +1648,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Test if the stream is thin based on
+ *                              dynamic PIF limit (DPIFL)
+ * @tp: the tcp_sock struct
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
+{
+	/* Calculate the maximum allowed PIF limit by dividing the RTT by
+	 * the minimum allowed inter-transmission time (ITT).
+	 * Tests if PIF < RTT / ITT-lower-bound
+	 */
+	return (u64) tcp_packets_in_flight(tp) *
+		sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3);
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1e1fe60..f04320a 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -572,6 +573,14 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound",
+		.data		= &sysctl_tcp_thin_dpifl_itt_lower_bound,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f9faadb..8421f3d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -286,6 +286,8 @@ int sysctl_tcp_min_tso_segs __read_mostly = 2;
 
 int sysctl_tcp_autocorking __read_mostly = 1;
 
+int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (17 preceding siblings ...)
  (?)
@ 2016-03-03 18:06 ` Bendik Rønning Opstad
  2016-03-14 21:15   ` Eric Dumazet
  2016-03-14 21:54   ` Yuchung Cheng
  -1 siblings, 2 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-03-03 18:06 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.

Latency-sensitive applications or services, such as online games,
remote control systems, and VoIP, produce traffic with thin-stream
characteristics, characterized by small packets and relatively high
inter-transmission times (ITT). When experiencing packet loss, such
latency-sensitive applications are heavily penalized by the need to
retransmit lost packets, which increases the latency by a minimum of
one RTT for the lost packet. Packets coming after a lost packet are
held back due to head-of-line blocking, causing increased delays for
all data segments until the lost packet has been retransmitted.

RDB enables a TCP sender to bundle redundant (already sent) data with
TCP packets containing small segments of new data. By resending
un-ACKed data from the output queue in packets with new data, RDB
reduces the need to retransmit data segments on connections
experiencing sporadic packet loss. By avoiding a retransmit, RDB
evades the latency increase of at least one RTT for the lost packet,
as well as alleviating head-of-line blocking for the packets following
the lost packet. This makes the TCP connection more resistant to
latency fluctuations, and reduces the application layer latency
significantly in lossy environments.

Main functionality added:

  o When a packet is scheduled for transmission, RDB builds and
    transmits a new SKB containing both the unsent data as well as
    data of previously sent packets from the TCP output queue.

  o RDB will only be used for streams classified as thin by the
    function tcp_stream_is_thin_dpifl(). This enforces a lower bound
    on the ITT for streams that may benefit from RDB, controlled by
    the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to lack of dupACKs. This is because the loss is "repaired" by the
    redundant data in the packet coming after the lost packet. Based on
    incoming ACKs, such hidden loss events are detected, and CWR state
    is entered.

RDB can be enabled on a connection with the socket option TCP_RDB, or
on all new connections by setting the sysctl variable
net.ipv4.tcp_rdb=1
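
As a sketch of the system-wide alternative (assuming the usual procfs
path for the sysctl; the helper name and error handling are illustrative
only):

	#include <stdio.h>

	/* Enable RDB for all new connections via the net.ipv4.tcp_rdb
	 * sysctl added by this patch.
	 */
	int enable_rdb_for_new_connections(void)
	{
		FILE *f = fopen("/proc/sys/net/ipv4/tcp_rdb", "w");

		if (!f)
			return -1;
		fputs("1\n", f);
		return fclose(f);
	}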

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  15 +++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  15 +++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  25 ++++
 net/ipv4/tcp.c                         |  14 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  48 ++++---
 net/ipv4/tcp_rdb.c                     | 228 +++++++++++++++++++++++++++++++++
 12 files changed, 335 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 6a92b15..8f3f3bf 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	calculated, which is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - BOOLEAN
+	Enable RDB for all new TCP connections.
+	Default: 0
+
+tcp_rdb_max_bytes - INTEGER
+	Enable restriction on how many bytes an RDB packet can contain.
+	This is the total amount of payload including the new unsent data.
+	Default: 0
+
+tcp_rdb_max_packets - INTEGER
+	Enable restriction on how many previous packets in the output queue
+	RDB may include data from. A value of 1 will restrict bundling to
+	only the data from the last packet that was sent.
+	Default: 1
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 797cefb..0f2c9d1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2927,6 +2927,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index bcbf51d..c84de15 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -207,9 +207,10 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 3,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
+		rdb         : 1,/* Redundant Data Bundling enabled      */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d38eae9..2d42f4a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -267,6 +267,9 @@ extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
+extern int sysctl_tcp_rdb;
+extern int sysctl_tcp_rdb_max_bytes;
+extern int sysctl_tcp_rdb_max_packets;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -539,6 +542,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 bool tcp_may_send_now(struct sock *sk);
 int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
 int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
@@ -556,6 +561,7 @@ void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
 bool tcp_schedule_loss_probe(struct sock *sk);
+void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb);
 
 /* tcp_input.c */
 void tcp_resume_early_retransmit(struct sock *sk);
@@ -565,6 +571,11 @@ void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 void tcp_fin(struct sock *sk);
 
+/* tcp_rdb.c */
+void tcp_rdb_ack_event(struct sock *sk, u32 flags);
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask);
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
@@ -763,6 +774,7 @@ struct tcp_skb_cb {
 	union {
 		struct {
 			/* There is space for up to 20 bytes */
+			__u32 rdb_start_seq; /* Start seq of rdb data */
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -1497,6 +1509,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)		\
+	skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+
 static inline struct sk_buff *tcp_send_head(const struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index fe95446..6799875 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -115,6 +115,7 @@ enum {
 #define TCP_CC_INFO		26	/* Get Congestion Control (optional) info */
 #define TCP_SAVE_SYN		27	/* Record SYN headers for new connections */
 #define TCP_SAVED_SYN		28	/* Get SYN headers recorded for connection */
+#define TCP_RDB			29	/* Enable Redundant Data Bundling mechanism */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7af7ec6..50bc5b0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1055,7 +1055,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index bfa1336..459048c 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tcp_rdb.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index f04320a..43b4390 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -581,6 +581,31 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
 	},
 	{
+		.procname	= "tcp_rdb",
+		.data		= &sysctl_tcp_rdb,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_max_bytes",
+		.data		= &sysctl_tcp_rdb_max_bytes,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "tcp_rdb_max_packets",
+		.data		= &sysctl_tcp_rdb_max_packets,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8421f3d..b53d4cb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -288,6 +288,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
 
 int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
+int sysctl_tcp_rdb __read_mostly;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -407,6 +409,7 @@ void tcp_init_sock(struct sock *sk)
 	u64_stats_init(&tp->syncp);
 
 	tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
+	tp->rdb = sysctl_tcp_rdb;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
@@ -2412,6 +2415,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_RDB:
+		if (val < 0 || val > 1)
+			err = -EINVAL;
+		else
+			tp->rdb = val;
+		break;
+
 	case TCP_REPAIR:
 		if (!tcp_can_repair_sock(sk))
 			err = -EPERM;
@@ -2842,7 +2852,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		val = tp->thin_dupack;
 		break;
-
+	case TCP_RDB:
+		val = tp->rdb;
+		break;
 	case TCP_REPAIR:
 		val = tp->repair;
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f7..7b52ce4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3537,6 +3537,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
 
 	if (icsk->icsk_ca_ops->in_ack_event)
 		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+
+	if (unlikely(tcp_sk(sk)->rdb))
+		tcp_rdb_ack_event(sk, flags);
 }
 
 /* Congestion control has updated the cwnd already. So if we're in
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7d2c7a4..6f92fae 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -897,8 +897,8 @@ out:
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-			    gfp_t gfp_mask)
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet;
@@ -2110,9 +2110,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+		if (unlikely(tcp_sk(sk)->rdb)) {
+			if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
+				break;
+		} else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
 			break;
-
+		}
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.
@@ -2439,15 +2442,32 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
+/**
+ * tcp_skb_append_data() - copy the linear data from an SKB to the end
+ *                         of another and update end sequence number
+ *                         and checksum
+ * @from_skb: the SKB to copy data from
+ * @to_skb: the SKB to copy data to
+ */
+void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
+				  from_skb->len);
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+}
+
 /* Collapses two adjacent SKB's during retransmission. */
 static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
-	int skb_size, next_skb_size;
-
-	skb_size = skb->len;
-	next_skb_size = next_skb->len;
 
 	BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
 
@@ -2455,17 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	tcp_unlink_write_queue(next_skb, sk);
 
-	skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
-				  next_skb_size);
-
-	if (next_skb->ip_summed == CHECKSUM_PARTIAL)
-		skb->ip_summed = CHECKSUM_PARTIAL;
-
-	if (skb->ip_summed != CHECKSUM_PARTIAL)
-		skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
-
-	/* Update sequence range on original skb. */
-	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
+	tcp_skb_append_data(next_skb, skb);
 
 	/* Merge over control information. This moves PSH/FIN etc. over */
 	TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..2b37957
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,228 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+int sysctl_tcp_rdb_max_bytes __read_mostly;
+int sysctl_tcp_rdb_max_packets __read_mostly = 1;
+
+/**
+ * rdb_detect_loss() - perform RDB loss detection by analysing ACKs
+ * @sk: socket
+ *
+ * Traverse the output queue and check if the ACKed packet is an RDB
+ * packet and if the redundant data covers one or more un-ACKed SKBs.
+ * If the incoming ACK acknowledges multiple SKBs, we can presume
+ * packet loss has occurred.
+ *
+ * We can infer packet loss this way because we can expect one ACK per
+ * transmitted data packet, as delayed ACKs are disabled when a host
+ * receives packets where the sequence number is not the expected
+ * sequence number.
+ *
+ * Return: The number of packets that are presumed to be lost
+ */
+static unsigned int rdb_detect_loss(struct sock *sk)
+{
+	struct sk_buff *skb, *tmp;
+	struct tcp_skb_cb *scb;
+	u32 seq_acked = tcp_sk(sk)->snd_una;
+	unsigned int packets_lost = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == tcp_send_head(sk))
+			break;
+
+		scb = TCP_SKB_CB(skb);
+		/* The ACK acknowledges parts of the data in this SKB.
+		 * Can be caused by:
+		 * - TSO: We abort as RDB is not used on SKBs split across
+		 *        multiple packets on lower layers as these are greater
+		 *        than one MSS.
+		 * - Retrans collapse: We've had a retrans, so loss has already
+		 *                     been detected.
+		 */
+		if (after(scb->end_seq, seq_acked))
+			break;
+		else if (scb->end_seq != seq_acked)
+			continue;
+
+		/* We have found the ACKed packet */
+
+		/* This packet was sent with no redundant data, or no prior
+		 * un-ACKed SKBs are in the output queue, so break here.
+		 */
+		if (scb->tx.rdb_start_seq == scb->seq ||
+		    skb_queue_is_first(&sk->sk_write_queue, skb))
+			break;
+		/* Find number of prior SKBs whose data was bundled in this
+		 * (ACKed) SKB. We presume any redundant data covering previous
+		 * SKBs is due to loss. (An exception would be reordering).
+		 */
+		skb = skb->prev;
+		tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+			if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
+				break;
+			packets_lost++;
+		}
+		break;
+	}
+	return packets_lost;
+}
+
+/**
+ * tcp_rdb_ack_event() - initiate RDB loss detection
+ * @sk: socket
+ * @flags: flags
+ */
+void tcp_rdb_ack_event(struct sock *sk, u32 flags)
+{
+	if (rdb_detect_loss(sk))
+		tcp_enter_cwr(sk);
+}
+
+/**
+ * rdb_build_skb() - build a new RDB SKB and copy redundant + unsent
+ *                   data to the linear page buffer
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission in the output engine
+ * @first_skb: the first SKB in the output queue to be bundled
+ * @bytes_in_rdb_skb: the total number of data bytes for the new
+ *                    rdb_skb (NEW + Redundant)
+ * @gfp_mask: gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory
+ *         allocation failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb = first_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, false);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	/* Start on first_skb and append payload from each SKB in the output
+	 * queue onto rdb_skb until we reach xmit_skb.
+	 */
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		tcp_skb_append_data(tmp_skb, rdb_skb);
+
+		/* We reached xmit_skb, containing the unsent data */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_test() - test if redundant data can be bundled
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @max_payload: the maximum allowed payload bytes for the RDB SKB
+ * @bytes_in_rdb_skb: store the total number of payload bytes in the
+ *                    RDB SKB if bundling can be performed
+ *
+ * Traverse the output queue and check if any un-acked data may be
+ * bundled.
+ *
+ * Return: The first SKB to be in the bundle, or NULL if no bundling
+ */
+static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
+					   struct sk_buff *xmit_skb,
+					   unsigned int max_payload,
+					   u32 *bytes_in_rdb_skb)
+{
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */
+	u32 total_payload = xmit_skb->len;
+
+	if (sysctl_tcp_rdb_max_bytes)
+		max_payload = min_t(unsigned int, max_payload,
+				    sysctl_tcp_rdb_max_bytes);
+
+	/* We start at xmit_skb->prev, and go backwards */
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		/* Including data from this SKB would exceed payload limit */
+		if ((total_payload + skb->len) > max_payload)
+			break;
+
+		if (sysctl_tcp_rdb_max_packets &&
+		    (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets))
+			break;
+
+		total_payload += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = total_payload;
+	return first_to_bundle;
+}
+
+/**
+ * tcp_transmit_rdb_skb() - try to create and send an RDB packet
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @mss_now: current mss value
+ * @gfp_mask: gfp_t allocation
+ *
+ * If an RDB packet could not be created and sent, transmit the
+ * original unmodified SKB (xmit_skb).
+ *
+ * Return: 0 if successfully sent packet, else error from
+ *         tcp_transmit_skb
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb = NULL;
+	struct sk_buff *first_to_bundle;
+	u32 bytes_in_rdb_skb = 0;
+
+	/* How we detect that RDB was used. When equal, no RDB data was sent */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
+		goto xmit_default;
+
+	/* No bundling if first in queue, or on FIN packet */
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
+	    (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))
+		goto xmit_default;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
+					      &bytes_in_rdb_skb);
+	if (!first_to_bundle)
+		goto xmit_default;
+
+	/* Create an SKB that contains redundant data starting from
+	 * first_to_bundle.
+	 */
+	rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle,
+				bytes_in_rdb_skb, gfp_mask);
+	if (!rdb_skb)
+		goto xmit_default;
+
+	/* Set skb_mstamp for the SKB in the output queue (xmit_skb) containing
+	 * the yet unsent data. Normally this would be done by
+	 * tcp_transmit_skb(), but as we pass in rdb_skb instead, xmit_skb's
+	 * timestamp will not be touched.
+	 */
+	skb_mstamp_get(&xmit_skb->skb_mstamp);
+	rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+	return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2016-03-03 18:06 ` [PATCH v6 net-next 0/2] tcp: " Bendik Rønning Opstad
@ 2016-03-07 19:36   ` David Miller
  2016-03-10  0:20   ` Yuchung Cheng
  1 sibling, 0 replies; 81+ messages in thread
From: David Miller @ 2016-03-07 19:36 UTC (permalink / raw)
  To: bro.devel
  Cc: netdev, ycheng, eric.dumazet, ncardwell, apetlund, griff, paalh,
	jonassm, kristian.evensen, kennetkl

From: "Bendik Rønning Opstad" <bro.devel@gmail.com>
Date: Thu,  3 Mar 2016 19:06:26 +0100

> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at
> reducing the latency for applications sending time-dependent data.
 ...

Can some TCP experts please review these patches?

Thanks.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2016-03-03 18:06 ` [PATCH v6 net-next 0/2] tcp: " Bendik Rønning Opstad
  2016-03-07 19:36   ` David Miller
@ 2016-03-10  0:20   ` Yuchung Cheng
  2016-03-10  1:45     ` Jonas Markussen
  2016-03-13 23:18     ` Bendik Rønning Opstad
  1 sibling, 2 replies; 81+ messages in thread
From: Yuchung Cheng @ 2016-03-10  0:20 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Eric Dumazet, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
> Latency-sensitive applications or services, such as online games and
> remote desktop, produce traffic with thin-stream characteristics,
> characterized by small packets and a relatively high ITT. By bundling
> already sent data in packets with new data, RDB alleviates head-of-line
> blocking by reducing the need to retransmit data segments when packets
> are lost. RDB is a continuation on the work on latency improvements for
> TCP in Linux, previously resulting in two thin-stream mechanisms in the
> Linux kernel
> (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).
>
> The RDB implementation has been thoroughly tested, and shows
> significant latency reductions when packet loss occurs[1]. The tests
> show that, by imposing restrictions on the bundling rate, it can be
> made not to negatively affect competing traffic in an unfair manner.
>
> Note: Current patch set depends on the patch "tcp: refactor struct tcp_skb_cb"
> (http://patchwork.ozlabs.org/patch/510674)
>
> These patches have also been tested with a set of packetdrill scripts
> located at
> https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
> (The tests require patching packetdrill with a new socket option:
> https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)
>
> Detailed info about the RDB mechanism can be found at
> http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as
> in the paper "Latency and Fairness Trade-Off for Thin Streams using
> Redundant Data Bundling in TCP"[2].
>
> [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
> [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf
I read the paper. I think the underlying idea is neat, but the
implementation is a little heavy-weight in that it requires changes on
the fast path (tcp_write_xmit) and space in the skb control block.
Ultimately this patch is meant for a small set of specific applications.

In my mental model (please correct me if I am wrong), losses on these
thin streams would mostly resort to RTOs instead of fast recovery, due
to the bursty nature of Internet losses. The HOLB comes from the RTO
retransmitting only the first (tiny) unacked packet while a small
amount of new data is readily available. But since Linux congestion
control is packet-based, and the cwnd after a loss is 1, the new data
needs to wait until the 1st packet is acked, which takes another RTT.

Instead what if we only perform RDB on the (first and recurring) RTO
retransmission?

PS. I don't understand how (old) RDB can masquerade the losses by
skipping DUPACKs. Perhaps an example helps. Suppose we send 4 packets
and the last 3 were (s)acked. We perform RDB to send a packet that has
previous 4 payloads + 1 new byte. The sender still gets the loss
information?

>
> Changes:
>
> v6 (PATCH):
>  * tcp-Add-Redundant-Data-Bundling-RDB:
>    * Renamed rdb_ack_event() to tcp_rdb_ack_event() (Thanks DaveM)
>    * Minor doc changes
>
>  * tcp-Add-DPIFL-thin-stream-detection-mechanism:
>    * Minor doc changes
>
> v5 (PATCH):
>  * tcp-Add-Redundant-Data-Bundling-RDB:
>    * Removed two unnecessary EXPORT_SYMOBOLs (Thanks Eric)
>    * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
>    * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
>    * Merged the two if tests for max payload of RDB packet in
>      rdb_can_bundle_test()
>    * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss()
>      and restructured to reduce indentation.
>    * Improved docs
>    * Revised commit message to be more detailed.
>
>  * tcp-Add-DPIFL-thin-stream-detection-mechanism:
>    * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)
>
> v4 (PATCH):
>  * tcp-Add-Redundant-Data-Bundling-RDB:
>    * Moved skb_append_data() to tcp_output.c and call this
>      function from tcp_collapse_retrans() as well.
>    * Merged functionality of create_rdb_skb() into
>      tcp_transmit_rdb_skb()
>    * Removed one parameter from rdb_can_bundle_test()
>
> v3 (PATCH):
>  * tcp-Add-Redundant-Data-Bundling-RDB:
>    * Changed name of sysctl variable from tcp_rdb_max_skbs to
>      tcp_rdb_max_packets after comment from Eric Dumazet about
>      not exposing internal (kernel) names like skb.
>    * Formatting and function docs fixes
>
> v2 (RFC/PATCH):
>  * tcp-Add-DPIFL-thin-stream-detection-mechanism:
>    * Change calculation in tcp_stream_is_thin_dpifl based on
>      feedback from Eric Dumazet.
>
>  * tcp-Add-Redundant-Data-Bundling-RDB:
>    * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB)
>      to reduce complexity as commented by Neal Cardwell.
>    * Cleaned up loss detection code in rdb_check_rtx_queue_loss
>
> v1 (RFC/PATCH)
>
>
> Bendik Rønning Opstad (2):
>   tcp: Add DPIFL thin stream detection mechanism
>   tcp: Add Redundant Data Bundling (RDB)
>
>  Documentation/networking/ip-sysctl.txt |  23 ++++
>  include/linux/skbuff.h                 |   1 +
>  include/linux/tcp.h                    |   3 +-
>  include/net/tcp.h                      |  36 ++++++
>  include/uapi/linux/tcp.h               |   1 +
>  net/core/skbuff.c                      |   2 +-
>  net/ipv4/Makefile                      |   3 +-
>  net/ipv4/sysctl_net_ipv4.c             |  34 +++++
>  net/ipv4/tcp.c                         |  16 ++-
>  net/ipv4/tcp_input.c                   |   3 +
>  net/ipv4/tcp_output.c                  |  48 ++++---
>  net/ipv4/tcp_rdb.c                     | 228 +++++++++++++++++++++++++++++++++
>  12 files changed, 375 insertions(+), 23 deletions(-)
>  create mode 100644 net/ipv4/tcp_rdb.c
>
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2016-03-10  0:20   ` Yuchung Cheng
@ 2016-03-10  1:45     ` Jonas Markussen
  2016-03-10  2:27       ` Yuchung Cheng
  2016-03-13 23:18     ` Bendik Rønning Opstad
  1 sibling, 1 reply; 81+ messages in thread
From: Jonas Markussen @ 2016-03-10  1:45 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Bendik Rønning Opstad, David S. Miller, netdev,
	Eric Dumazet, Neal Cardwell, Andreas Petlund, Carsten Griwodz,
	Pål Halvorsen, Kristian Evensen, Kenneth Klette Jonassen


> On 10 Mar 2016, at 01:20, Yuchung Cheng <ycheng@google.com> wrote:
> 
> PS. I don't understand how (old) RDB can masquerade the losses by
> skipping DUPACKs. Perhaps an example helps. Suppose we send 4 packets
> and the last 3 were (s)acked. We perform RDB to send a packet that has
> previous 4 payloads + 1 new byte. The sender still gets the loss
> information?
> 

If I’ve understood you correctly, you’re talking about sending 4
packets where the first one is lost?

In this case, RDB will not bundle only on the last/new packet; it also
bundles as it sends packet 2 (which will contain 1+2), packet 3 (1+2+3)
and packet 4 (1+2+3+4).

So the fact that packet 1 was lost is masqueraded when it is 
recovered by packet 2 and there won’t be any gap in the SACK window
indicating that packet 1 was lost.
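
For example, with hypothetical 100-byte writes and full bundling:
packet 1 carries bytes 1-100 and is lost, packet 2 carries 1-200,
packet 3 carries 1-300 and packet 4 carries 1-400. Packet 2 already
delivers everything up to byte 200 in order, so the cumulative ACK
moves past the lost data and the receiver never reports a gap in its
SACK blocks.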

Best regards,
Jonas Markussen

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2016-03-10  1:45     ` Jonas Markussen
@ 2016-03-10  2:27       ` Yuchung Cheng
  2016-03-12  9:23         ` Jonas Markussen
  0 siblings, 1 reply; 81+ messages in thread
From: Yuchung Cheng @ 2016-03-10  2:27 UTC (permalink / raw)
  To: Jonas Markussen
  Cc: Bendik Rønning Opstad, David S. Miller, netdev,
	Eric Dumazet, Neal Cardwell, Andreas Petlund, Carsten Griwodz,
	Pål Halvorsen, Kristian Evensen, Kenneth Klette Jonassen

On Wed, Mar 9, 2016 at 5:45 PM, Jonas Markussen <jonassm@ifi.uio.no> wrote:
>
>> On 10 Mar 2016, at 01:20, Yuchung Cheng <ycheng@google.com> wrote:
>>
>> PS. I don't understand how (old) RDB can masquerade the losses by
>> skipping DUPACKs. Perhaps an example helps. Suppose we send 4 packets
>> and the last 3 were (s)acked. We perform RDB to send a packet that has
>> previous 4 payloads + 1 new byte. The sender still gets the loss
>> information?
>>
>
> If I’ve understood you correctly, you’re talking about sending 4
> packets and the first one is lost?
>
> In this case, RDB will not only bundle on the last/new packet but also
> as it sends packet 2 (which will contain 1+2), packet 3 (1+2+3)
> and packet 4 (1+2+3+4).
>
> So the fact that packet 1 was lost is masqueraded when it is
> recovered by packet 2 and there won’t be any gap in the SACK window
> indicating that packet 1 was lost.
I see. Thanks for the clarification.

So my question is still whether a thin-stream app has enough data in
flight to use ACK-triggered recovery, i.e., it has to send at least
twice within an RTT.

Also have you tested this with non-Linux receivers? Thanks.

>
> Best regards,
> Jonas Markussen

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2016-03-10  2:27       ` Yuchung Cheng
@ 2016-03-12  9:23         ` Jonas Markussen
  0 siblings, 0 replies; 81+ messages in thread
From: Jonas Markussen @ 2016-03-12  9:23 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Bendik Rønning Opstad, David S. Miller, netdev,
	Eric Dumazet, Neal Cardwell, Andreas Petlund, Carsten Griwodz,
	Pål Halvorsen, Kristian Evensen, Kenneth Klette Jonassen


> On 10 Mar 2016, at 03:27, Yuchung Cheng <ycheng@google.com> wrote:
> 
> So my question is still whether a thin-stream app has enough data in
> flight to use ACK-triggered recovery, i.e., it has to send at least
> twice within an RTT.
> 

I see. The thin-stream app must send twice before an RTO in order to
use ACK-triggered recovery. My understanding is that the RTO timer in
many cases can be many times the RTT, e.g., for low-RTT networks where
TCP streams fall back to the default minimum RTO value (200 ms).
Of course, the advantage of RDB is greater when the RTT is high.

The benefit of RDB over other mechanisms that improve how quickly
thin streams are able to discover and recover from loss, such as the
tcp_thin_dupack and tcp_early_retrans sysctls, is that the sender
using RDB will already have recovered the lost packet by the time a
regular TCP connection detects the packet loss (from DUPACKs) and
reacts accordingly. This reduces the recovery time by at least one
RTT since it avoids the retransmission altogether.

Another mechanism with an impact here is delayed ACKs, which may
also affect how long it takes before (S)ACKs arrive at the sender.
My understanding is that delayed ACKs are disabled when the incoming
packet’s sequence number is not the expected one, as is the case for
RDB packets where old and new data are combined.
This improves the ACK feedback for thin streams using RDB.

> Also have you tested this with non-Linux receivers? Thanks.
> 

We have tested the current version with FreeBSD v10 and Windows 10.
The old version was tested with Windows, FreeBSD, OS X and Linux
back in 2010.

We argue that any TCP implementation complying with the RFCs must
be able to handle segments combining old and new data in the
same way it handles TCP repacketization on retransmissions
(tcp_retrans_collapse).

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2016-03-10  0:20   ` Yuchung Cheng
  2016-03-10  1:45     ` Jonas Markussen
@ 2016-03-13 23:18     ` Bendik Rønning Opstad
  2016-03-14 21:59       ` Yuchung Cheng
  1 sibling, 1 reply; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-03-13 23:18 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David S. Miller, netdev, Eric Dumazet, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On 03/10/2016 01:20 AM, Yuchung Cheng wrote:
> I read the paper. I think the underlying idea is neat, but the
> implementation is a little heavy-weight in that it requires changes on
> the fast path (tcp_write_xmit) and space in the skb control block.

Yuchung, thank you for taking the time to review the patch submission
and read the paper.

I must admit I was not particularly happy about the extra if-test on the
fast path, and I fully understand the wish to keep the fast path as
simple and clean as possible.
However, is the performance hit that significant considering the branch
prediction hint for the non-RDB path?

The extra variable needed in the SKB CB does not require increasing the
CB buffer size, thanks to the "tcp: refactor struct tcp_skb_cb" patch
(http://patchwork.ozlabs.org/patch/510674), and it uses only some of the
space made available in the outgoing SKBs' CB. Therefore I hoped the
extra variable would be acceptable.

> Ultimately this
> patch is meant for a small set of specific applications.

Yes, the RDB mechanism is aimed at a limited set of applications,
specifically time-dependent applications that produce non-greedy,
application-limited (thin) flows. However, our hope is that RDB may
greatly improve TCP's position as a viable alternative for applications
transmitting latency-sensitive data.

> In my mental model (please correct me if I am wrong), losses on these
> thin streams would mostly resort to RTOs instead of fast recovery, due
> to the bursty nature of Internet losses.

This depends on the transmission pattern of the applications, which
varies a great deal, also between the different types of
time-dependent applications that produce thin streams. For short flows,
(bursty) loss at the end will result in an RTO (if TLP does not probe),
but the thin streams are often long lived, and the applications
producing them continue to write small data segments to the socket at
intervals of tens to hundreds of milliseconds.

What controls whether an RTO and not a fast retransmit will resend the
packet is the number of PIFs (packets in flight), which directly
correlates with how often the application writes data to the socket in
relation to the RTT. As long as
the number of packets successfully completing a round trip before the
RTO is >= the dupACK threshold, they will not depend on RTOs (not
considering TLP). Early retransmit and the TCP_THIN_DUPACK socket option
will also affect the likelihood of RTOs vs fast retransmits.

> The HOLB comes from the RTO
> retransmitting only the first (tiny) unacked packet while a small
> amount of new data is readily available. But since Linux congestion
> control is packet-based, and the cwnd after a loss is 1, the new data
> needs to wait until the 1st packet is acked, which takes another RTT.

If I understand you correctly, you are referring to HOLB on the sender
side, which is the extra delay on new data that is held back when the
connection is CWND-limited. In the paper, we refer to this extra delay
as increased sojourn times for the outgoing data segments.

We do not include this additional sojourn time for the segments on the
sender side in the ACK Latency plots (Fig. 4 in the paper). This is
simply because the pcap traces contain the timestamps when the packets
are sent, and not when the segments are added to the output queue.

When we refer to the HOLB effect in the paper as well as the thesis, we
refer to the extra delays (sojourn times) on the receiver side where
segments are held back (not made available to user space) due to gaps in
the sequence range when packets are lost (we had no reordering).

So, when considering the increased delays due to HOLB on the receiver
side, HOLB is not at all limited to RTOs. Actually, it's mostly not due
to RTOs in the tests we've run; however, this also depends very much on
the transmission pattern of the application as well as the loss levels.
In general, HOLB on the receiver side will affect any flow that
transmits a packet with new data after a packet is lost (sender may not
know yet), where the lost packet has not already been retransmitted.

Consider a sender application that performs write calls every 30 ms on a
150 ms RTT link. It will need a CWND that allows 5-6 PIFs to be able to
transmit all new data segments with no extra sojourn times on the sender
side.
When one packet is lost, the next 5 packets that are sent will be held
back on the receiver side due to the missing segment (HOLB). In the best
case scenario, the first dupACK triggers a fast retransmit around the
same time as the fifth packet (after the lost packet) is sent. In that
case, the first segment sent after the lost segment is held back on the
receiver for 150 ms (the time it takes for the dupACK to reach the
sender, and the fast retrans to arrive at the receiver). The second is
held back 120 ms, the third 90 ms, the fourth 60 ms, and the fifth 30 ms.
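(In general, assuming a fixed write interval and that the fast
retransmission arrives at the receiver one RTT after the first segment
sent after the loss, the k-th such segment is held back for roughly
RTT - (k-1) * ITT.)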

All of this extra delay is added before the sender even knows there was
a loss. How it decides to react to the loss signal (dupACKs) will
further determine how much extra delay is added on top of the
delays already inflicted on the segments by the HOLB.

> Instead what if we only perform RDB on the (first and recurring) RTO
> retransmission?

That will change RDB from being a proactive mechanism, to being
reactive, i.e. change how the sender responds to the loss signal. The
problem is that by this point (when the sender has received the loss
signal), the HOLB on the receiver side has already caused significant
increases to the application layer latency.

The reason the RDB streams (in red) in fig. 4 in the paper get such low
latencies is that there are almost no retransmissions. With 10%
uniform loss, the latency for 90% of the packets is not affected at all.
The latency for most of the lost segments is only increased by 30 ms,
which is when the next RDB packet arrives at the receiver with the lost
segment bundled in the payload.
For the regular TCP streams (blue), the latency for 40% of the segments
is affected, where almost 30% of the segments have additional delays of
150 ms or more.
It is important to note that the increases to the latencies for the
regular TCP streams compared to the RDB streams are solely due to HOLB
on the receiver side.

The longer the RTT, the greater the gains from using RDB, considering
the best-case scenario of a minimum of one RTT required for a
retransmission.
As such, RDB will reduce the latencies the most for those that also need
it the most.

However, even with an RTT of 20 ms, an application writing a data
segment every 10 ms will still get significant latency reductions simply
because a retransmission will require a minimum of 20 ms, compared to
the 10 ms it takes for the next RDB packet to arrive at the receiver.


Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-03 18:06 ` [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
@ 2016-03-14 21:15   ` Eric Dumazet
  2016-03-15  1:04     ` Rick Jones
  2016-03-18 17:58     ` Bendik Rønning Opstad
  2016-03-14 21:54   ` Yuchung Cheng
  1 sibling, 2 replies; 81+ messages in thread
From: Eric Dumazet @ 2016-03-14 21:15 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On Thu, 2016-03-03 at 19:06 +0100, Bendik Rønning Opstad wrote:
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
> 
> Latency-sensitive applications or services, such as online games,
> remote control systems, and VoIP, produce traffic with thin-stream
> characteristics, characterized by small packets and relatively high
> inter-transmission times (ITT). When experiencing packet loss, such
> latency-sensitive applications are heavily penalized by the need to
> retransmit lost packets, which increases the latency by a minimum of
> one RTT for the lost packet. Packets coming after a lost packet are
> held back due to head-of-line blocking, causing increased delays for
> all data segments until the lost packet has been retransmitted.

Acked-by: Eric Dumazet <edumazet@google.com>

Note that RDB probably should get some SNMP counters,
so that we get an idea of how many times a loss could be repaired.
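
A rough sketch of what that could look like in the RDB loss detection,
with LINUX_MIB_TCPRDBLOSSREPAIRED as a hypothetical new MIB entry:

	/* hypothetical counter: losses repaired by bundled data */
	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRDBLOSSREPAIRED);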

Ideally, if the path happens to be lossless, all these proactive
bundles are overhead. Might be useful to make RDB conditional on
tp->total_retrans or something.
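
A rough sketch of such a gate at the top of tcp_transmit_rdb_skb()
(hypothetical, untested):

	/* Only bundle once the connection has actually seen a retransmission */
	if (!tcp_sk(sk)->total_retrans)
		goto xmit_default;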

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-03 18:06 ` [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
  2016-03-14 21:15   ` Eric Dumazet
@ 2016-03-14 21:54   ` Yuchung Cheng
  2016-03-15  0:40     ` Bill Fink
  2016-03-17 23:26     ` Bendik Rønning Opstad
  1 sibling, 2 replies; 81+ messages in thread
From: Yuchung Cheng @ 2016-03-14 21:54 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Eric Dumazet, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
>
> Latency-sensitive applications or services, such as online games,
> remote control systems, and VoIP, produce traffic with thin-stream
> characteristics, characterized by small packets and relatively high
> inter-transmission times (ITT). When experiencing packet loss, such
> latency-sensitive applications are heavily penalized by the need to
> retransmit lost packets, which increases the latency by a minimum of
> one RTT for the lost packet. Packets coming after a lost packet are
> held back due to head-of-line blocking, causing increased delays for
> all data segments until the lost packet has been retransmitted.
>
> RDB enables a TCP sender to bundle redundant (already sent) data with
> TCP packets containing small segments of new data. By resending
> un-ACKed data from the output queue in packets with new data, RDB
> reduces the need to retransmit data segments on connections
> experiencing sporadic packet loss. By avoiding a retransmit, RDB
> evades the latency increase of at least one RTT for the lost packet,
> as well as alleviating head-of-line blocking for the packets following
> the lost packet. This makes the TCP connection more resistant to
> latency fluctuations, and reduces the application layer latency
> significantly in lossy environments.
>
> Main functionality added:
>
>   o When a packet is scheduled for transmission, RDB builds and
>     transmits a new SKB containing both the unsent data as well as
>     data of previously sent packets from the TCP output queue.
>
>   o RDB will only be used for streams classified as thin by the
>     function tcp_stream_is_thin_dpifl(). This enforces a lower bound
>     on the ITT for streams that may benefit from RDB, controlled by
>     the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.
>
>   o Loss detection of hidden loss events: When bundling redundant data
>     with each packet, packet loss can be hidden from the TCP engine due
>     to lack of dupACKs. This is because the loss is "repaired" by the
>     redundant data in the packet coming after the lost packet. Based on
>     incoming ACKs, such hidden loss events are detected, and CWR state
>     is entered.
>
> RDB can be enabled on a connection with the socket option TCP_RDB, or
> on all new connections by setting the sysctl variable
> net.ipv4.tcp_rdb=1
>
> Cc: Andreas Petlund <apetlund@simula.no>
> Cc: Carsten Griwodz <griff@simula.no>
> Cc: Pål Halvorsen <paalh@simula.no>
> Cc: Jonas Markussen <jonassm@ifi.uio.no>
> Cc: Kristian Evensen <kristian.evensen@gmail.com>
> Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
> Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
> ---
>  Documentation/networking/ip-sysctl.txt |  15 +++
>  include/linux/skbuff.h                 |   1 +
>  include/linux/tcp.h                    |   3 +-
>  include/net/tcp.h                      |  15 +++
>  include/uapi/linux/tcp.h               |   1 +
>  net/core/skbuff.c                      |   2 +-
>  net/ipv4/Makefile                      |   3 +-
>  net/ipv4/sysctl_net_ipv4.c             |  25 ++++
>  net/ipv4/tcp.c                         |  14 +-
>  net/ipv4/tcp_input.c                   |   3 +
>  net/ipv4/tcp_output.c                  |  48 ++++---
>  net/ipv4/tcp_rdb.c                     | 228 +++++++++++++++++++++++++++++++++
>  12 files changed, 335 insertions(+), 23 deletions(-)
>  create mode 100644 net/ipv4/tcp_rdb.c
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 6a92b15..8f3f3bf 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
>         calculated, which is used to classify whether a stream is thin.
>         Default: 10000
>
> +tcp_rdb - BOOLEAN
> +       Enable RDB for all new TCP connections.
  Please describe RDB briefly, perhaps with a pointer to your paper.
   I suggest having three levels of control:
   0: disable RDB completely
   1: enable individual thin-stream connections to use RDB via the
      TCP_RDB socket option
   2: enable RDB on all thin-stream connections by default

   Currently it only provides modes 1 and 2, but there may be cases where
   the administrator wants to disallow it (e.g., broken middle-boxes).
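
   For mode 1, enabling it on a single connection from userspace would
   then simply be (hypothetical sketch, using the TCP_RDB option value
   this patch adds to the uapi header):

	int val = 1;

	if (setsockopt(fd, IPPROTO_TCP, TCP_RDB, &val, sizeof(val)) < 0)
		perror("setsockopt(TCP_RDB)");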

> +       Default: 0
> +
> +tcp_rdb_max_bytes - INTEGER
> +       Enable restriction on how many bytes an RDB packet can contain.
> +       This is the total amount of payload including the new unsent data.
> +       Default: 0
> +
> +tcp_rdb_max_packets - INTEGER
> +       Enable restriction on how many previous packets in the output queue
> +       RDB may include data from. A value of 1 will restrict bundling to
> +       only the data from the last packet that was sent.
> +       Default: 1
 Why two metrics on redundancy? It also seems better to
 allow an individual socket to select the redundancy level (e.g.,
 setsockopt TCP_RDB=3 means <=3 pkts per bundle) vs. a global setting.
 This requires more bits in tcp_sock, but 2-3 more should suffice.

> +
>  tcp_limit_output_bytes - INTEGER
>         Controls TCP Small Queue limit per tcp socket.
>         TCP bulk sender tends to increase packets in flight until it
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 797cefb..0f2c9d1 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2927,6 +2927,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
>  void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
>  void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
>  int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
>  int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
>  int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
>  __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index bcbf51d..c84de15 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -207,9 +207,10 @@ struct tcp_sock {
>         } rack;
>         u16     advmss;         /* Advertised MSS                       */
>         u8      unused;
> -       u8      nonagle     : 4,/* Disable Nagle algorithm?             */
> +       u8      nonagle     : 3,/* Disable Nagle algorithm?             */
>                 thin_lto    : 1,/* Use linear timeouts for thin streams */
>                 thin_dupack : 1,/* Fast retransmit on first dupack      */
> +               rdb         : 1,/* Redundant Data Bundling enabled      */
>                 repair      : 1,
>                 frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
>         u8      repair_queue;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d38eae9..2d42f4a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -267,6 +267,9 @@ extern int sysctl_tcp_slow_start_after_idle;
>  extern int sysctl_tcp_thin_linear_timeouts;
>  extern int sysctl_tcp_thin_dupack;
>  extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
> +extern int sysctl_tcp_rdb;
> +extern int sysctl_tcp_rdb_max_bytes;
> +extern int sysctl_tcp_rdb_max_packets;
>  extern int sysctl_tcp_early_retrans;
>  extern int sysctl_tcp_limit_output_bytes;
>  extern int sysctl_tcp_challenge_ack_limit;
> @@ -539,6 +542,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
>  bool tcp_may_send_now(struct sock *sk);
>  int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
>  int tcp_retransmit_skb(struct sock *, struct sk_buff *);
> +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> +                    gfp_t gfp_mask);
>  void tcp_retransmit_timer(struct sock *sk);
>  void tcp_xmit_retransmit_queue(struct sock *);
>  void tcp_simple_retransmit(struct sock *);
> @@ -556,6 +561,7 @@ void tcp_send_ack(struct sock *sk);
>  void tcp_send_delayed_ack(struct sock *sk);
>  void tcp_send_loss_probe(struct sock *sk);
>  bool tcp_schedule_loss_probe(struct sock *sk);
> +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb);
>
>  /* tcp_input.c */
>  void tcp_resume_early_retransmit(struct sock *sk);
> @@ -565,6 +571,11 @@ void tcp_reset(struct sock *sk);
>  void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
>  void tcp_fin(struct sock *sk);
>
> +/* tcp_rdb.c */
> +void tcp_rdb_ack_event(struct sock *sk, u32 flags);
> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
> +                        unsigned int mss_now, gfp_t gfp_mask);
> +
>  /* tcp_timer.c */
>  void tcp_init_xmit_timers(struct sock *);
>  static inline void tcp_clear_xmit_timers(struct sock *sk)
> @@ -763,6 +774,7 @@ struct tcp_skb_cb {
>         union {
>                 struct {
>                         /* There is space for up to 20 bytes */
> +                       __u32 rdb_start_seq; /* Start seq of rdb data */
>                 } tx;   /* only used for outgoing skbs */
>                 union {
>                         struct inet_skb_parm    h4;
> @@ -1497,6 +1509,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
>         skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>
> +#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)            \
> +       skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> +
>  static inline struct sk_buff *tcp_send_head(const struct sock *sk)
>  {
>         return sk->sk_send_head;
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index fe95446..6799875 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -115,6 +115,7 @@ enum {
>  #define TCP_CC_INFO            26      /* Get Congestion Control (optional) info */
>  #define TCP_SAVE_SYN           27      /* Record SYN headers for new connections */
>  #define TCP_SAVED_SYN          28      /* Get SYN headers recorded for connection */
> +#define TCP_RDB                        29      /* Enable Redundant Data Bundling mechanism */
>
>  struct tcp_repair_opt {
>         __u32   opt_code;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7af7ec6..50bc5b0 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1055,7 +1055,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
>         skb->inner_mac_header += off;
>  }
>
> -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>  {
>         __copy_skb_header(new, old);
>
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index bfa1336..459048c 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
>              tcp_offload.o datagram.o raw.o udp.o udplite.o \
>              udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
>              fib_frontend.o fib_semantics.o fib_trie.o \
> -            inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
> +            inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
> +            tcp_rdb.o
>
>  obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
>  obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index f04320a..43b4390 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -581,6 +581,31 @@ static struct ctl_table ipv4_table[] = {
>                 .extra1         = &tcp_thin_dpifl_itt_lower_bound_min,
>         },
>         {
> +               .procname       = "tcp_rdb",
> +               .data           = &sysctl_tcp_rdb,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +               .extra2         = &one,
> +       },
> +       {
> +               .procname       = "tcp_rdb_max_bytes",
> +               .data           = &sysctl_tcp_rdb_max_bytes,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +       },
> +       {
> +               .procname       = "tcp_rdb_max_packets",
> +               .data           = &sysctl_tcp_rdb_max_packets,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +       },
> +       {
>                 .procname       = "tcp_early_retrans",
>                 .data           = &sysctl_tcp_early_retrans,
>                 .maxlen         = sizeof(int),
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 8421f3d..b53d4cb 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -288,6 +288,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
>
>  int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
>
> +int sysctl_tcp_rdb __read_mostly;
> +
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>
> @@ -407,6 +409,7 @@ void tcp_init_sock(struct sock *sk)
>         u64_stats_init(&tp->syncp);
>
>         tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
> +       tp->rdb = sysctl_tcp_rdb;
>         tcp_enable_early_retrans(tp);
>         tcp_assign_congestion_control(sk);
>
> @@ -2412,6 +2415,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>                 }
>                 break;
>
> +       case TCP_RDB:
> +               if (val < 0 || val > 1)
> +                       err = -EINVAL;
> +               else
> +                       tp->rdb = val;
> +               break;
> +
>         case TCP_REPAIR:
>                 if (!tcp_can_repair_sock(sk))
>                         err = -EPERM;
> @@ -2842,7 +2852,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>         case TCP_THIN_DUPACK:
>                 val = tp->thin_dupack;
>                 break;
> -
> +       case TCP_RDB:
> +               val = tp->rdb;
> +               break;
>         case TCP_REPAIR:
>                 val = tp->repair;
>                 break;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e6e65f7..7b52ce4 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3537,6 +3537,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
>
>         if (icsk->icsk_ca_ops->in_ack_event)
>                 icsk->icsk_ca_ops->in_ack_event(sk, flags);
> +
> +       if (unlikely(tcp_sk(sk)->rdb))
> +               tcp_rdb_ack_event(sk, flags);
>  }
>
>  /* Congestion control has updated the cwnd already. So if we're in
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 7d2c7a4..6f92fae 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -897,8 +897,8 @@ out:
>   * We are working here with either a clone of the original
>   * SKB, or a fresh unique copy made by the retransmit engine.
>   */
> -static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> -                           gfp_t gfp_mask)
> +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> +                    gfp_t gfp_mask)
>  {
>         const struct inet_connection_sock *icsk = inet_csk(sk);
>         struct inet_sock *inet;
> @@ -2110,9 +2110,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                                 break;
>                 }
>
> -               if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> +               if (unlikely(tcp_sk(sk)->rdb)) {
> +                       if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
> +                               break;
> +               } else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
>                         break;
> -
> +               }
>  repair:
>                 /* Advance the send_head.  This one is sent out.
>                  * This call will increment packets_out.
> @@ -2439,15 +2442,32 @@ u32 __tcp_select_window(struct sock *sk)
>         return window;
>  }
>
> +/**
> + * tcp_skb_append_data() - copy the linear data from an SKB to the end
> + *                         of another and update end sequence number
> + *                         and checksum
> + * @from_skb: the SKB to copy data from
> + * @to_skb: the SKB to copy data to
> + */
> +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
> +{
> +       skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
> +                                 from_skb->len);
> +       TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
> +
> +       if (from_skb->ip_summed == CHECKSUM_PARTIAL)
> +               to_skb->ip_summed = CHECKSUM_PARTIAL;
> +
> +       if (to_skb->ip_summed != CHECKSUM_PARTIAL)
> +               to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
> +                                             to_skb->len);
> +}
> +
>  /* Collapses two adjacent SKB's during retransmission. */
>  static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
> -       int skb_size, next_skb_size;
> -
> -       skb_size = skb->len;
> -       next_skb_size = next_skb->len;
>
>         BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
>
> @@ -2455,17 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>
>         tcp_unlink_write_queue(next_skb, sk);
>
> -       skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
> -                                 next_skb_size);
> -
> -       if (next_skb->ip_summed == CHECKSUM_PARTIAL)
> -               skb->ip_summed = CHECKSUM_PARTIAL;
> -
> -       if (skb->ip_summed != CHECKSUM_PARTIAL)
> -               skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
> -
> -       /* Update sequence range on original skb. */
> -       TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
> +       tcp_skb_append_data(next_skb, skb);
>
>         /* Merge over control information. This moves PSH/FIN etc. over */
>         TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
> diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
> new file mode 100644
> index 0000000..2b37957
> --- /dev/null
> +++ b/net/ipv4/tcp_rdb.c
> @@ -0,0 +1,228 @@
> +#include <linux/skbuff.h>
> +#include <net/tcp.h>
> +
> +int sysctl_tcp_rdb_max_bytes __read_mostly;
> +int sysctl_tcp_rdb_max_packets __read_mostly = 1;
> +
> +/**
> + * rdb_detect_loss() - perform RDB loss detection by analysing ACKs
> + * @sk: socket
> + *
> + * Traverse the output queue and check if the ACKed packet is an RDB
> + * packet and if the redundant data covers one or more un-ACKed SKBs.
> + * If the incoming ACK acknowledges multiple SKBs, we can presume
> + * packet loss has occurred.
> + *
> + * We can infer packet loss this way because we can expect one ACK per
> + * transmitted data packet, as delayed ACKs are disabled when a host
> + * receives packets where the sequence number is not the expected
> + * sequence number.
> + *
> + * Return: The number of packets that are presumed to be lost
> + */
> +static unsigned int rdb_detect_loss(struct sock *sk)
> +{
> +       struct sk_buff *skb, *tmp;
> +       struct tcp_skb_cb *scb;
> +       u32 seq_acked = tcp_sk(sk)->snd_una;
> +       unsigned int packets_lost = 0;
> +
> +       tcp_for_write_queue(skb, sk) {
> +               if (skb == tcp_send_head(sk))
> +                       break;
> +
> +               scb = TCP_SKB_CB(skb);
> +               /* The ACK acknowledges parts of the data in this SKB.
> +                * Can be caused by:
> +                * - TSO: We abort as RDB is not used on SKBs split across
> +                *        multiple packets on lower layers as these are greater
> +                *        than one MSS.
> +                * - Retrans collapse: We've had a retrans, so loss has already
> +                *                     been detected.
> +                */
> +               if (after(scb->end_seq, seq_acked))
> +                       break;
> +               else if (scb->end_seq != seq_acked)
> +                       continue;
> +
> +               /* We have found the ACKed packet */
> +
> +               /* This packet was sent with no redundant data, or no prior
> +                * un-ACKed SKBs are in the output queue, so break here.
> +                */
> +               if (scb->tx.rdb_start_seq == scb->seq ||
> +                   skb_queue_is_first(&sk->sk_write_queue, skb))
> +                       break;
> +               /* Find number of prior SKBs whose data was bundled in this
> +                * (ACKed) SKB. We presume any redundant data covering previous
> +                * SKBs is due to loss. (An exception would be reordering).
> +                */
> +               skb = skb->prev;
> +               tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
> +                       if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
> +                               break;
> +                       packets_lost++;
Since we only care whether there is packet loss or not, we can return early here?

> +               }
> +               break;
> +       }
> +       return packets_lost;
> +}
> +
> +/**
> + * tcp_rdb_ack_event() - initiate RDB loss detection
> + * @sk: socket
> + * @flags: flags
> + */
> +void tcp_rdb_ack_event(struct sock *sk, u32 flags)
flags are not used
> +{
> +       if (rdb_detect_loss(sk))
> +               tcp_enter_cwr(sk);
> +}
> +
> +/**
> + * rdb_build_skb() - build a new RDB SKB and copy redundant + unsent
> + *                   data to the linear page buffer
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission in the output engine
> + * @first_skb: the first SKB in the output queue to be bundled
> + * @bytes_in_rdb_skb: the total number of data bytes for the new
> + *                    rdb_skb (NEW + Redundant)
> + * @gfp_mask: gfp_t allocation
> + *
> + * Return: A new SKB containing redundant data, or NULL if memory
> + *         allocation failed
> + */
> +static struct sk_buff *rdb_build_skb(const struct sock *sk,
> +                                    struct sk_buff *xmit_skb,
> +                                    struct sk_buff *first_skb,
> +                                    u32 bytes_in_rdb_skb,
> +                                    gfp_t gfp_mask)
> +{
> +       struct sk_buff *rdb_skb, *tmp_skb = first_skb;
> +
> +       rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
> +                                     (int)bytes_in_rdb_skb,
> +                                     gfp_mask, false);
> +       if (!rdb_skb)
> +               return NULL;
> +       copy_skb_header(rdb_skb, xmit_skb);
> +       rdb_skb->ip_summed = xmit_skb->ip_summed;
> +       TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
> +
> +       /* Start on first_skb and append payload from each SKB in the output
> +        * queue onto rdb_skb until we reach xmit_skb.
> +        */
> +       tcp_for_write_queue_from(tmp_skb, sk) {
> +               tcp_skb_append_data(tmp_skb, rdb_skb);
> +
> +               /* We reached xmit_skb, containing the unsent data */
> +               if (tmp_skb == xmit_skb)
> +                       break;
> +       }
> +       return rdb_skb;
> +}
> +
> +/**
> + * rdb_can_bundle_test() - test if redundant data can be bundled
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission by the output engine
> + * @max_payload: the maximum allowed payload bytes for the RDB SKB
> + * @bytes_in_rdb_skb: store the total number of payload bytes in the
> + *                    RDB SKB if bundling can be performed
> + *
> + * Traverse the output queue and check if any un-acked data may be
> + * bundled.
> + *
> + * Return: The first SKB to be in the bundle, or NULL if no bundling
> + */
> +static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
> +                                          struct sk_buff *xmit_skb,
> +                                          unsigned int max_payload,
> +                                          u32 *bytes_in_rdb_skb)
> +{
> +       struct sk_buff *first_to_bundle = NULL;
> +       struct sk_buff *tmp, *skb = xmit_skb->prev;
> +       u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */
> +       u32 total_payload = xmit_skb->len;
> +
> +       if (sysctl_tcp_rdb_max_bytes)
> +               max_payload = min_t(unsigned int, max_payload,
> +                                   sysctl_tcp_rdb_max_bytes);
> +
> +       /* We start at xmit_skb->prev, and go backwards */
> +       tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
> +               /* Including data from this SKB would exceed payload limit */
> +               if ((total_payload + skb->len) > max_payload)
> +                       break;
> +
> +               if (sysctl_tcp_rdb_max_packets &&
> +                   (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets))
> +                       break;
> +
> +               total_payload += skb->len;
> +               skbs_in_bundle_count++;
> +               first_to_bundle = skb;
> +       }
> +       *bytes_in_rdb_skb = total_payload;
> +       return first_to_bundle;
> +}
> +
> +/**
> + * tcp_transmit_rdb_skb() - try to create and send an RDB packet
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission by the output engine
> + * @mss_now: current mss value
> + * @gfp_mask: gfp_t allocation
> + *
> + * If an RDB packet could not be created and sent, transmit the
> + * original unmodified SKB (xmit_skb).
> + *
> + * Return: 0 if successfully sent packet, else error from
> + *         tcp_transmit_skb
> + */
> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
> +                        unsigned int mss_now, gfp_t gfp_mask)
> +{
> +       struct sk_buff *rdb_skb = NULL;
> +       struct sk_buff *first_to_bundle;
> +       u32 bytes_in_rdb_skb = 0;
> +
> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;

> +
> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
During loss recovery TCP inflight fluctuates and is likely to trigger
this check even for non-thin-stream connections. Since the loss
has already occurred, RDB can only take advantage of limited transmit,
which it likely does not have (b/c it's a thin stream). It might be
worth checking if the state is Open.

> +               goto xmit_default;
> +
> +       /* No bundling if first in queue, or on FIN packet */
> +       if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
> +           (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))
seems there is still benefit to bundling packets up to the FIN?

> +               goto xmit_default;
> +
> +       /* Find number of (previous) SKBs to get data from */
> +       first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
> +                                             &bytes_in_rdb_skb);
> +       if (!first_to_bundle)
> +               goto xmit_default;
> +
> +       /* Create an SKB that contains redundant data starting from
> +        * first_to_bundle.
> +        */
> +       rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle,
> +                               bytes_in_rdb_skb, gfp_mask);
> +       if (!rdb_skb)
> +               goto xmit_default;
> +
> +       /* Set skb_mstamp for the SKB in the output queue (xmit_skb) containing
> +        * the yet unsent data. Normally this would be done by
> +        * tcp_transmit_skb(), but as we pass in rdb_skb instead, xmit_skb's
> +        * timestamp will not be touched.
> +        */
> +       skb_mstamp_get(&xmit_skb->skb_mstamp);
> +       rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
> +       return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
> +
> +xmit_default:
> +       /* Transmit the unmodified SKB from output queue */
> +       return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
> +}
> --
> 1.9.1
>

Since RDB will cause DSACKs, and we only blindly count DSACKs to
perform CWND undo, how does RDB handle such false positives?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2016-03-13 23:18     ` Bendik Rønning Opstad
@ 2016-03-14 21:59       ` Yuchung Cheng
  2016-03-18 14:25         ` Bendik Rønning Opstad
  0 siblings, 1 reply; 81+ messages in thread
From: Yuchung Cheng @ 2016-03-14 21:59 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Eric Dumazet, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On Sun, Mar 13, 2016 at 4:18 PM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
> On 03/10/2016 01:20 AM, Yuchung Cheng wrote:
>> I read the paper. I think the underlying idea is neat. but the
>> implementation is little heavy-weight that requires changes on fast
>> path (tcp_write_xmit) and space in skb control blocks.
>
> Yuchung, thank you for taking the time to review the patch submission
> and read the paper.
>
> I must admit I was not particularly happy about the extra if-test on the
> fast path, and I fully understand the wish to keep the fast path as
> simple and clean as possible.
> However, is the performance hit that significant considering the branch
> prediction hint for the non-RDB path?
>
> The extra variable needed in the SKB CB does not require increasing the
> CB buffer size due to the "tcp: refactor struct tcp_skb_cb" patch:
> http://patchwork.ozlabs.org/patch/510674 and uses only some of the space
> made available in the outgoing SKBs' CB. Therefore I hoped the extra
> variable would be acceptable.
>
>> ultimately this
>> patch is meant for a small set of specific applications.
>
> Yes, the RDB mechanism is aimed at a limited set of applications,
> specifically time-dependent applications that produce non-greedy,
> application limited (thin) flows. However, our hope is that RDB may
> greatly improve TCP's position as a viable alternative for applications
> transmitting latency sensitive data.
>
>> In my mental model (please correct me if I am wrong), losses on these
>> thin streams would mostly resort to RTOs instead of fast recovery, due
>> to the bursty nature of Internet losses.
>
> This depends on the transmission pattern of the applications, which
> varies to a great deal, also between the different types of
> time-dependent applications that produce thin streams. For short flows,
> (bursty) loss at the end will result in an RTO (if TLP does not probe),
> but the thin streams are often long lived, and the applications
> producing them continue to write small data segments to the socket at
> intervals of tens to hundreds of milliseconds.
>
> What controls if an RTO and not fast retransmit will resend the packet,
> is the number of PIFs, which directly correlates to how often the
> application writes data to the socket in relation to the RTT. As long as
> the number of packets successfully completing a round trip before the
> RTO is >= the dupACK threshold, they will not depend on RTOs (not
> considering TLP). Early retransmit and the TCP_THIN_DUPACK socket option
> will also affect the likelihood of RTOs vs fast retransmits.
>
>> The HOLB comes from the RTO only retransmitting the first (tiny)
>> unacked packet while a small amount of new data is readily available.
>> But since Linux congestion control is packet-based, and the cwnd after
>> loss is 1, the new data needs to wait until the 1st packet is acked,
>> which takes another RTT.
>
> If I understand you correctly, you are referring to HOLB on the sender
> side, which is the extra delay on new data that is held back when the
> connection is CWND-limited. In the paper, we refer to this extra delay
> as increased sojourn times for the outgoing data segments.
>
> We do not include this additional sojourn time for the segments on the
> sender side in the ACK Latency plots (Fig. 4 in the paper). This is
> simply because the pcap traces contain the timestamps when the packets
> are sent, and not when the segments are added to the output queue.
>
> When we refer to the HOLB effect in the paper as well as the thesis, we
> refer to the extra delays (sojourn times) on the receiver side where
> segments are held back (not made available to user space) due to gaps in
> the sequence range when packets are lost (we had no reordering).
>
> So, when considering the increased delays due to HOLB on the receiver
> side, HOLB is not at all limited to RTOs. Actually, it's mostly not due
> to RTOs in the tests we've run, however, this also depends very much on
> the transmission pattern of the application as well as loss levels.
> In general, HOLB on the receiver side will affect any flow that
> transmits a packet with new data after a packet is lost (sender may not
> know yet), where the lost packet has not already been retransmitted.
OK that makes sense.

I left some detailed comments on the actual patches. I would encourage you
to submit an IETF draft to gather feedback from tcpm b/c the feature
seems portable.

>
> Consider a sender application that performs write calls every 30 ms on a
> 150 ms RTT link. It will need a CWND that allows 5-6 PIFs to be able to
> transmit all new data segments with no extra sojourn times on the sender
> side.
> When one packet is lost, the next 5 packets that are sent will be held
> back on the receiver side due to the missing segment (HOLB). In the best
> case scenario, the first dupACK triggers a fast retransmit around the
> same time as the fifth packet (after the lost packet) is sent. In that
> case, the first segment sent after the lost segment is held back on the
> receiver for 150 ms (the time it takes for the dupACK to reach the
> sender, and the fast retrans to arrive at the receiver). The second is
> held back 120 ms, the third 90 ms, the fourth 60 ms, and the fifth 30 ms.
>
> All of this extra delay is added before the sender even knows there was
> a loss. How it decides to react to the loss signal (dupACKs) will
> further decide how much extra delays will be added in addition to the
> delays already inflicted on the segments by the HOLB.
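(In other words, in this best-case timeline all the held-back segments are
released by the same retransmission, so the hold-back of the k-th segment
sent after the loss is roughly RTT - (k-1)*ITT, for k = 1 .. RTT/ITT; with
RTT = 150 ms and ITT = 30 ms that gives the 150/120/90/60/30 ms figures
quoted above.)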
>
>> Instead what if we only perform RDB on the (first and recurring) RTO
>> retransmission?
>
> That will change RDB from being a proactive mechanism, to being
> reactive, i.e. change how the sender responds to the loss signal. The
> problem is that by this point (when the sender has received the loss
> signal), the HOLB on the receiver side has already caused significant
> increases to the application layer latency.
>
> The reason the RDB streams (in red) in fig. 4 in the paper get such low
> latencies is because there are almost no retransmissions. With 10%
> uniform loss, the latency for 90% of the packets is not affected at all.
> The latency for most of the lost segments is only increased by 30 ms,
> which is when the next RDB packet arrives at the receiver with the lost
> segment bundled in the payload.
> For the regular TCP streams (blue), the latency for 40% of the segments
> is affected, where almost 30% of the segments have additional delays of
> 150 ms or more.
> It is important to note that the increases to the latencies for the
> regular TCP streams compared to the RDB streams are solely due to HOLB
> on the receiver side.
>
> The longer the RTT, the greater the gains are by using RDB, considering
> the best case scenario of minimum one RTT required for a retransmission.
> As such, RDB will reduce the latencies the most for those that also need
> it the most.
>
> However, even with an RTT of 20 ms, an application writing a data
> segment every 10 ms will still get significant latency reductions simply
> because a retransmission will require a minimum of 20 ms, compared to
> the 10 ms it takes for the next RDB packet to arrive at the receiver.
>
>
> Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-14 21:54   ` Yuchung Cheng
@ 2016-03-15  0:40     ` Bill Fink
  2016-03-17 23:26     ` Bendik Rønning Opstad
  1 sibling, 0 replies; 81+ messages in thread
From: Bill Fink @ 2016-03-15  0:40 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Bendik Rønning Opstad, David S. Miller, netdev,
	Eric Dumazet, Neal Cardwell, Andreas Petlund, Carsten Griwodz,
	Pål Halvorsen, Jonas Markussen, Kristian Evensen,
	Kenneth Klette Jonassen

On Mon, 14 Mar 2016, Yuchung Cheng wrote:

> On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad
> <bro.devel@gmail.com> wrote:
> >
> > Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> > the latency for applications sending time-dependent data.
...
> > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > index 6a92b15..8f3f3bf 100644
> > --- a/Documentation/networking/ip-sysctl.txt
> > +++ b/Documentation/networking/ip-sysctl.txt
> > @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
> >         calculated, which is used to classify whether a stream is thin.
> >         Default: 10000
> >
> > +tcp_rdb - BOOLEAN
> > +       Enable RDB for all new TCP connections.
>   Please describe RDB briefly, perhaps with a pointer to your paper.
>    I suggest having three levels of control:
>    0: disable RDB completely
>    1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket
> options
>    2: enable RDB on all thin-stream conn. by default
> 
>    Currently it only provides modes 1 and 2, but there may be cases where
>    the administrator wants to disallow it (e.g., broken middle-boxes).
> 
> > +       Default: 0

A per route setting to enable or disable tcp_rdb, overriding
the global setting, could also be useful to the administrator.
Just a suggestion for potential followup work.

					-Bill

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-14 21:15   ` Eric Dumazet
@ 2016-03-15  1:04     ` Rick Jones
  2016-03-15 18:09       ` Yuchung Cheng
  2016-03-18 17:58     ` Bendik Rønning Opstad
  1 sibling, 1 reply; 81+ messages in thread
From: Rick Jones @ 2016-03-15  1:04 UTC (permalink / raw)
  To: Eric Dumazet, Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On 03/14/2016 02:15 PM, Eric Dumazet wrote:
> On Thu, 2016-03-03 at 19:06 +0100, Bendik Rønning Opstad wrote:
>> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
>> the latency for applications sending time-dependent data.
>>
>> Latency-sensitive applications or services, such as online games,
>> remote control systems, and VoIP, produce traffic with thin-stream
>> characteristics, characterized by small packets and relatively high
>> inter-transmission times (ITT). When experiencing packet loss, such
>> latency-sensitive applications are heavily penalized by the need to
>> retransmit lost packets, which increases the latency by a minimum of
>> one RTT for the lost packet. Packets coming after a lost packet are
>> held back due to head-of-line blocking, causing increased delays for
>> all data segments until the lost packet has been retransmitted.
>
> Acked-by: Eric Dumazet <edumazet@google.com>
>
> Note that RDB probably should get some SNMP counters,
> so that we get an idea of how many times a loss could be repaired.

And some idea of the duplication seen by receivers, assuming there isn't 
already a counter for such a thing in Linux.

happy benchmarking,

rick jones

>
> Ideally, if the path happens to be lossless, all these proactive
> bundles are overhead. Might be useful to make RDB conditional on
> tp->total_retrans or something.
>
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-15  1:04     ` Rick Jones
@ 2016-03-15 18:09       ` Yuchung Cheng
  0 siblings, 0 replies; 81+ messages in thread
From: Yuchung Cheng @ 2016-03-15 18:09 UTC (permalink / raw)
  To: Rick Jones
  Cc: Eric Dumazet, Bendik Rønning Opstad, David S. Miller,
	netdev, Neal Cardwell, Andreas Petlund, Carsten Griwodz,
	Pål Halvorsen, Jonas Markussen, Kristian Evensen,
	Kenneth Klette Jonassen

On Mon, Mar 14, 2016 at 6:04 PM, Rick Jones <rick.jones2@hpe.com> wrote:
>
> On 03/14/2016 02:15 PM, Eric Dumazet wrote:
>>
>> On Thu, 2016-03-03 at 19:06 +0100, Bendik Rønning Opstad wrote:
>>>
>>> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
>>> the latency for applications sending time-dependent data.
>>>
>>> Latency-sensitive applications or services, such as online games,
>>> remote control systems, and VoIP, produce traffic with thin-stream
>>> characteristics, characterized by small packets and relatively high
>>> inter-transmission times (ITT). When experiencing packet loss, such
>>> latency-sensitive applications are heavily penalized by the need to
>>> retransmit lost packets, which increases the latency by a minimum of
>>> one RTT for the lost packet. Packets coming after a lost packet are
>>> held back due to head-of-line blocking, causing increased delays for
>>> all data segments until the lost packet has been retransmitted.
>>
>>
>> Acked-by: Eric Dumazet <edumazet@google.com>
>>
>> Note that RDB probably should get some SNMP counters,
>> so that we get an idea of how many times a loss could be repaired.
>
>
> And some idea of the duplication seen by receivers, assuming there isn't already a counter for such a thing in Linux.

We sort of track that in the awkwardly named LINUX_MIB_DELAYEDACKLOST



>
> happy benchmarking,
>
> rick jones
>
>
>>
>> Ideally, if the path happens to be lossless, all these proactive
>> bundles are overhead. Might be useful to make RDB conditional on
>> tp->total_retrans or something.
>>
>>
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-14 21:54   ` Yuchung Cheng
  2016-03-15  0:40     ` Bill Fink
@ 2016-03-17 23:26     ` Bendik Rønning Opstad
  2016-03-21 18:54       ` Yuchung Cheng
  1 sibling, 1 reply; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-03-17 23:26 UTC (permalink / raw)
  To: Yuchung Cheng, Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Eric Dumazet, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index 6a92b15..8f3f3bf 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
>>         calculated, which is used to classify whether a stream is thin.
>>         Default: 10000
>>
>> +tcp_rdb - BOOLEAN
>> +       Enable RDB for all new TCP connections.
>   Please describe RDB briefly, perhaps with a pointer to your paper.

Ah, yes, that description may have been a bit too brief...

What about pointing to tcp-thin.txt in the brief description, and
rewrite tcp-thin.txt with a more detailed description of RDB along
with a paper reference?

>    I suggest having three levels of control:
>    0: disable RDB completely
>    1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket
> options
>    2: enable RDB on all thin-stream conn. by default
>
>    Currently it only provides modes 1 and 2, but there may be cases where
>    the administrator wants to disallow it (e.g., broken middle-boxes).

Good idea. Will change this.

>> +       Default: 0
>> +
>> +tcp_rdb_max_bytes - INTEGER
>> +       Enable restriction on how many bytes an RDB packet can contain.
>> +       This is the total amount of payload including the new unsent data.
>> +       Default: 0
>> +
>> +tcp_rdb_max_packets - INTEGER
>> +       Enable restriction on how many previous packets in the output queue
>> +       RDB may include data from. A value of 1 will restrict bundling to
>> +       only the data from the last packet that was sent.
>> +       Default: 1
>  why two metrics on redundancy?

We have primarily used the packet based limit in our tests. This is
also the most important knob as it directly controls how many lost
packets each RDB packet may recover.

We believe that the byte based limit can also be useful because it
allows more fine grained control on how much impact RDB can have on
the increased bandwidth requirements of the flows. If an application
writes 700 bytes per write call, the bandwidth increase can be quite
significant (even with a 1 packet bundling limit) if we consider a
scenario with thousands of RDB streams.

In some of our experiments with many simultaneous thin streams, where
we set up a bottleneck rate limited by a htb with pfifo queue, we
observed considerable difference in loss rates depending on how many
bytes (packets) were allowed to be bundled with each packet. This is
partly why we recommend a default bundling limit of 1 packet.

By limiting the total payload size of RDB packets to e.g. 100 bytes,
only the smallest segments will benefit from RDB, while the segments
that would increase the bandwidth requirements the most, will not.

While a very large number of RDB streams from one sender may be a
corner case, we still think this sysctl knob can be valuable for a
sysadmin that finds himself in such a situation.
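
As a concrete illustration (hypothetical values, and assuming the knobs end
up under net.ipv4 as in the ip-sysctl.txt hunk above), such a sysadmin could
combine the two limits like this:

    # bundle data from at most one previous packet ...
    net.ipv4.tcp_rdb_max_packets = 1
    # ... and only for RDB packets whose total payload stays within 100 bytes
    net.ipv4.tcp_rdb_max_bytes = 100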

> It also seems better to
> allow individual socket to select the redundancy level (e.g.,
> setsockopt TCP_RDB=3 means <=3 pkts per bundle) vs a global setting.
> This requires more bits in tcp_sock but 2-3 more would suffice.

Most certainly. We decided not to implement this for the patch to keep
it as simple as possible, however, we surely prefer to have this
functionality included if possible.

>> +static unsigned int rdb_detect_loss(struct sock *sk)
>> +{
...
>> +               tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
>> +                       if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
>> +                               break;
>> +                       packets_lost++;
> since we only care if there is packet loss or not, we can return early here?

Yes, I considered that, and as long as the number of packets presumed
to be lost is not needed, that will suffice. However, could this not
be useful for statistical purposes?

This is also relevant to the comment from Eric on SNMP counters for
how many times losses could be repaired by RDB?
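
For reference, the early-return variant would just be a one-line change in
the inner loop (a sketch, not what the submitted patch does):

        tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
                if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
                        break;
                return 1;       /* at least one covered SKB: loss detected */
        }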

>> +               }
>> +               break;
>> +       }
>> +       return packets_lost;
>> +}
>> +
>> +/**
>> + * tcp_rdb_ack_event() - initiate RDB loss detection
>> + * @sk: socket
>> + * @flags: flags
>> + */
>> +void tcp_rdb_ack_event(struct sock *sk, u32 flags)
> flags are not used

Ah, yes, will remove that.

>> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
>> +                        unsigned int mss_now, gfp_t gfp_mask)
>> +{
>> +       struct sk_buff *rdb_skb = NULL;
>> +       struct sk_buff *first_to_bundle;
>> +       u32 bytes_in_rdb_skb = 0;
>> +
>> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
>> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
>
>> +
>> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
> During loss recovery TCP inflight fluctuates and is likely to trigger
> this check even for non-thin-stream connections.

Good point.

> Since the loss
> has already occurred, RDB can only take advantage of limited transmit,
> which it likely does not have (b/c it's a thin stream). It might be
> worth checking if the state is Open.

You mean to test for open state to avoid calling rdb_can_bundle_test()
unnecessarily if we (presume to) know it cannot bundle anyway? That
makes sense, however, I would like to do some tests on whether "state
!= open" is a good indicator on when bundling is not possible.

>> +               goto xmit_default;
>> +
>> +       /* No bundling if first in queue, or on FIN packet */
>> +       if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
>> +           (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))
> seems there is still benefit to bundling packets up to the FIN?

I was close to removing the FIN test, but decided to not remove it
until I could verify that it will not cause any issues on some TCP
receivers. If/(Since?) you are certain it will not cause any issues, I
will remove it.

> Since RDB will cause DSACKs, and we only blindly count DSACKs to
> perform CWND undo, how does RDB handle such false positives?

That is a very good question. The simple answer is that the
implementation does not handle any such false positives, which I
expect can result in incorrectly undoing CWND reduction in some cases.
This gets a bit complicated, so I'll have to do some more testing on
this to verify with certainty when it happens.

When there is no loss, and each RDB packet arriving at the receiver
contains both already received and new data, the receiver will respond
with an ACK that acknowledges new data (moves snd_una), with the SACK
field populated with the already received sequence range (DSACK).

The DSACKs in these incoming ACKs are not counted (tp->undo_retrans--)
unless tp->undo_marker has been set by tcp_init_undo(), which is
called by either tcp_enter_loss() or tcp_enter_recovery(). However,
whenever a loss is detected by rdb_detect_loss(), tcp_enter_cwr() is
called, which disables CWND undo. Therefore, I believe the incorrect
counting of DSACKs from ACKs on RDB packets will only be a problem
after the regular loss detection mechanisms (Fast Retransmit/RTO) have
been triggered (i.e. we are in either TCP_CA_Recovery or TCP_CA_Loss).
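
(If it helps, a sketch of how such a clarifying comment could read at the
call site; the wording only restates the reasoning above:)

        /* tcp_enter_cwr() disables cwnd undo, so the DSACKs generated by
         * our own redundant (bundled) data cannot be mis-counted as
         * spurious retransmissions while we handle the detected loss.
         */
        if (rdb_detect_loss(sk))
                tcp_enter_cwr(sk);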

We have recorded the CWND values for both RDB and non-RDB streams in
our experiments, and have not found any obvious red flags when
analysing the results, so I presume (hope may be more precise) this is
not a major issue we have missed. Nevertheless, I will investigate
this in detail and get back to you.


Thank you for the detailed comments.

Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2016-03-14 21:59       ` Yuchung Cheng
@ 2016-03-18 14:25         ` Bendik Rønning Opstad
  0 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-03-18 14:25 UTC (permalink / raw)
  To: Yuchung Cheng, Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Eric Dumazet, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On 14/03/16 22:59, Yuchung Cheng wrote:
> OK that makes sense.
>
> I left some detailed comments on the actual patches. I would encourage you
> to submit an IETF draft to gather feedback from tcpm b/c the feature
> seems portable.

Thank you for the suggestion, we appreciate the confidence. We have
had in mind to eventually pursue a standardization process, but have
been unsure about how a mechanism that actively introduces redundancy
would be received by the IETF. It may now be the right time to propose
the RDB mechanism, and we will aim to present an IETF draft in the
near future.


Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-14 21:15   ` Eric Dumazet
  2016-03-15  1:04     ` Rick Jones
@ 2016-03-18 17:58     ` Bendik Rønning Opstad
  1 sibling, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-03-18 17:58 UTC (permalink / raw)
  To: Eric Dumazet, Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On 14/03/16 22:15, Eric Dumazet wrote:
> Acked-by: Eric Dumazet <edumazet@google.com>
>
> Note that RDB probably should get some SNMP counters,
> so that we get an idea of how many times a loss could be repaired.

Good idea. Simply count how many times an RDB packet successfully
repaired loss? Note that this can be one or more lost packets. When
bundling N packets, the RDB packet can repair up to N losses in the
previous N packets that were sent.

Which list should this be added to? snmp4_tcp_list?

Any other counters that would be useful? Total number of RDB packets
transmitted?
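
A minimal sketch of where the first counter could be bumped (the MIB name
here is just a placeholder, nothing is settled yet):

        if (rdb_detect_loss(sk)) {
                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRDBLOSSREPAIRS);
                tcp_enter_cwr(sk);
        }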

> Ideally, if the path happens to be lossless, all these proactive
> bundles are overhead. Might be useful to make RDB conditional on
> tp->total_retrans or something.

Yes, that is a good point. We have discussed this (for years really),
but have not had the opportunity to investigate it in-depth. Having
such a condition hard coded is not ideal, as it very much depends on
the use case if bundling from the beginning is desirable. In most
cases, this is probably a fair compromise, but preferably we would
have some logic/settings to control how the bundling rate can be
dynamically adjusted in response to certain events, defined by a set
of given metrics.

A conservative (default) setting would not do bundling until loss has
been registered, and could also check against some smoothed loss
indicator such that a certain amount of loss must have occurred within
a specific time frame to allow bundling. This could be useful in cases
where the network congestion varies greatly depending on, for example,
the time of day or night.
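
A sketch of what the most conservative gate could look like (only
illustrating the tp->total_retrans idea; the exact condition is open):

        /* hypothetical: do not bundle until the connection has seen loss */
        if (!tcp_sk(sk)->total_retrans)
                goto xmit_default;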

In a scenario where minimal application layer latency is very
important, but only sporadic (single) packet loss is expected to
regularly occur, always bundling one previous packet may be both
sufficient and desirable.

In the end, the best settings for an application/service depends on
the degree to which application layer latency (both minimal and
variations) affects the QoE.

There are many possibilities to consider in this regard, and I expect
we will not have this question fully explored any time soon. Most
importantly, we should ensure that such logic can easily be added
later on without breaking backwards compatibility.

Suggestions and comments on this are very welcome.


Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-17 23:26     ` Bendik Rønning Opstad
@ 2016-03-21 18:54       ` Yuchung Cheng
  2016-06-16 17:12         ` Bendik Rønning Opstad
  0 siblings, 1 reply; 81+ messages in thread
From: Yuchung Cheng @ 2016-03-21 18:54 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, netdev, Eric Dumazet, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On Thu, Mar 17, 2016 at 4:26 PM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>
> >> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> >> index 6a92b15..8f3f3bf 100644
> >> --- a/Documentation/networking/ip-sysctl.txt
> >> +++ b/Documentation/networking/ip-sysctl.txt
> >> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
> >>         calculated, which is used to classify whether a stream is thin.
> >>         Default: 10000
> >>
> >> +tcp_rdb - BOOLEAN
> >> +       Enable RDB for all new TCP connections.
> >   Please describe RDB briefly, perhaps with a pointer to your paper.
>
> Ah, yes, that description may have been a bit too brief...
>
> What about pointing to tcp-thin.txt in the brief description, and
> rewrite tcp-thin.txt with a more detailed description of RDB along
> with a paper reference?
+1
>
> >    I suggest having three levels of control:
> >    0: disable RDB completely
> >    1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket
> > options
> >    2: enable RDB on all thin-stream conn. by default
> >
> >    Currently it only provides modes 1 and 2, but there may be cases where
> >    the administrator wants to disallow it (e.g., broken middle-boxes).
>
> Good idea. Will change this.
>
> >> +       Default: 0
> >> +
> >> +tcp_rdb_max_bytes - INTEGER
> >> +       Enable restriction on how many bytes an RDB packet can contain.
> >> +       This is the total amount of payload including the new unsent data.
> >> +       Default: 0
> >> +
> >> +tcp_rdb_max_packets - INTEGER
> >> +       Enable restriction on how many previous packets in the output queue
> >> +       RDB may include data from. A value of 1 will restrict bundling to
> >> +       only the data from the last packet that was sent.
> >> +       Default: 1
> >  why two metrics on redundancy?
>
> We have primarily used the packet based limit in our tests. This is
> also the most important knob as it directly controls how many lost
> packets each RDB packet may recover.
>
> We believe that the byte based limit can also be useful because it
> allows more fine grained control on how much impact RDB can have on
> the increased bandwidth requirements of the flows. If an application
> writes 700 bytes per write call, the bandwidth increase can be quite
> significant (even with a 1 packet bundling limit) if we consider a
> scenario with thousands of RDB streams.
>
> In some of our experiments with many simultaneous thin streams, where
> we set up a bottleneck rate limited by a htb with pfifo queue, we
> observed considerable difference in loss rates depending on how many
> bytes (packets) were allowed to be bundled with each packet. This is
> partly why we recommend a default bundling limit of 1 packet.
>
> By limiting the total payload size of RDB packets to e.g. 100 bytes,
> only the smallest segments will benefit from RDB, while the segments
> that would increase the bandwidth requirements the most, will not.
>
> While a very large number of RDB streams from one sender may be a
> corner case, we still think this sysctl knob can be valuable for a
> sysadmin that finds himself in such a situation.
These nice comments would be useful in the sysctl descriptions.

>
> > It also seems better to
> > allow individual socket to select the redundancy level (e.g.,
> > setsockopt TCP_RDB=3 means <=3 pkts per bundle) vs a global setting.
> > This requires more bits in tcp_sock but 2-3 more would suffice.
>
> Most certainly. We decided not to implement this for the patch to keep
> it as simple as possible, however, we surely prefer to have this
> functionality included if possible.
>
> >> +static unsigned int rdb_detect_loss(struct sock *sk)
> >> +{
> ...
> >> +               tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
> >> +                       if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
> >> +                               break;
> >> +                       packets_lost++;
> > since we only care if there is packet loss or not, we can return early here?
>
> Yes, I considered that, and as long as the number of packets presumed
> to be lost is not needed, that will suffice. However, could this not
> be useful for statistical purposes?
>
> This is also relevant to the comment from Eric on SNMP counters for
> how many times losses could be repaired by RDB?
>
> >> +               }
> >> +               break;
> >> +       }
> >> +       return packets_lost;
> >> +}
> >> +
> >> +/**
> >> + * tcp_rdb_ack_event() - initiate RDB loss detection
> >> + * @sk: socket
> >> + * @flags: flags
> >> + */
> >> +void tcp_rdb_ack_event(struct sock *sk, u32 flags)
> > flags are not used
>
> Ah, yes, will remove that.
>
> >> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
> >> +                        unsigned int mss_now, gfp_t gfp_mask)
> >> +{
> >> +       struct sk_buff *rdb_skb = NULL;
> >> +       struct sk_buff *first_to_bundle;
> >> +       u32 bytes_in_rdb_skb = 0;
> >> +
> >> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
> >> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
> >
> >> +
> >> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
> > During loss recovery TCP inflight fluctuates and is likely to trigger
> > this check even for non-thin-stream connections.
>
> Good point.
>
> > Since the loss
> > has already occurred, RDB can only take advantage of limited transmit,
> > which it likely does not have (b/c it's a thin stream). It might be
> > worth checking if the state is Open.
>
> You mean to test for open state to avoid calling rdb_can_bundle_test()
> unnecessarily if we (presume to) know it cannot bundle anyway? That
> makes sense, however, I would like to do some tests on whether "state
> != open" is a good indicator on when bundling is not possible.
>
> >> +               goto xmit_default;
> >> +
> >> +       /* No bundling if first in queue, or on FIN packet */
> >> +       if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
> >> +           (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))
> > seems there is still benefit to bundling packets up to the FIN?
>
> I was close to removing the FIN test, but decided to not remove it
> until I could verify that it will not cause any issues on some TCP
> receivers. If/(Since?) you are certain it will not cause any issues, I
> will remove it.
>
> > Since RDB will cause DSACKs, and we only blindly count DSACKs to
> > perform CWND undo, how does RDB handle such false positives?
>
> That is a very good question. The simple answer is that the
> implementation does not handle any such false positives, which I
> expect can result in incorrectly undoing CWND reduction in some cases.
> This gets a bit complicated, so I'll have to do some more testing on
> this to verify with certainty when it happens.
>
> When there is no loss, and each RDB packet arriving at the receiver
> contains both already received and new data, the receiver will respond
> with an ACK that acknowledges new data (moves snd_una), with the SACK
> field populated with the already received sequence range (DSACK).
>
> The DSACKs in these incoming ACKs are not counted (tp->undo_retrans--)
> unless tp->undo_marker has been set by tcp_init_undo(), which is
> called by either tcp_enter_loss() or tcp_enter_recovery(). However,
> whenever a loss is detected by rdb_detect_loss(), tcp_enter_cwr() is
> called, which disables CWND undo. Therefore, I believe the incorrect
thanks for the clarification. it might be worth a short comment on why we
use tcp_enter_cwr() (to disable undo)


> counting of DSACKs from ACKs on RDB packets will only be a problem
> after the regular loss detection mechanisms (Fast Retransmit/RTO) have
> been triggered (i.e. we are in either TCP_CA_Recovery or TCP_CA_Loss).
>
> We have recorded the CWND values for both RDB and non-RDB streams in
> our experiments, and have not found any obvious red flags when
> analysing the results, so I presume (hope may be more precise) this is
> not a major issue we have missed. Nevertheless, I will investigate
> this in detail and get back to you.
>
>
> Thank you for the detailed comments.
>
> Bendik
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2016-03-21 18:54       ` Yuchung Cheng
@ 2016-06-16 17:12         ` Bendik Rønning Opstad
  0 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-06-16 17:12 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David S. Miller, netdev, Eric Dumazet, Neal Cardwell,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On 21/03/16 19:54, Yuchung Cheng wrote:
> On Thu, Mar 17, 2016 at 4:26 PM, Bendik Rønning Opstad
> <bro.devel@gmail.com> wrote:
>>
>>>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>>>> index 6a92b15..8f3f3bf 100644
>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
>>>>         calculated, which is used to classify whether a stream is thin.
>>>>         Default: 10000
>>>>
>>>> +tcp_rdb - BOOLEAN
>>>> +       Enable RDB for all new TCP connections.
>>>   Please describe RDB briefly, perhaps with a pointer to your paper.
>>
>> Ah, yes, that description may have been a bit too brief...
>>
>> What about pointing to tcp-thin.txt in the brief description, and
>> rewrite tcp-thin.txt with a more detailed description of RDB along
>> with a paper reference?
> +1
>>
>>>    I suggest having three levels of control:
>>>    0: disable RDB completely
>>>    1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket
>>> options
>>>    2: enable RDB on all thin-stream conn. by default
>>>
>>>    Currently it only provides modes 1 and 2, but there may be cases where
>>>    the administrator wants to disallow it (e.g., broken middle-boxes).
>>
>> Good idea. Will change this.

I have implemented your suggestion in the next patch.

>>> It also seems better to
>>> allow individual socket to select the redundancy level (e.g.,
>>> setsockopt TCP_RDB=3 means <=3 pkts per bundle) vs a global setting.
>>> This requires more bits in tcp_sock but 2-3 more would suffice.
>>
>> Most certainly. We decided not to implement this for the patch to keep
>> it as simple as possible, however, we surely prefer to have this
>> functionality included if possible.

Next patch version has a socket option to allow modifying the different
RDB settings.

>>>> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
>>>> +                        unsigned int mss_now, gfp_t gfp_mask)
>>>> +{
>>>> +       struct sk_buff *rdb_skb = NULL;
>>>> +       struct sk_buff *first_to_bundle;
>>>> +       u32 bytes_in_rdb_skb = 0;
>>>> +
>>>> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
>>>> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
>>>
>>>> +
>>>> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
>>> During loss recovery TCP inflight fluctuates and is likely to trigger
>>> this check even for non-thin-stream connections.
>>
>> Good point.
>>
>>> Since the loss
>>> has already occurred, RDB can only take advantage of limited transmit,
>>> which it likely does not have (b/c it's a thin stream). It might be
>>> worth checking if the state is Open.
>>
>> You mean to test for open state to avoid calling rdb_can_bundle_test()
>> unnecessarily if we (presume to) know it cannot bundle anyway? That
>> makes sense, however, I would like to do some tests on whether "state
>> != open" is a good indicator on when bundling is not possible.

When testing this I found that bundling can often be performed when
not in Open state. For the most part in CWR mode, but also the other
modes, so this does not seem like a good indicator.

The only problem with tcp_stream_is_thin_dpifl() triggering for
non-thin streams in loss recovery would be the performance penalty of
calling rdb_can_bundle_test(). It would not be able to bundle anyways
since the previous SKB would contain >= mss worth of data.

The most reliable test is to check available space in the previous
SKB, i.e. if (xmit_skb->prev->len == mss_now). Do you suggest, for
performance reasons, to do this before the call to
tcp_stream_is_thin_dpifl()?
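
For concreteness, a sketch of that ordering (hypothetical, not the submitted
code; it assumes the first-in-queue test still runs before dereferencing
xmit_skb->prev):

        if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
            xmit_skb->prev->len == mss_now ||
            !tcp_stream_is_thin_dpifl(tcp_sk(sk)))
                goto xmit_default;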

>>> Since RDB will cause DSACKs, and we only blindly count DSACKs to
>>> perform CWND undo, how does RDB handle such false positives?
>>
>> That is a very good question. The simple answer is that the
>> implementation does not handle any such false positives, which I
>> expect can result in incorrectly undoing CWND reduction in some cases.
>> This gets a bit complicated, so I'll have to do some more testing on
>> this to verify with certainty when it happens.
>>
>> When there is no loss, and each RDB packet arriving at the receiver
>> contains both already received and new data, the receiver will respond
>> with an ACK that acknowledges new data (moves snd_una), with the SACK
>> field populated with the already received sequence range (DSACK).
>>
>> The DSACKs in these incoming ACKs are not counted (tp->undo_retrans--)
>> unless tp->undo_marker has been set by tcp_init_undo(), which is
>> called by either tcp_enter_loss() or tcp_enter_recovery(). However,
>> whenever a loss is detected by rdb_detect_loss(), tcp_enter_cwr() is
>> called, which disables CWND undo. Therefore, I believe the incorrect
> thanks for the clarification. it might be worth a short comment on why we
> use tcp_enter_cwr() (to disable undo)
>
>
>> counting of DSACKs from ACKs on RDB packets will only be a problem
>> after the regular loss detection mechanisms (Fast Retransmit/RTO) have
>> been triggered (i.e. we are in either TCP_CA_Recovery or TCP_CA_Loss).
>>
>> We have recorded the CWND values for both RDB and non-RDB streams in
>> our experiments, and have not found any obvious red flags when
>> analysing the results, so I presume (hope may be more precise) this is
>> not a major issue we have missed. Nevertheless, I will investigate
>> this in detail and get back to you.

I've looked into this and tried to figure out in which cases this is
actually a problem, but I have failed to find any.

One scenario I considered is when an RDB packet is sent right after a
retransmit, which would result in DSACK in the ACK in response to the
RDB packet.

With a bundling limit of 1 packet, two packets must be lost for RDB to
fail to repair the loss, causing dupACKs. So if three packets are sent,
where the first two are lost, the last packet will cause a dupACK,
resulting in a fast retransmit (and entering recovery which calls
tcp_init_undo()).

By writing new data to the socket right after the fast retransmit,
a new RDB packet is built with some old data that was just
retransmitted.

On the ACK on the fast retransmit the state is changed from Recovery
to Open. The next incoming ACK (on the RDB packet) will contain a DSACK
range, but it will not be considered dubious (tcp_ack_is_dubious())
since "!(flag & FLAG_NOT_DUP)" is false (new data was acked), state is
Open, and "flag & FLAG_CA_ALERT" evaluates to false.


Feel free to suggest scenarios (as detailed as possible) with the
potential to cause such false positives, and I'll test them with
packetdrill.

Bendik

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (18 preceding siblings ...)
  (?)
@ 2016-06-22 14:56 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-06-22 14:56 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen


Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.
Latency-sensitive applications or services, such as online games and
remote desktop, produce traffic with thin-stream characteristics,
characterized by small packets and a relatively high ITT. By bundling
already sent data in packets with new data, RDB alleviates head-of-line
blocking by reducing the need to retransmit data segments when packets
are lost. RDB is a continuation on the work on latency improvements for
TCP in Linux, previously resulting in two thin-stream mechanisms in the
Linux kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows
significant latency reductions when packet loss occurs[1]. The tests
show that, by imposing restrictions on the bundling rate, it can be
made not to negatively affect competing traffic in an unfair manner.

These patches have also been tested with a set of packetdrill scripts
located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as
in the paper "Latency and Fairness Trade-Off for Thin Streams using
Redundant Data Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf


Changes:

v7 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed sysctl_tcp_rdb to accept three values (Thanks Yuchung):
     - 0: Disable system wide (RDB cannot be enabled with TCP_RDB socket option)
     - 1: Allow enabling RDB with TCP_RDB socket option.
     - 2: Enable RDB by default on all TCP sockets and allow to modify with TCP_RDB
   * Added sysctl tcp_rdb_wait_congestion to control if RDB by default should
     wait for congestion before bundling. (Ref. comment by Eric on lossless conns)
   * Changed socket options to modify per-socket RDB settings:
     - Added flags to TCP_RDB to allow bundling without waiting for loss with
       TCP_RDB_BUNDLE_IMMEDIATE.
     - Added socket option TCP_RDB_MAX_BYTES: Set max bytes per RDB packet.
     - Added socket option TCP_RDB_MAX_PACKETS: Set max packets allowed to be
       bundled by RDB.
   * Added SNMP counter LINUX_MIB_TCPRDBLOSSREPAIRS to count the occurrences
     where RDB repaired a loss (Thanks Eric).
   * Bundle on FIN packets (Thanks Yuchung).
   * Updated docs in Documentation/networking/{ip-sysctl.txt,tcp-thin.txt}
   * Removed flags parameter from tcp_rdb_ack_event()
   * Changed sysctl knobs to using network namespace.

 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Changed sysctl knobs to using network namespace

v6 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Renamed rdb_ack_event() to tcp_rdb_ack_event() (Thanks DaveM)
   * Minor doc changes

 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Minor doc changes

v5 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed two unnecessary EXPORT_SYMBOLs (Thanks Eric)
   * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
   * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
   * Merged the two if tests for max payload of RDB packet in
     rdb_can_bundle_test()
   * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss()
     and restructured to reduce indentation.
   * Improved docs
   * Revised commit message to be more detailed.

 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this
     function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into
     tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to
     tcp_rdb_max_packets after comment from Eric Dumazet about
     not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on
     feedback from Eric Dumazet.

 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB)
     to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)


Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  43 ++++++
 Documentation/networking/tcp-thin.txt  | 188 ++++++++++++++++++++------
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |  11 +-
 include/net/netns/ipv4.h               |   6 +
 include/net/tcp.h                      |  33 +++++
 include/uapi/linux/snmp.h              |   1 +
 include/uapi/linux/tcp.h               |  10 ++
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/proc.c                        |   1 +
 net/ipv4/sysctl_net_ipv4.c             |  43 ++++++
 net/ipv4/tcp.c                         |  42 +++++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_ipv4.c                    |   6 +
 net/ipv4/tcp_output.c                  |  49 ++++---
 net/ipv4/tcp_rdb.c                     | 240 +++++++++++++++++++++++++++++++++
 17 files changed, 619 insertions(+), 63 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

-- 
2.1.4

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (19 preceding siblings ...)
  (?)
@ 2016-06-22 14:56 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-06-22 14:56 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

The existing mechanism for detecting thin streams,
tcp_stream_is_thin(), is based on a static limit of less than 4
packets in flight. This treats streams differently depending on the
connection's RTT, such that a stream on a high RTT link may never be
considered thin, whereas the same application would produce a stream
that would always be thin in a low RTT scenario (e.g. data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin
stream detection will be independent of the RTT and treat streams
equally based on the transmission pattern, i.e. the inter-transmission
time (ITT).
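
For example, with the default ITT lower bound of 10000 us (10 ms), a
connection with an RTT of 100 ms is classified as thin while it has fewer
than roughly 100 ms / 10 ms = 10 packets in flight, whereas the static
tcp_stream_is_thin() test always uses a limit of 4.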

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/net/netns/ipv4.h               |  1 +
 include/net/tcp.h                      | 21 +++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 5 files changed, 40 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 9ae9293..d856b98 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -718,6 +718,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound inter-transmission time (ITT) threshold
+	for when a stream is considered thin. The value is specified in
+	microseconds, and may not be lower than 10000 (10 ms). Based on
+	this threshold, a dynamic packets in flight limit (DPIFL) is
+	calculated, which is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d061ffe..71be4ac 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -111,6 +111,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_orphan_retries;
 	int sysctl_tcp_fin_timeout;
 	unsigned int sysctl_tcp_notsent_lowat;
+	int sysctl_tcp_thin_dpifl_itt_lower_bound;
 
 	int sysctl_igmp_max_memberships;
 	int sysctl_igmp_max_msf;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a79894b..9956af9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -214,6 +214,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6       /* After 6 linear retries, do exp. backoff */
+/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000
 
 /* TCP initial congestion window as per rfc6928 */
 #define TCP_INIT_CWND		10
@@ -1652,6 +1654,25 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Test if the stream is thin based on
+ *                              dynamic PIF limit (DPIFL)
+ * @sk: socket
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct sock *sk)
+{
+	/* Calculate the maximum allowed PIF limit by dividing the RTT by
+	 * the minimum allowed inter-transmission time (ITT).
+	 * Tests if PIF < RTT / ITT-lower-bound
+	 */
+	return (u64) tcp_packets_in_flight(tcp_sk(sk)) *
+		sock_net(sk)->ipv4.sysctl_tcp_thin_dpifl_itt_lower_bound <
+		(tcp_sk(sk)->srtt_us >> 3);
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1cb67de..150969d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -960,6 +961,14 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound",
+		.data		= &init_net.ipv4.sysctl_tcp_thin_dpifl_itt_lower_bound,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_thin_dpifl_itt_lower_bound),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
+	},
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	{
 		.procname	= "fib_multipath_use_neigh",
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3708de2..4e5e8e6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2412,6 +2412,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_orphan_retries = 0;
 	net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
 	net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX;
+	net->ipv4.sysctl_tcp_thin_dpifl_itt_lower_bound = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 	return 0;
 fail:
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v7 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` Bendik Rønning Opstad
                   ` (20 preceding siblings ...)
  (?)
@ 2016-06-22 14:56 ` Bendik Rønning Opstad
  -1 siblings, 0 replies; 81+ messages in thread
From: Bendik Rønning Opstad @ 2016-06-22 14:56 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Yuchung Cheng, Eric Dumazet, Neal Cardwell, Andreas Petlund,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.

Latency-sensitive applications or services, such as online games,
remote control systems, and VoIP, produce traffic with thin-stream
characteristics: small packets and relatively high
inter-transmission times (ITT). When experiencing packet loss, such
latency-sensitive applications are heavily penalized by the need to
retransmit lost packets, which increases the latency by a minimum of
one RTT for the lost packet. Packets coming after a lost packet are
held back due to head-of-line blocking, causing increased delays for
all data segments until the lost packet has been retransmitted.

RDB enables a TCP sender to bundle redundant (already sent) data with
TCP packets containing small segments of new data. By resending
un-ACKed data from the output queue in packets with new data, RDB
reduces the need to retransmit data segments on connections
experiencing sporadic packet loss. By avoiding a retransmission, RDB
evades the latency increase of at least one RTT for the lost packet
and alleviates head-of-line blocking for the packets following it.
This makes the TCP connection more resistant to
latency fluctuations, and reduces the application layer latency
significantly in lossy environments.
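
To illustrate (this example is not part of the patch; it assumes
bundling is active, each write fits in one packet, Nagle is disabled,
and a limit of one bundled packet), three consecutive small writes
A, B and C would be carried roughly as:

  pkt 1: [A]      pkt 2: [A|B]      pkt 3: [B|C]

If pkt 1 is lost, pkt 2 still delivers A, so the receiver does not
have to wait an RTT for a retransmission of A, and B and C are not
held back by head-of-line blocking.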

Main functionality added:

  o When a packet is scheduled for transmission, RDB builds and
    transmits a new SKB containing both the unsent data and data
    from previously sent packets in the TCP output queue.

  o RDB will only be used for streams classified as thin by the
    function tcp_stream_is_thin_dpifl(). This enforces a lower bound
    on the ITT for streams that may benefit from RDB, controlled by
    the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to lack of dupACKs. This is because the loss is "repaired" by the
    redundant data in the packet coming after the lost packet. Based on
    incoming ACKs, such hidden loss events are detected, and CWR state
    is entered (illustrated in the example below).
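
    A sketch of such a hidden loss (sequence numbers are made up for
    illustration): packet 1 carrying bytes 100:200 is lost, and the
    following RDB packet carries bytes 100:300. The receiver sees an
    in-order segment and ACKs 300 without ever sending a dupACK. On
    the sender side, the ACKed SKB bundles data (starting at seq 100)
    from a prior SKB that was never ACKed on its own, so the loss of
    packet 1 is presumed and CWR state is entered.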

RDB can be enabled on a connection with the socket option TCP_RDB, or
on all new connections by setting the sysctl variable
net.ipv4.tcp_rdb to 2.
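
A minimal user-space sketch of enabling it per socket (not part of the
patch; error handling is trimmed, and TCP_RDB/TCP_RDB_ENABLE must be
available in the installed uapi headers):

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  /* Enable RDB (and TCP_NODELAY) on an already created TCP socket. */
  static int enable_rdb(int fd)
  {
          int one = 1;
          int rdb = TCP_RDB_ENABLE; /* OR in TCP_RDB_BUNDLE_IMMEDIATE to
                                     * bundle before congestion is seen */

          if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)))
                  return -1;
          return setsockopt(fd, IPPROTO_TCP, TCP_RDB, &rdb, sizeof(rdb));
  }

As noted in tcp-thin.txt, RDB is only fully effective with Nagle
disabled, hence the TCP_NODELAY call.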

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  35 +++++
 Documentation/networking/tcp-thin.txt  | 188 ++++++++++++++++++++------
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |  11 +-
 include/net/netns/ipv4.h               |   5 +
 include/net/tcp.h                      |  12 ++
 include/uapi/linux/snmp.h              |   1 +
 include/uapi/linux/tcp.h               |  10 ++
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/proc.c                        |   1 +
 net/ipv4/sysctl_net_ipv4.c             |  34 +++++
 net/ipv4/tcp.c                         |  42 +++++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_ipv4.c                    |   5 +
 net/ipv4/tcp_output.c                  |  49 ++++---
 net/ipv4/tcp_rdb.c                     | 240 +++++++++++++++++++++++++++++++++
 17 files changed, 579 insertions(+), 63 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d856b98..d26d12b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -726,6 +726,41 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	calculated, which is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - INTEGER
+	Controls the use of the Redundant Data Bundling (RDB) mechanism
+	for TCP connections.
+
+	RDB is a TCP mechanism aimed at reducing the latency for
+	applications transmitting time-dependent data. By bundling already
+	sent data in packets with new data, RDB alleviates head-of-line
+	blocking on the receiver side by reducing the need to retransmit
+	data segments when packets are lost. See tcp-thin.txt for further
+	details.
+	Possible values:
+	0 - Disable RDB system-wide, i.e. disallow enabling RDB on TCP
+	    sockets with the socket option TCP_RDB.
+	1 - Allow enabling/disabling RDB with the socket option TCP_RDB.
+	2 - Enable RDB by default for all new TCP connections, and allow
+	    modifying the setting per socket with the socket option TCP_RDB.
+	Default: 1
+
+tcp_rdb_await_congestion - BOOLEAN
+	Controls whether an RDB-enabled connection, by default, should
+	postpone bundling until congestion has been detected.
+
+tcp_rdb_max_bytes - INTEGER
+	Enable restriction on how many bytes an RDB packet can contain.
+	This is the total amount of payload including the new unsent data.
+	A value of 0 disables the byte-based limitation.
+	Default: 0
+
+tcp_rdb_max_packets - INTEGER
+	Enable restriction on how many previous packets in the output queue
+	RDB may include data from. A value of 1 will restrict bundling to
+	only the data from the last packet that was sent.
+	A value of 0 disables the packet-based limitation.
+	Default: 1
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/Documentation/networking/tcp-thin.txt b/Documentation/networking/tcp-thin.txt
index 151e229..e3752e7 100644
--- a/Documentation/networking/tcp-thin.txt
+++ b/Documentation/networking/tcp-thin.txt
@@ -1,47 +1,159 @@
 Thin-streams and TCP
-====================
+-----------------------
+
 A wide range of Internet-based services that use reliable transport
-protocols display what we call thin-stream properties. This means
-that the application sends data with such a low rate that the
-retransmission mechanisms of the transport protocol are not fully
-effective. In time-dependent scenarios (like online games, control
-systems, stock trading etc.) where the user experience depends
-on the data delivery latency, packet loss can be devastating for
-the service quality. Extreme latencies are caused by TCP's
-dependency on the arrival of new data from the application to trigger
-retransmissions effectively through fast retransmit instead of
-waiting for long timeouts.
+protocols display what we call thin-stream properties. Traffic with
+thin-stream characteristics, i.e. small packets and a
+relatively high inter-transmission time (ITT), is often produced by
+latency-sensitive applications or services that rely on low
+latencies.
+
+In time-dependent scenarios (like online games, remote desktop,
+control systems, stock trading etc.) where the user experience depends
+on the data delivery latency, packet loss can be devastating for the
+service quality.
+
+Applications with a low write frequency, i.e. that write to the socket
+at a low rate resulting in few packets in flight (PIF), render
+the retransmission mechanisms of the transport protocol ineffective.
+Thin streams experience increased latencies due to TCP's dependency on
+the arrival of dupACKs to trigger retransmissions effectively through
+fast retransmit instead of waiting for long timeouts.
 
 After analysing a large number of time-dependent interactive
-applications, we have seen that they often produce thin streams
-and also stay with this traffic pattern throughout its entire
-lifespan. The combination of time-dependency and the fact that the
-streams provoke high latencies when using TCP is unfortunate.
+applications, we have seen that they often produce thin streams and
+also stay with this traffic pattern throughout their entire lifespan.
+The combination of time-dependency and the fact that the streams
+provoke high latencies when using TCP is unfortunate.
+
+In order to reduce application-layer latency when packets are lost, a
+set of mechanisms has been developed to address these latency issues
+for thin streams.
+
+Two reactive mechanisms will reduce the time it takes to trigger
+retransmits when a stream has fewer than four PIFs:
+
+* TCP_THIN_DUPACK: Do Fast Retransmit on the first dupACK.
+
+* TCP_THIN_LINEAR_TIMEOUTS: Instead of exponential backoff after RTOs,
+  perform up to 6 (TCP_THIN_LINEAR_RETRIES) linear timeouts before
+  initiating exponential backoff.
+
+The threshold of 4 PIFs is used because with fewer than 4 PIFs, the
+three dupACKs usually required to trigger a fast retransmit may not
+be produced, leaving the stream prone to high retransmission
+latencies.
 
-In order to reduce application-layer latency when packets are lost,
-a set of mechanisms has been made, which address these latency issues
-for thin streams. In short, if the kernel detects a thin stream,
-the retransmission mechanisms are modified in the following manner:
+Redundant Data Bundling
+***********************
 
-1) If the stream is thin, fast retransmit on the first dupACK.
-2) If the stream is thin, do not apply exponential backoff.
+Redundant Data Bundling (RDB) is a mechanism aimed at reducing the
+latency for applications sending time-dependent data by proactively
+retransmitting un-ACKed segments. By bundling (retransmitting) already
+sent data with packets containing new data, the connection will be
+more resistant to sporadic packet loss, which reduces the application
+layer latency significantly in congested scenarios.
 
-These enhancements are applied only if the stream is detected as
-thin. This is accomplished by defining a threshold for the number
-of packets in flight. If there are less than 4 packets in flight,
-fast retransmissions can not be triggered, and the stream is prone
-to experience high retransmission latencies.
+Retransmitting data segments before they are known to be lost is a
+proactive approach to preventing increased latencies when packets are
+lost. By bundling redundant data before the retransmission mechanisms
+are triggered, RDB is very effective at alleviating head-of-line
+blocking on the receiving side, simply by reducing the need to perform
+regular retransmissions.
+
+With RDB enabled, an application that writes less frequently than the
+limit defined by the sysctl tcp_thin_dpifl_itt_lower_bound will be
+allowed to bundle.
+
+Using the thin-stream mechanisms
+********************************
 
 Since these mechanisms are targeted at time-dependent applications,
-they must be specifically activated by the application using the
-TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK IOCTLS or the
-tcp_thin_linear_timeouts and tcp_thin_dupack sysctls. Both
-modifications are turned off by default.
-
-References
-==========
-More information on the modifications, as well as a wide range of
-experimental data can be found here:
-"Improving latency for interactive, thin-stream applications over
-reliable transport"
-http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file
+they are off by default.
+
+The socket options TCP_THIN_DUPACK and TCP_THIN_LINEAR_TIMEOUTS can be
+used to enable the mechanisms on a socket. Alternatively, they can be
+enabled system-wide by setting the sysctl variables
+net.ipv4.tcp_thin_dupack and net.ipv4.tcp_thin_linear_timeouts to 1.
+
+Using RDB
+=========
+
+By default, applications are allowed to enable RDB on a socket with
+the socket option TCP_RDB. By setting the sysctl net.ipv4.tcp_rdb=0,
+applications are not allowed to enable RDB on a socket. For testing
+purposes, it is possible to enable RDB system-wide for all new TCP
+connections by setting net.ipv4.tcp_rdb=2.
+
+For RDB to be fully efficient, Nagle must be disabled with the socket
+option TCP_NODELAY.
+
+
+Limitations on how much is bundled
+==================================
+
+Applying limitations on how much RDB may bundle can help control
+RDB's bandwidth usage and its effect on competing traffic. With few
+active RDB-enabled streams, the total increase in bandwidth usage and
+the negative effect on competing traffic will be minimal, unless the
+total bandwidth capacity is very limited.
+
+In scenarios with many RDB-enabled streams, the total effect may
+become significant, which may justify imposing limitations on RDB.
+
+The two sysctls tcp_rdb_max_bytes and tcp_rdb_max_packets contain the
+default values used to limit how much can be bundled with each packet.
+
+tcp_rdb_max_bytes limits the total payload size of an RDB packet,
+including both the new (unsent) data and the already sent (redundant)
+data. tcp_rdb_max_packets specifies how many previous packets may be
+bundled into each RDB packet. This is the most important knob
+as it directly controls how many lost packets each RDB packet may
+recover.
+
+If more fine-grained control is required, tcp_rdb_max_bytes is useful
+to control how much impact RDB can have on the increased bandwidth
+requirements of the flows. If an application writes 700 bytes per
+write call, the bandwidth increase can be quite significant (even with
+a 1 packet bundling limit) if we consider a scenario with thousands of
+RDB streams.
+
+By limiting the total payload size of RDB packets to e.g. 100 bytes,
+only the smallest segments will benefit from RDB, while the segments
+that would increase the bandwidth requirements the most will not.
+
+tcp_rdb_max_packets defaults to 1 as that allows RDB to recover from
+sporadic packet loss while affecting competing traffic only to a
+small degree[2].
+
+The sysctl tcp_rdb_await_congestion specifies whether a connection
+should bundle only after congestion has been detected.
+
+The default bundling limitations defined by the sysctl variables may
+be overridden with the socket options TCP_RDB_MAX_BYTES and
+TCP_RDB_MAX_PACKETS. To ensure bundling is performed immediately
+instead of waiting until after packet loss, pass the following flags
+to the TCP_RDB socket option: (TCP_RDB_ENABLE | TCP_RDB_BUNDLE_IMMEDIATE).
+
+
+Further reading
+***************
+
+[1] provides information on the thin_dupack and thin_linear_timeouts
+modifications, as well as a wide range of experimental data.
+
+[2] presents RDB and the motivation behind the mechanism. [3] provides
+a detailed overview of the RDB mechanism and the experiments performed
+to test the effects of RDB.
+
+[1] "Improving latency for interactive, thin-stream applications over
+     reliable transport"
+    http://urn.nb.no/URN:NBN:no-24274
+
+[2] "Latency and fairness trade-off for thin streams using redundant
+     data bundling in TCP."
+    http://dx.doi.org/10.1109/LCN.2015.7366322
+
+[3] "Taming Redundant Data Bundling: Balancing fairness and latency
+     for redundant bundling in TCP"
+    http://urn.nb.no/URN:NBN:no-48283
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dc0fca7..20c74c3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2973,6 +2973,7 @@ static inline void skb_free_datagram_locked(struct sock *sk,
 	__skb_free_datagram_locked(sk, skb, 0);
 }
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7be9b12..7a53644 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -213,11 +213,12 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 3,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
 		repair      : 1,
-		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
+		frto        : 1,/* F-RTO (RFC5682) activated in CA_Loss */
+		is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
 	u8	repair_queue;
 	u8	do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
 		syn_data:1,	/* SYN includes data */
@@ -225,7 +226,11 @@ struct tcp_sock {
 		syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
 		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
 		save_syn:1,	/* Save headers of SYN packet */
-		is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
+		rdb:1,                  /* Redundant Data Bundling enabled     */
+		rdb_await_congestion:1; /* RDB wait to bundle until next loss  */
+
+	u16 rdb_max_bytes;      /* Max payload bytes in an RDB packet       */
+	u16 rdb_max_packets;    /* Max packets allowed to be bundled by RDB */
 	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
 
 /* RTT measurement */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 71be4ac..eb45f73 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -113,6 +113,11 @@ struct netns_ipv4 {
 	unsigned int sysctl_tcp_notsent_lowat;
 	int sysctl_tcp_thin_dpifl_itt_lower_bound;
 
+	int sysctl_tcp_rdb;
+	int sysctl_tcp_rdb_await_congestion;
+	int sysctl_tcp_rdb_max_bytes;
+	int sysctl_tcp_rdb_max_packets;
+
 	int sysctl_igmp_max_memberships;
 	int sysctl_igmp_max_msf;
 	int sysctl_igmp_llm_reports;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9956af9..013d08a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -541,6 +541,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 bool tcp_may_send_now(struct sock *sk);
 int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
 int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
@@ -560,6 +562,7 @@ void tcp_send_loss_probe(struct sock *sk);
 bool tcp_schedule_loss_probe(struct sock *sk);
 void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 			     const struct sk_buff *next_skb);
+void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb);
 
 /* tcp_input.c */
 void tcp_resume_early_retransmit(struct sock *sk);
@@ -569,6 +572,11 @@ void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 void tcp_fin(struct sock *sk);
 
+/* tcp_rdb.c */
+void tcp_rdb_ack_event(struct sock *sk);
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask);
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
@@ -770,6 +778,7 @@ struct tcp_skb_cb {
 		struct {
 			/* There is space for up to 20 bytes */
 			__u32 in_flight;/* Bytes in flight when packet sent */
+			__u32 rdb_start_seq; /* Start seq of RDB data */
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -1503,6 +1512,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)		\
+	skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+
 static inline struct sk_buff *tcp_send_head(const struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 25a9ad8..0bdeb06 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -280,6 +280,7 @@ enum
 	LINUX_MIB_TCPKEEPALIVE,			/* TCPKeepAlive */
 	LINUX_MIB_TCPMTUPFAIL,			/* TCPMTUPFail */
 	LINUX_MIB_TCPMTUPSUCCESS,		/* TCPMTUPSuccess */
+	LINUX_MIB_TCPRDBLOSSREPAIRS,		/* TCPRDBLossRepairs */
 	__LINUX_MIB_MAX
 };
 
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 53e8e3f..33ece78 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -115,6 +115,9 @@ enum {
 #define TCP_CC_INFO		26	/* Get Congestion Control (optional) info */
 #define TCP_SAVE_SYN		27	/* Record SYN headers for new connections */
 #define TCP_SAVED_SYN		28	/* Get SYN headers recorded for connection */
+#define TCP_RDB			29	/* Enable Redundant Data Bundling mechanism */
+#define TCP_RDB_MAX_BYTES	30	/* Max payload bytes in an RDB packet */
+#define TCP_RDB_MAX_PACKETS	31	/* Max packets allowed to be bundled by RDB */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
@@ -214,4 +217,11 @@ struct tcp_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];		/* key (binary) */
 };
 
+/*
+ * TCP_RDB socket option flags
+ */
+#define TCP_RDB_DISABLE          0 /* Disable RDB */
+#define TCP_RDB_ENABLE           1 /* Enable RDB */
+#define TCP_RDB_BUNDLE_IMMEDIATE 2 /* Force immediate bundling (Do not wait for congestion) */
+
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e7ec6d3..77edf5a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1056,7 +1056,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 24629b6..fac88b5 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tcp_rdb.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..b839022 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -302,6 +302,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPKeepAlive", LINUX_MIB_TCPKEEPALIVE),
 	SNMP_MIB_ITEM("TCPMTUPFail", LINUX_MIB_TCPMTUPFAIL),
 	SNMP_MIB_ITEM("TCPMTUPSuccess", LINUX_MIB_TCPMTUPSUCCESS),
+	SNMP_MIB_ITEM("TCPRDBLossRepairs", LINUX_MIB_TCPRDBLOSSREPAIRS),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 150969d..3b6c3cb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -731,6 +731,40 @@ static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "tcp_rdb",
+		.data		= &init_net.ipv4.sysctl_tcp_rdb,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_rdb),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_await_congestion",
+		.data		= &init_net.ipv4.sysctl_tcp_rdb_await_congestion,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_rdb_await_congestion),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_max_bytes",
+		.data		= &init_net.ipv4.sysctl_tcp_rdb_max_bytes,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_rdb_max_bytes),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "tcp_rdb_max_packets",
+		.data		= &init_net.ipv4.sysctl_tcp_rdb_max_packets,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_rdb_max_packets),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "ip_dynaddr",
 		.data		= &init_net.ipv4.sysctl_ip_dynaddr,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed14..9fb012b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -405,6 +405,12 @@ void tcp_init_sock(struct sock *sk)
 	u64_stats_init(&tp->syncp);
 
 	tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
+
+	tp->rdb = sock_net(sk)->ipv4.sysctl_tcp_rdb == 2;
+	tp->rdb_await_congestion = sock_net(sk)->ipv4.sysctl_tcp_rdb_await_congestion;
+	tp->rdb_max_packets = sock_net(sk)->ipv4.sysctl_tcp_rdb_max_packets;
+	tp->rdb_max_bytes = sock_net(sk)->ipv4.sysctl_tcp_rdb_max_bytes;
+
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
@@ -2416,6 +2422,29 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_RDB:
+		if (val && !sock_net(sk)->ipv4.sysctl_tcp_rdb)
+			err = -EPERM;
+		else {
+			tp->rdb = val & TCP_RDB_ENABLE;
+			tp->rdb_await_congestion = !(val & TCP_RDB_BUNDLE_IMMEDIATE);
+		}
+		break;
+
+	case TCP_RDB_MAX_BYTES:
+		if (val < 0 || val > USHRT_MAX)
+			err = -EINVAL;
+		else
+			tp->rdb_max_bytes = val;
+		break;
+
+	case TCP_RDB_MAX_PACKETS:
+		if (val < 0 || val > USHRT_MAX)
+			err = -EINVAL;
+		else
+			tp->rdb_max_packets = val;
+		break;
+
 	case TCP_REPAIR:
 		if (!tcp_can_repair_sock(sk))
 			err = -EPERM;
@@ -2848,7 +2877,18 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		val = tp->thin_dupack;
 		break;
-
+	case TCP_RDB:
+		if (tp->rdb)
+			val |= TCP_RDB_ENABLE;
+		if (!tp->rdb_await_congestion)
+			val |= TCP_RDB_BUNDLE_IMMEDIATE;
+		break;
+	case TCP_RDB_MAX_BYTES:
+		val = tp->rdb_max_bytes;
+		break;
+	case TCP_RDB_MAX_PACKETS:
+		val = tp->rdb_max_packets;
+		break;
 	case TCP_REPAIR:
 		val = tp->repair;
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff..35a3d1a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3540,6 +3540,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
 
 	if (icsk->icsk_ca_ops->in_ack_event)
 		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+
+	if (unlikely(tcp_sk(sk)->rdb))
+		tcp_rdb_ack_event(sk);
 }
 
 /* Congestion control has updated the cwnd already. So if we're in
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4e5e8e6..7f06c52 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2395,6 +2395,11 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_ecn_fallback = 1;
 
+	net->ipv4.sysctl_tcp_rdb = 1;
+	net->ipv4.sysctl_tcp_rdb_await_congestion = 1;
+	net->ipv4.sysctl_tcp_rdb_max_bytes = 0;
+	net->ipv4.sysctl_tcp_rdb_max_packets = 1;
+
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b1bcba0..30f2d47 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -897,8 +897,8 @@ out:
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-			    gfp_t gfp_mask)
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet;
@@ -2129,9 +2129,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+		if (unlikely(tcp_sk(sk)->rdb)) {
+			if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
+				break;
+		} else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
 			break;
-
+		}
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.
@@ -2472,15 +2475,33 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 	}
 }
 
+/**
+ * tcp_skb_append_data() - copy the linear data from an SKB to the end
+ *                         of another and update end sequence number
+ *                         and checksum
+ * @from_skb: the SKB to copy data from
+ * @to_skb: the SKB to copy data to
+ */
+void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
+				  from_skb->len);
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+
+}
+
 /* Collapses two adjacent SKB's during retransmission. */
 static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
-	int skb_size, next_skb_size;
-
-	skb_size = skb->len;
-	next_skb_size = next_skb->len;
 
 	BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
 
@@ -2488,17 +2509,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	tcp_unlink_write_queue(next_skb, sk);
 
-	skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
-				  next_skb_size);
-
-	if (next_skb->ip_summed == CHECKSUM_PARTIAL)
-		skb->ip_summed = CHECKSUM_PARTIAL;
-
-	if (skb->ip_summed != CHECKSUM_PARTIAL)
-		skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
-
-	/* Update sequence range on original skb. */
-	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
+	tcp_skb_append_data(next_skb, skb);
 
 	/* Merge over control information. This moves PSH/FIN etc. over */
 	TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..0c1790a
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,240 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+/**
+ * rdb_detect_loss() - perform RDB loss detection by analysing ACKs
+ * @sk: socket
+ *
+ * Traverse the output queue and check if the ACKed packet is an RDB
+ * packet and if the redundant data covers one or more un-ACKed SKBs.
+ * If the incoming ACK acknowledges multiple SKBs, we can presume
+ * packet loss has occurred.
+ *
+ * We can infer packet loss this way because we can expect one ACK per
+ * transmitted data packet, as a receiver disables delayed ACKs when
+ * it receives packets whose sequence number is not the next expected
+ * sequence number.
+ *
+ * Return: 1 if packet loss, else 0
+ */
+static unsigned int rdb_detect_loss(struct sock *sk)
+{
+	struct sk_buff *skb, *tmp;
+	struct tcp_skb_cb *scb;
+	u32 seq_acked = tcp_sk(sk)->snd_una;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == tcp_send_head(sk))
+			break;
+
+		scb = TCP_SKB_CB(skb);
+		/* The ACK acknowledges parts of the data in this SKB.
+		 * Can be caused by:
+		 * - TSO: We abort as RDB is not used on SKBs split across
+		 *        multiple packets on lower layers as these are greater
+		 *        than one MSS.
+		 * - Retrans collapse: We've had a retrans, so loss has already
+		 *                     been detected.
+		 */
+		if (after(scb->end_seq, seq_acked))
+			break;
+		else if (scb->end_seq != seq_acked)
+			continue;
+
+		/* We have found the ACKed packet */
+
+		/* This packet was sent with no redundant data, or no prior
+		 * un-ACKed SKBs are in the output queue, so break here.
+		 */
+		if (scb->tx.rdb_start_seq == scb->seq ||
+		    skb_queue_is_first(&sk->sk_write_queue, skb))
+			break;
+		/* Find number of prior SKBs whose data was bundled in this
+		 * (ACKed) SKB. We presume any redundant data covering previous
+		 * SKBs is due to loss (an exception would be reordering).
+		 */
+		skb = skb->prev;
+		tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+			if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
+				break;
+			return 1;
+		}
+		break;
+	}
+	return 0;
+}
+
+/**
+ * tcp_rdb_ack_event() - initiate RDB loss detection
+ * @sk: socket
+ *
+ * When RDB is able to repair a packet loss, the loss event is hidden
+ * from the regular loss detection mechanisms. To ensure RDB streams
+ * behave fairly towards competing TCP traffic, we call tcp_enter_cwr()
+ * to enter congestion window reduction state.
+ * tcp_enter_cwr() disables undoing the CWND reduction, which avoids
+ * incorrectly undoing the reduction later on.
+ */
+void tcp_rdb_ack_event(struct sock *sk)
+{
+	unsigned int lost = rdb_detect_loss(sk);
+	if (lost) {
+		tcp_enter_cwr(sk);
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRDBLOSSREPAIRS);
+	}
+}
+
+/**
+ * rdb_build_skb() - build a new RDB SKB and copy redundant + unsent
+ *                   data to the linear page buffer
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission in the output engine
+ * @first_skb: the first SKB in the output queue to be bundled
+ * @bytes_in_rdb_skb: the total number of data bytes for the new
+ *                    rdb_skb (NEW + Redundant)
+ * @gfp_mask: gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory
+ *         allocation failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb = first_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, false);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	/* Start on first_skb and append payload from each SKB in the output
+	 * queue onto rdb_skb until we reach xmit_skb.
+	 */
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		tcp_skb_append_data(tmp_skb, rdb_skb);
+
+		/* We reached xmit_skb, containing the unsent data */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_test() - test if redundant data can be bundled
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @max_payload: the maximum allowed payload bytes for the RDB SKB
+ * @bytes_in_rdb_skb: store the total number of payload bytes in the
+ *                    RDB SKB if bundling can be performed
+ *
+ * Traverse the output queue and check if any un-acked data may be
+ * bundled.
+ *
+ * Return: The first SKB to be in the bundle, or NULL if no bundling
+ */
+static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
+					   struct sk_buff *xmit_skb,
+					   unsigned int max_payload,
+					   u32 *bytes_in_rdb_skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */
+	u32 total_payload = xmit_skb->len;
+
+	if (tp->rdb_max_bytes)
+		max_payload = min_t(unsigned int, max_payload,
+				    tp->rdb_max_bytes);
+
+	/* We start at xmit_skb->prev, and go backwards */
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		/* Including data from this SKB would exceed payload limit */
+		if ((total_payload + skb->len) > max_payload)
+			break;
+
+		if (tp->rdb_max_packets &&
+		    (skbs_in_bundle_count > tp->rdb_max_packets))
+			break;
+
+		total_payload += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = total_payload;
+	return first_to_bundle;
+}
+
+/**
+ * tcp_transmit_rdb_skb() - try to create and send an RDB packet
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @mss_now: current mss value
+ * @gfp_mask: gfp_t allocation
+ *
+ * If an RDB packet could not be created and sent, transmit the
+ * original unmodified SKB (xmit_skb).
+ *
+ * Return: 0 if successfully sent packet, else error from
+ *         tcp_transmit_skb
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb = NULL;
+	struct sk_buff *first_to_bundle;
+	u32 bytes_in_rdb_skb = 0;
+
+	/* How we detect that RDB was used. When equal, no RDB data was sent */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	/* Postpone bundling until congestion has been detected */
+	if (tcp_sk(sk)->rdb_await_congestion) {
+		if (tcp_in_initial_slowstart(tcp_sk(sk)))
+			goto xmit_default;
+		tcp_sk(sk)->rdb_await_congestion = 0;
+	}
+
+	if (!tcp_stream_is_thin_dpifl(sk))
+		goto xmit_default;
+
+	/* No bundling if first in queue */
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb))
+		goto xmit_default;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
+					      &bytes_in_rdb_skb);
+	if (!first_to_bundle)
+		goto xmit_default;
+
+	/* Create an SKB that contains redundant data starting from
+	 * first_to_bundle.
+	 */
+	rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle,
+				bytes_in_rdb_skb, gfp_mask);
+	if (!rdb_skb)
+		goto xmit_default;
+
+	/* Set skb_mstamp for the SKB in the output queue (xmit_skb) containing
+	 * the yet unsent data. Normally this would be done by
+	 * tcp_transmit_skb(), but as we pass in rdb_skb instead, xmit_skb's
+	 * timestamp will not be touched.
+	 */
+	skb_mstamp_get(&xmit_skb->skb_mstamp);
+	rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+	return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2016-06-22 14:56 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-23 20:50 [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB) Bendik Rønning Opstad
2015-10-23 20:50 ` Bendik Rønning Opstad
2015-10-23 20:50 ` [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
2015-10-23 20:50   ` Bendik Rønning Opstad
2015-10-23 21:44   ` Eric Dumazet
2015-10-23 21:44     ` Eric Dumazet
2015-10-25  5:56     ` Bendik Rønning Opstad
2015-10-25  5:56       ` Bendik Rønning Opstad
2015-10-23 20:50 ` [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
2015-10-23 20:50   ` Bendik Rønning Opstad
2015-10-26 14:50   ` Neal Cardwell
2015-10-26 14:50     ` Neal Cardwell
2015-10-26 21:35     ` Andreas Petlund
2015-10-26 21:35       ` Andreas Petlund
2015-10-26 21:58       ` Yuchung Cheng
2015-10-26 21:58         ` Yuchung Cheng
2015-10-27 19:15         ` Jonas Markussen
2015-10-27 19:15           ` Jonas Markussen
2015-10-29 22:53         ` Bendik Rønning Opstad
2015-10-29 22:53           ` Bendik Rønning Opstad
2015-11-02  9:18           ` David Laight
2015-11-02  9:18             ` David Laight
2015-11-02  9:37   ` David Laight
2015-11-02  9:37     ` David Laight
2015-11-05  2:06     ` Bendik Rønning Opstad
2015-11-05  2:06       ` Bendik Rønning Opstad
2015-11-05  2:06       ` Bendik Rønning Opstad
2015-10-24  6:11 ` [PATCH RFC net-next 0/2] tcp: " Yuchung Cheng
2015-10-24  6:11   ` Yuchung Cheng
2015-10-24  6:11   ` Yuchung Cheng
2015-10-24  8:00   ` Jonas Markussen
2015-10-24  8:00     ` Jonas Markussen
2015-10-24 12:57     ` Eric Dumazet
2015-10-24 12:57       ` Eric Dumazet
2015-11-09 19:40       ` Bendik Rønning Opstad
2015-11-23 16:26 ` [PATCH RFC v2 " Bendik Rønning Opstad
2015-11-23 16:26 ` [PATCH RFC v2 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
2015-11-23 16:26 ` [PATCH RFC v2 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
2015-11-23 17:43   ` Eric Dumazet
2015-11-23 20:05     ` Bendik Rønning Opstad
2016-02-02 19:23 ` [PATCH v3 net-next 0/2] tcp: " Bendik Rønning Opstad
2016-02-02 19:23 ` [PATCH v3 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
2016-02-02 19:23 ` [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
2016-02-02 20:35   ` Eric Dumazet
2016-02-03 18:17     ` Bendik Rønning Opstad
2016-02-03 19:34       ` Eric Dumazet
     [not found]         ` <CAF8eE=VOuoNLQHtkRwM9ZG+vJ-uH2ufVW5y_pS24rGqWh4Qa2g@mail.gmail.com>
2016-02-08 17:30           ` Bendik Rønning Opstad
2016-02-08 17:38         ` Bendik Rønning Opstad
2016-02-16 13:51 ` [PATCH v4 net-next 0/2] tcp: " Bendik Rønning Opstad
2016-02-16 13:51 ` [PATCH v4 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
2016-02-16 13:51 ` [PATCH v4 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
2016-02-18 15:18   ` Eric Dumazet
2016-02-19 14:12     ` Bendik Rønning Opstad
2016-02-24 21:12 ` [PATCH v5 net-next 0/2] tcp: " Bendik Rønning Opstad
2016-02-24 21:12 ` [PATCH v5 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
2016-02-24 21:12 ` [PATCH v5 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
2016-03-02 19:52   ` David Miller
2016-03-02 22:33     ` Bendik Rønning Opstad
2016-03-03 18:06 ` [PATCH v6 net-next 0/2] tcp: " Bendik Rønning Opstad
2016-03-07 19:36   ` David Miller
2016-03-10  0:20   ` Yuchung Cheng
2016-03-10  1:45     ` Jonas Markussen
2016-03-10  2:27       ` Yuchung Cheng
2016-03-12  9:23         ` Jonas Markussen
2016-03-13 23:18     ` Bendik Rønning Opstad
2016-03-14 21:59       ` Yuchung Cheng
2016-03-18 14:25         ` Bendik Rønning Opstad
2016-03-03 18:06 ` [PATCH v6 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
2016-03-03 18:06 ` [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
2016-03-14 21:15   ` Eric Dumazet
2016-03-15  1:04     ` Rick Jones
2016-03-15 18:09       ` Yuchung Cheng
2016-03-18 17:58     ` Bendik Rønning Opstad
2016-03-14 21:54   ` Yuchung Cheng
2016-03-15  0:40     ` Bill Fink
2016-03-17 23:26     ` Bendik Rønning Opstad
2016-03-21 18:54       ` Yuchung Cheng
2016-06-16 17:12         ` Bendik Rønning Opstad
2016-06-22 14:56 ` [PATCH v7 net-next 0/2] tcp: " Bendik Rønning Opstad
2016-06-22 14:56 ` [PATCH v7 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
2016-06-22 14:56 ` [PATCH v7 net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
