* [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
@ 2015-10-23 20:50 Bendik Rønning Opstad
  2015-10-23 20:50 ` [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-23 20:50 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet
  Cc: Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad


This is a request for comments.

Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.
Latency-sensitive applications or services, such as online games and
remote desktop, produce traffic with thin-stream characteristics:
small packets and a relatively high inter-transmission time (ITT). By
bundling already sent data in packets with new data, RDB alleviates
head-of-line blocking by reducing the need to retransmit data segments
when packets are lost. RDB is a continuation of the work on latency
improvements for TCP in Linux, which previously resulted in two
thin-stream mechanisms in the Linux kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).
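
To illustrate the idea (the segment names and sizes below are
hypothetical): an application performs three small writes A, B and C,
each far below the MSS. Without RDB, losing the packet that carries A
delays delivery of B and C to the application until a retransmission
is triggered. With RDB, each outgoing packet also carries previously
sent, still unacknowledged segments, as long as the total payload fits
within the MSS:

  Packet 1: [A]      <- lost
  Packet 2: [A|B]    <- delivers both A and B; the loss of packet 1 is
                        repaired without waiting for a retransmission
  Packet 3: [B|C]

The receiver simply sees regular TCP segments with partly overlapping
sequence ranges, so no changes are required on the receiving side.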

The RDB implementation has been thoroughly tested and shows
significant latency reductions when packet loss occurs [1]. The tests
show that, by imposing restrictions on the bundling rate, RDB can be
kept from unfairly affecting competing traffic.

Note: Current patch set depends on a recently submitted patch for
tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674)

These patches have been tested with a set of packetdrill scripts located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper
"Latency and Fairness Trade-Off for Thin Streams using Redundant Data
Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf


Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23 +++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   9 +-
 include/net/tcp.h                      |  34 ++++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  35 ++++
 net/ipv4/tcp.c                         |  19 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_rdb.c                     | 281 +++++++++++++++++++++++++++++++++
 12 files changed, 415 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

-- 
1.9.1



* [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50 [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB) Bendik Rønning Opstad
@ 2015-10-23 20:50 ` Bendik Rønning Opstad
  2015-10-23 21:44   ` Eric Dumazet
  2015-10-23 20:50 ` [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
  2015-10-24  6:11 ` [PATCH RFC net-next 0/2] tcp: " Yuchung Cheng
  2 siblings, 1 reply; 16+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-23 20:50 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet
  Cc: Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

The existing mechanism for detecting thin streams (tcp_stream_is_thin)
is based on a static limit of less than 4 packets in flight. This treats
streams differently depending on the connection's RTT, such that a stream
on a high-RTT link may never be considered thin, whereas the same
application would produce a stream that would always be thin in a low-RTT
scenario (e.g. a data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin stream
detection will be independent of the RTT and treat streams equally based
on the transmission pattern, i.e. the inter-transmission time (ITT).
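
As a worked example (the numbers are hypothetical): with a smoothed
RTT of 100 ms (100000 us) and the default ITT lower bound of 10000 us,
the dynamic limit becomes 100000 / 10000 = 10, so the stream is
classified as thin as long as fewer than 10 packets are in flight.
With an RTT of 20 ms the limit becomes 2, so only streams with at most
one packet in flight are considered thin.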

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/linux/tcp.h                    |  6 ++++++
 include/net/tcp.h                      | 20 ++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp.c                         |  3 +++
 5 files changed, 46 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 85752c8..b841a76 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -700,6 +700,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound for the ITT (inter-transmission time)
+	threshold for when a stream is considered thin. The value is specified
+	in microseconds, and may not be lower than 10000 (10 ms). This threshold
+	is used to calculate a dynamic packets in flight limit (DPIFL) which
+	is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c906f45..fc885db 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -269,6 +269,12 @@ struct tcp_sock {
 	struct sk_buff* lost_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 
+	/* The limit used to identify when a stream is thin based on a minimum
+	 * allowed inter-transmission time (ITT) in microseconds. This is used
+	 * to dynamically calculate a max packets in flight limit (DPIFL).
+	 */
+	int thin_dpifl_itt_lower_bound;
+
 	/* OOO segments go in this list. Note that socket lock must be held,
 	 * as we do not use sk_buff_head lock.
 	 */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4fc457b..6534836 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -215,6 +215,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6       /* After 6 linear retries, do exp. backoff */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000  /* Minimum lower bound is 10 ms (10000 usec) */
 
 /* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */
 #define TCP_INIT_CWND		10
@@ -274,6 +275,7 @@ extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -1631,6 +1633,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
+ *                              limit
+ * @tp: the tcp_sock struct
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
+{
+	u64 dpif_lim = tp->srtt_us >> 3;
+	/* Divide by thin_dpifl_itt_lower_bound, the minimum allowed ITT
+	 * (inter-transmission time) in usecs.
+	 */
+	do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound);
+	return tcp_packets_in_flight(tp) < dpif_lim;
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 25300c5..917fdde 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -42,6 +42,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -709,6 +710,14 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound",
+		.data		= &sysctl_tcp_thin_dpifl_itt_lower_bound,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0cfa7c0..f712d7c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -287,6 +287,8 @@ int sysctl_tcp_min_tso_segs __read_mostly = 2;
 
 int sysctl_tcp_autocorking __read_mostly = 1;
 
+int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -406,6 +408,7 @@ void tcp_init_sock(struct sock *sk)
 	u64_stats_init(&tp->syncp);
 
 	tp->reordering = sysctl_tcp_reordering;
+	tp->thin_dpifl_itt_lower_bound = sysctl_tcp_thin_dpifl_itt_lower_bound;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
-- 
1.9.1



* [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB) Bendik Rønning Opstad
  2015-10-23 20:50 ` [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
@ 2015-10-23 20:50 ` Bendik Rønning Opstad
  2015-10-26 14:50   ` Neal Cardwell
  2015-11-02  9:37   ` David Laight
  2015-10-24  6:11 ` [PATCH RFC net-next 0/2] tcp: " Yuchung Cheng
  2 siblings, 2 replies; 16+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-23 20:50 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet
  Cc: Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

RDB is a mechanism that enables a TCP sender to bundle redundant
(already sent) data with TCP packets containing new data. By bundling
(retransmitting) already sent data with each TCP packet containing new
data, the connection becomes more resistant to sporadic packet loss,
which significantly reduces the application layer latency in congested
scenarios.

The main functionality added:

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to the lack of dupACKs. This is because the loss is "repaired" by
    the redundant data in the packet coming after the lost packet.
    Based on incoming ACKs, such hidden loss events are detected, and
    CWR state is entered.

  o When packets are scheduled for transmission, RDB replaces the SKB to
    be sent with a modified SKB containing the redundant data of
    previously sent data segments from the TCP output queue.

  o RDB will only be used for streams classified as thin by the function
    tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT
    for streams that may benefit from RDB, controlled by the sysctl
    variable tcp_thin_dpifl_itt_lower_bound.

RDB is enabled on a connection with the socket option TCP_RDB, or on all
new connections by setting the sysctl variable tcp_rdb=1.
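
A minimal usage sketch (illustrative only; it assumes the TCP_RDB
constant from this patch is available, and otherwise defines it to the
value proposed here):

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  #ifndef TCP_RDB
  #define TCP_RDB 29	/* socket option value proposed by this patch */
  #endif

  static int enable_rdb(int fd)
  {
  	int one = 1;

  	/* Enable RDB on this connection. With this patch the option
  	 * also disables Nagle, but setting TCP_NODELAY explicitly
  	 * makes the intent clear.
  	 */
  	if (setsockopt(fd, IPPROTO_TCP, TCP_RDB, &one, sizeof(one)) < 0)
  		return -1;
  	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
  }

Alternatively, sysctl net.ipv4.tcp_rdb=1 enables RDB for all new
connections.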

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  15 ++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  14 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  26 +++
 net/ipv4/tcp.c                         |  16 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_rdb.c                     | 281 +++++++++++++++++++++++++++++++++
 12 files changed, 369 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index b841a76..740e6a3 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -708,6 +708,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - BOOLEAN
+	Enable RDB for all new TCP connections.
+	Default: 0
+
+tcp_rdb_max_bytes - INTEGER
+	Restrict the total payload size of an RDB packet, including the
+	new unsent data. A value of 0 means no limit.
+	Default: 0
+
+tcp_rdb_max_skbs - INTEGER
+	Restrict how many previous SKBs in the output queue RDB may
+	bundle data from. A value of 1 restricts bundling to the data
+	from the last packet that was sent. A value of 0 means no limit.
+	Default: 1
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 24f4dfd..3572d21 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2809,6 +2809,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index fc885db..f38b889 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -202,9 +202,10 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 3,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
+		rdb         : 1,/* Redundant Data Bundling enabled */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6534836..dce46c2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -276,6 +276,9 @@ extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
+extern int sysctl_tcp_rdb;
+extern int sysctl_tcp_rdb_max_bytes;
+extern int sysctl_tcp_rdb_max_skbs;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
@@ -548,6 +551,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 bool tcp_may_send_now(struct sock *sk);
 int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
 int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
@@ -573,6 +578,11 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 
+/* tcp_rdb.c */
+void rdb_ack_event(struct sock *sk, u32 flags);
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask);
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
@@ -771,6 +781,7 @@ struct tcp_skb_cb {
 	union {
 		struct {
 			/* There is space for up to 20 bytes */
+			__u32 rdb_start_seq; /* Start seq of rdb data bundled */
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -1494,6 +1505,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)		\
+	skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+
 static inline struct sk_buff *tcp_send_head(const struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 65a77b0..ae0fba3 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -115,6 +115,7 @@ enum {
 #define TCP_CC_INFO		26	/* Get Congestion Control (optional) info */
 #define TCP_SAVE_SYN		27	/* Record SYN headers for new connections */
 #define TCP_SAVED_SYN		28	/* Get SYN headers recorded for connection */
+#define TCP_RDB			29	/* Enable RDB mechanism */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fab4599..544f8cc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -978,7 +978,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -986,6 +986,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(copy_skb_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index c29809f..f2cf496 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     tcp_rdb.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 917fdde..703078f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -718,6 +718,32 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &tcp_thin_dpifl_itt_lower_bound_min,
 	},
 	{
+		.procname	= "tcp_rdb",
+		.data		= &sysctl_tcp_rdb,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "tcp_rdb_max_bytes",
+		.data		= &sysctl_tcp_rdb_max_bytes,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		/* no upper bound; only enforce non-negative values */
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "tcp_rdb_max_skbs",
+		.data		= &sysctl_tcp_rdb_max_skbs,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "tcp_early_retrans",
 		.data		= &sysctl_tcp_early_retrans,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f712d7c..11d45d4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -289,6 +289,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
 
 int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
+int sysctl_tcp_rdb __read_mostly;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -409,6 +411,7 @@ void tcp_init_sock(struct sock *sk)
 
 	tp->reordering = sysctl_tcp_reordering;
 	tp->thin_dpifl_itt_lower_bound = sysctl_tcp_thin_dpifl_itt_lower_bound;
+	tp->rdb = sysctl_tcp_rdb;
 	tcp_enable_early_retrans(tp);
 	tcp_assign_congestion_control(sk);
 
@@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_RDB:
+		if (val < 0 || val > 1) {
+			err = -EINVAL;
+		} else {
+			tp->rdb = val;
+			tp->nonagle = val;
+		}
+		break;
+
 	case TCP_REPAIR:
 		if (!tcp_can_repair_sock(sk))
 			err = -EPERM;
@@ -2828,7 +2840,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		val = tp->thin_dupack;
 		break;
-
+	case TCP_RDB:
+		val = tp->rdb;
+		break;
 	case TCP_REPAIR:
 		val = tp->repair;
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fdd88c3..a4901b3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3503,6 +3503,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
 
 	if (icsk->icsk_ca_ops->in_ack_event)
 		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+
+	if (unlikely(tcp_sk(sk)->rdb))
+		rdb_ack_event(sk, flags);
 }
 
 /* This routine deals with incoming acks, but not outgoing ones. */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f6f7f9b..6d4ea7d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -900,8 +900,8 @@ out:
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-			    gfp_t gfp_mask)
+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+		     gfp_t gfp_mask)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet;
@@ -2113,9 +2113,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+		if (unlikely(tcp_sk(sk)->rdb)) {
+			if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
+				break;
+		} else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
 			break;
-
+		}
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..37faf35
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,281 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+int sysctl_tcp_rdb_max_bytes __read_mostly;
+int sysctl_tcp_rdb_max_skbs __read_mostly = 1;
+
+/**
+ * rdb_check_rtx_queue_loss() - Perform loss detection by analysing acks.
+ * @sk: the socket.
+ * @seq_acked: The sequence number that was acked.
+ *
+ * Return: The number of packets that are presumed to be lost.
+ */
+static int rdb_check_rtx_queue_loss(struct sock *sk, u32 seq_acked)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb, *tmp, *prev_skb = NULL;
+	struct sk_buff *send_head = tcp_send_head(sk);
+	struct tcp_skb_cb *scb;
+	bool fully_acked = true;
+	int lost_count = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == send_head)
+			break;
+
+		scb = TCP_SKB_CB(skb);
+
+		/* Determine how many packets and what bytes were acked, no TSO
+		 * support
+		 */
+		if (after(scb->end_seq, tp->snd_una)) {
+			if (tcp_skb_pcount(skb) == 1 ||
+			    !after(tp->snd_una, scb->seq)) {
+				break;
+			}
+
+			/* We do not handle SKBs with gso_segs > 1 */
+			if (tcp_skb_pcount(skb) > 1)
+				break;
+			fully_acked = false;
+		}
+
+		/* Acks up to this SKB */
+		if (scb->end_seq == seq_acked) {
+			/* This SKB was sent with bundled RDB data, and the
+			 * ACK covers data from previous SKBs
+			 */
+			if (TCP_SKB_CB(skb)->tx.rdb_start_seq != scb->seq &&
+			    prev_skb) {
+				/* Count how many previous packets were acked
+				 * only via bundled data (and thereby lost)
+				 */
+				tcp_for_write_queue(tmp, sk) {
+					/* We have reached the acked SKB */
+					if (tmp == skb)
+						break;
+					lost_count++;
+				}
+			}
+			break;
+		}
+		if (!fully_acked)
+			break;
+		prev_skb = skb;
+	}
+	return lost_count;
+}
+
+/**
+ * rdb_ack_event() - Initiate loss detection
+ * @sk: the socket
+ * @flags: The flags
+ */
+void rdb_ack_event(struct sock *sk, u32 flags)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	if (rdb_check_rtx_queue_loss(sk, tp->snd_una))
+		tcp_enter_cwr(sk);
+}
+
+/**
+ * skb_append_data() - Copy data from an SKB to the end of another
+ * @from_skb: The SKB to copy data from
+ * @to_skb: The SKB to copy data to
+ */
+static int skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	/* Copy the linear data and the data from the frags into the linear page
+	 * buffer of to_skb.
+	 */
+	if (WARN_ON(skb_copy_bits(from_skb, 0,
+				  skb_put(to_skb, from_skb->len),
+				  from_skb->len))) {
+		goto fault;
+	}
+
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+	return 0;
+fault:
+	return -EFAULT;
+}
+
+/**
+ * rdb_build_skb() - Builds the new RDB SKB and copies all the data into the
+ *                   linear page buffer.
+ * @sk: the socket
+ * @xmit_skb: This is the SKB that tcp_write_xmit wants to send
+ * @first_skb: The first SKB in the output queue we will bundle
+ * @bytes_in_rdb_skb: The total number of data bytes for the new rdb_skb
+ *                    (new + redundant)
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory allocation
+ *         failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, true);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	tmp_skb = first_skb;
+
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		/* Copy data from tmp_skb to rdb_skb */
+		if (skb_append_data(tmp_skb, rdb_skb))
+			return NULL;
+		/* We are at the last skb that should be included (The unsent
+		 * one)
+		 */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_check() - check if redundant data can be bundled
+ * @sk: the socket
+ * @xmit_skb: The SKB processed for transmission by the output engine
+ * @mss_now: The current mss value
+ * @bytes_in_rdb_skb: Will contain the resulting number of bytes to bundle
+ *                    at exit.
+ * @skbs_to_bundle_count: The total number of SKBs to be in the bundle
+ *
+ * Traverses the entire write queue and checks if any un-acked data
+ * may be bundled.
+ *
+ * Return: The first SKB to be in the bundle, or NULL if no bundling
+ */
+static struct sk_buff *rdb_can_bundle_check(const struct sock *sk,
+					    struct sk_buff *xmit_skb,
+					    unsigned int mss_now,
+					    u32 *bytes_in_rdb_skb,
+					    u32 *skbs_to_bundle_count)
+{
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* 1 to account for current skb */
+	u32 byte_count = xmit_skb->len;
+
+	/* We start at the skb before xmit_skb, and go backwards in the list. */
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		/* Not enough room to bundle data from this SKB */
+		if ((byte_count + skb->len) > mss_now)
+			break;
+
+		if (sysctl_tcp_rdb_max_bytes &&
+		    ((byte_count + skb->len) > sysctl_tcp_rdb_max_bytes))
+			break;
+
+		if (sysctl_tcp_rdb_max_skbs &&
+		    (skbs_in_bundle_count > sysctl_tcp_rdb_max_skbs))
+			break;
+
+		byte_count += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = byte_count;
+	*skbs_to_bundle_count = skbs_in_bundle_count;
+	return first_to_bundle;
+}
+
+/**
+ * create_rdb_skb() - Try to create RDB SKB
+ * @sk: the socket
+ * @xmit_skb: The SKB that should be sent
+ * @mss_now: Current MSS
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if no bundling could be
+ *         performed
+ */
+struct sk_buff *create_rdb_skb(const struct sock *sk, struct sk_buff *xmit_skb,
+			       unsigned int mss_now, u32 *bytes_in_rdb_skb,
+			       gfp_t gfp_mask)
+{
+	u32 skb_in_bundle_count;
+	struct sk_buff *first_to_bundle;
+
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb))
+		return NULL;
+
+	/* No bundling on FIN packet */
+	if (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN)
+		return NULL;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_check(sk, xmit_skb, mss_now,
+					       bytes_in_rdb_skb,
+					       &skb_in_bundle_count);
+	if (!first_to_bundle)
+		return NULL;
+
+	/* Create an SKB that contains the data from 'skb_in_bundle_count'
+	 * SKBs.
+	 */
+	return rdb_build_skb(sk, xmit_skb, first_to_bundle,
+			     *bytes_in_rdb_skb, gfp_mask);
+}
+
+/**
+ * tcp_transmit_rdb_skb() - Try to create and send an RDB packet
+ * @sk: the socket
+ * @xmit_skb: The SKB processed for transmission by the output engine
+ * @mss_now: Current MSS
+ * @gfp_mask: The gfp_t allocation
+ *
+ * Return: 0 if successfully sent packet, else != 0
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *rdb_skb = NULL;
+	u32 bytes_in_rdb_skb = 0; /* May be used for statistical purposes */
+
+	/* How we detect that RDB was used. When equal, no RDB data was sent */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	if (tcp_stream_is_thin_dpifl(tp)) {
+		rdb_skb = create_rdb_skb(sk, xmit_skb, mss_now,
+					 &bytes_in_rdb_skb, gfp_mask);
+		if (!rdb_skb)
+			goto xmit_default;
+
+		/* Set tstamp for SKB in output queue, because tcp_transmit_skb
+		 * will do this for the rdb_skb and not the SKB in the output
+		 * queue (xmit_skb).
+		 */
+		skb_mstamp_get(&xmit_skb->skb_mstamp);
+		rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+		return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+	}
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
-- 
1.9.1



* Re: [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 20:50 ` [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
@ 2015-10-23 21:44   ` Eric Dumazet
  2015-10-25  5:56     ` Bendik Rønning Opstad
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-10-23 21:44 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Fri, 2015-10-23 at 22:50 +0200, Bendik Rønning Opstad wrote:

>  
> +/**
> + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
> + *                              limit
> + * @tp: the tcp_sock struct
> + *
> + * Return: true if current packets in flight (PIF) count is lower than
> + *         the dynamic PIF limit, else false
> + */
> +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
> +{
> +	u64 dpif_lim = tp->srtt_us >> 3;
> +	/* Divide by thin_dpifl_itt_lower_bound, the minimum allowed ITT
> +	 * (inter-transmission time) in usecs.
> +	 */
> +	do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound);
> +	return tcp_packets_in_flight(tp) < dpif_lim;
> +}
> +
This is very strange :

You are using a do_div() while both operands are 32bits.  A regular
divide would be ok :

u32 dpif_lim = (tp->srtt_us >> 3) / tp->thin_dpifl_itt_lower_bound;

But then, you can avoid the divide by using a multiply, less expensive :

return	(u64)tcp_packets_in_flight(tp) * tp->thin_dpifl_itt_lower_bound <
	(tp->srtt_us >> 3);
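
As a quick sanity check of the rewrite (with hypothetical numbers):
srtt_us >> 3 = 100000 and thin_dpifl_itt_lower_bound = 10000 give a
limit of 10. With 9 packets in flight, 9 * 10000 = 90000 < 100000, so
the stream is classified as thin; with 10 in flight, 10 * 10000 =
100000 is not < 100000, so it is not. Both forms agree with the
divide-based check.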




* Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-23 20:50 [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB) Bendik Rønning Opstad
  2015-10-23 20:50 ` [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
  2015-10-23 20:50 ` [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
@ 2015-10-24  6:11 ` Yuchung Cheng
  2015-10-24  8:00   ` Jonas Markussen
  2 siblings, 1 reply; 16+ messages in thread
From: Yuchung Cheng @ 2015-10-24  6:11 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Neal Cardwell, Tom Herbert, Paolo Abeni,
	Erik Kline, Hannes Frederic Sowa, Al Viro, Jiri Pirko,
	Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Fri, Oct 23, 2015 at 1:50 PM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>
> This is a request for comments.
>
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
> Latency-sensitive applications or services, such as online games and
> remote desktop, produce traffic with thin-stream characteristics:
> small packets and a relatively high inter-transmission time (ITT). By
> bundling already sent data in packets with new data, RDB alleviates
> head-of-line blocking by reducing the need to retransmit data segments
> when packets are lost. RDB is a continuation of the work on latency
> improvements for TCP in Linux, which previously resulted in two
> thin-stream mechanisms in the Linux kernel
> (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).
>
> The RDB implementation has been thoroughly tested and shows
> significant latency reductions when packet loss occurs [1]. The tests
> show that, by imposing restrictions on the bundling rate, RDB can be
> kept from unfairly affecting competing traffic.
>
> Note: Current patch set depends on a recently submitted patch for
> tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674)
>
> These patches have been tested with a set of packetdrill scripts located at
> https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
> (The tests require patching packetdrill with a new socket option:
> https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)
>
> Detailed info about the RDB mechanism can be found at
> http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper

What's the difference between RDB and TCP repacketization
(http://flylib.com/books/en/3.223.1.226/1/)?

Reading the blog page, I am concerned about the amount of
change (esp. on the fast path) just to bundle new writes during timeout &
retransmit, for a specific type of application. Why not just send X
packets with total bytes < MSS on timeout?

> "Latency and Fairness Trade-Off for Thin Streams using Redundant Data
> Bundling in TCP"[2].
>
> [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
> [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf
>
>
> Bendik Rønning Opstad (2):
>   tcp: Add DPIFL thin stream detection mechanism
>   tcp: Add Redundant Data Bundling (RDB)
>
>  Documentation/networking/ip-sysctl.txt |  23 +++
>  include/linux/skbuff.h                 |   1 +
>  include/linux/tcp.h                    |   9 +-
>  include/net/tcp.h                      |  34 ++++
>  include/uapi/linux/tcp.h               |   1 +
>  net/core/skbuff.c                      |   3 +-
>  net/ipv4/Makefile                      |   3 +-
>  net/ipv4/sysctl_net_ipv4.c             |  35 ++++
>  net/ipv4/tcp.c                         |  19 ++-
>  net/ipv4/tcp_input.c                   |   3 +
>  net/ipv4/tcp_output.c                  |  11 +-
>  net/ipv4/tcp_rdb.c                     | 281 +++++++++++++++++++++++++++++++++
>  12 files changed, 415 insertions(+), 8 deletions(-)
>  create mode 100644 net/ipv4/tcp_rdb.c
>
> --
> 1.9.1
>


* Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-24  6:11 ` [PATCH RFC net-next 0/2] tcp: " Yuchung Cheng
@ 2015-10-24  8:00   ` Jonas Markussen
  2015-10-24 12:57     ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Jonas Markussen @ 2015-10-24  8:00 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Bendik Rønning Opstad, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Jonathan Corbet, Eric Dumazet, Neal Cardwell, Tom Herbert,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad




> On 24 Oct 2015, at 08:11, Yuchung Cheng <ycheng@google.com> wrote:
> 
> On Fri, Oct 23, 2015 at 1:50 PM, Bendik Rønning Opstad
> <bro.devel@gmail.com> wrote:
>> 
>> This is a request for comments.
>> 
>> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
>> the latency for applications sending time-dependent data.
>> Latency-sensitive applications or services, such as online games and
>> remote desktop, produce traffic with thin-stream characteristics:
>> small packets and a relatively high inter-transmission time (ITT). By
>> bundling already sent data in packets with new data, RDB alleviates
>> head-of-line blocking by reducing the need to retransmit data segments
>> when packets are lost. RDB is a continuation of the work on latency
>> improvements for TCP in Linux, which previously resulted in two
>> thin-stream mechanisms in the Linux kernel
>> (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).
>> 
>> The RDB implementation has been thoroughly tested and shows
>> significant latency reductions when packet loss occurs [1]. The tests
>> show that, by imposing restrictions on the bundling rate, RDB can be
>> kept from unfairly affecting competing traffic.
>> 
>> Note: Current patch set depends on a recently submitted patch for
>> tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674)
>> 
>> These patches have been tested with a set of packetdrill scripts located at
>> https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
>> (The tests require patching packetdrill with a new socket option:
>> https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)
>> 
>> Detailed info about the RDB mechanism can be found at
>> http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper
> 
> What's the difference between RDB and TCP repacketization
> (http://flylib.com/books/en/3.223.1.226/1/)?
> 
> Reading the blog page, I am concerned about the amount of
> change (esp. on the fast path) just to bundle new writes during timeout &
> retransmit, for a specific type of application. Why not just send X
> packets with total bytes < MSS on timeout?

Repacketization happens only on retransmissions; RDB instead bundles previously sent segments with the next "normal" transmission.

This makes the flow recover the lost segment before a retransmission is triggered by an RTO or fast retransmit.
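
As a simplified, hypothetical timeline: with repacketization, if the packet carrying segment A is lost, nothing happens until an RTO or fast retransmit fires, and only that retransmission carries A together with newer queued data. With RDB, the very next data packet (carrying segment B) already includes A, so the receiver recovers A one inter-transmission time after the loss, without waiting for a retransmission.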

>> "Latency and Fairness Trade-Off for Thin Streams using Redundant Data
>> Bundling in TCP"[2].
>> 
>> [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
>> [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf
>> 
>> 
>> Bendik Rønning Opstad (2):
>>  tcp: Add DPIFL thin stream detection mechanism
>>  tcp: Add Redundant Data Bundling (RDB)
>> 
>> Documentation/networking/ip-sysctl.txt |  23 +++
>> include/linux/skbuff.h                 |   1 +
>> include/linux/tcp.h                    |   9 +-
>> include/net/tcp.h                      |  34 ++++
>> include/uapi/linux/tcp.h               |   1 +
>> net/core/skbuff.c                      |   3 +-
>> net/ipv4/Makefile                      |   3 +-
>> net/ipv4/sysctl_net_ipv4.c             |  35 ++++
>> net/ipv4/tcp.c                         |  19 ++-
>> net/ipv4/tcp_input.c                   |   3 +
>> net/ipv4/tcp_output.c                  |  11 +-
>> net/ipv4/tcp_rdb.c                     | 281 +++++++++++++++++++++++++++++++++
>> 12 files changed, 415 insertions(+), 8 deletions(-)
>> create mode 100644 net/ipv4/tcp_rdb.c
>> 
>> --
>> 1.9.1



* Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
  2015-10-24  8:00   ` Jonas Markussen
@ 2015-10-24 12:57     ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2015-10-24 12:57 UTC (permalink / raw)
  To: Jonas Markussen
  Cc: Yuchung Cheng, Bendik Rønning Opstad, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Jonathan Corbet, Eric Dumazet, Neal Cardwell,
	Tom Herbert, Paolo Abeni, Erik Kline, Hannes Frederic Sowa,
	Al Viro, Jiri Pirko, Alexander Duyck, Florian Westphal,
	Daniel Lee, Marcelo Ricardo Leitner, Daniel Borkmann,
	Willem de Bruijn, Linus Lüssing, linux-doc, linux-kernel,
	netdev, linux-api, Andreas Petlund, Carsten Griwodz,
	Pål Halvorsen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Sat, 2015-10-24 at 08:00 +0000, Jonas Markussen wrote:

> Repacketization happens only on retransmissions; RDB instead bundles previously sent segments with the next "normal" transmission.
> 
> This makes the flow recover the lost segment before a retransmission is triggered by an RTO or fast retransmit.

Thank you for this very high quality patch submission.

Please give us a few days for proper evaluation.

Thanks !




* Re: [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
  2015-10-23 21:44   ` Eric Dumazet
@ 2015-10-25  5:56     ` Bendik Rønning Opstad
  0 siblings, 0 replies; 16+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-25  5:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen

On Friday, October 23, 2015 02:44:14 PM Eric Dumazet wrote:
> On Fri, 2015-10-23 at 22:50 +0200, Bendik Rønning Opstad wrote:
> 
> >  
> > +/**
> > + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF
> > + *                              limit
> > + * @tp: the tcp_sock struct
> > + *
> > + * Return: true if current packets in flight (PIF) count is lower than
> > + *         the dynamic PIF limit, else false
> > + */
> > +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
> > +{
> > +	u64 dpif_lim = tp->srtt_us >> 3;
> > +	/* Divide by thin_dpifl_itt_lower_bound, the minimum allowed ITT
> > +	 * (inter-transmission time) in usecs.
> > +	 */
> > +	do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound);
> > +	return tcp_packets_in_flight(tp) < dpif_lim;
> > +}
> > +
> This is very strange :
> 
> You are using a do_div() while both operands are 32bits.  A regular
> divide would be ok :
> 
> u32 dpif_lim = (tp->srtt_us >> 3) / tp->thin_dpifl_itt_lower_bound;
> 
> But then, you can avoid the divide by using a multiply, less expensive :
> 
> return	(u64)tcp_packets_in_flight(tp) * tp->thin_dpifl_itt_lower_bound <
> 	(tp->srtt_us >> 3);
> 

You are of course correct. Will fix this and use multiply. Thanks.



* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
@ 2015-10-26 14:50   ` Neal Cardwell
  2015-10-26 21:35     ` Andreas Petlund
  2015-11-02  9:37   ` David Laight
  1 sibling, 1 reply; 16+ messages in thread
From: Neal Cardwell @ 2015-10-26 14:50 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Tom Herbert, Yuchung Cheng, Paolo Abeni,
	Erik Kline, Hannes Frederic Sowa, Al Viro, Jiri Pirko,
	Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>@@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
...
> +       case TCP_RDB:
> +               if (val < 0 || val > 1) {
> +                       err = -EINVAL;
> +               } else {
> +                       tp->rdb = val;
> +                       tp->nonagle = val;

The semantics of the tp->nonagle bits are already a bit complex. My
sense is that having a setsockopt of TCP_RDB transparently modify the
nagle behavior is going to add more extra complexity and unanticipated
behavior than is warranted given the slight possible gain in
convenience to the app writer. What about a model where the
application user just needs to remember to call
setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
sensible? I see your nice tests at

   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b

are already doing that. And my sense is that likewise most
well-engineered "thin stream" apps will already be using
setsockopt(TCP_NODELAY). Is that workable?

neal


* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-26 14:50   ` Neal Cardwell
@ 2015-10-26 21:35     ` Andreas Petlund
  2015-10-26 21:58       ` Yuchung Cheng
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Petlund @ 2015-10-26 21:35 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Bendik Rønning Opstad, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Jonathan Corbet, Eric Dumazet, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad


> On 26 Oct 2015, at 15:50, Neal Cardwell <ncardwell@google.com> wrote:
> 
> On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
> <bro.devel@gmail.com> wrote:
>> @@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> ...
>> +       case TCP_RDB:
>> +               if (val < 0 || val > 1) {
>> +                       err = -EINVAL;
>> +               } else {
>> +                       tp->rdb = val;
>> +                       tp->nonagle = val;
> 
> The semantics of the tp->nonagle bits are already a bit complex. My
> sense is that having a setsockopt of TCP_RDB transparently modify the
> nagle behavior is going to add more extra complexity and unanticipated
> behavior than is warranted given the slight possible gain in
> convenience to the app writer. What about a model where the
> application user just needs to remember to call
> setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> sensible? I see your nice tests at
> 
>   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b
> 
> are already doing that. And my sense is that likewise most
> well-engineered "thin stream" apps will already be using
> setsockopt(TCP_NODELAY). Is that workable?

We have been discussing this a bit back and forth. Your suggestion would be the right thing to keep the nagle semantics less complex and to educate developers in the intrinsics of the transport.

We ended up choosing to implicitly disable Nagle since:
1) it is incompatible with the logic of RDB.
2) leaving it up to the developer to read the documentation and notice the line saying that "failing to set TCP_NODELAY will void the RDB latency gain" increases the chance of misconfigurations leading to deployments with no effect.

The hope was to help both the well-engineered thin-stream apps and the ones deployed by developers with less detailed knowledge of the transport.

-Andreas



* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-26 21:35     ` Andreas Petlund
@ 2015-10-26 21:58       ` Yuchung Cheng
  2015-10-27 19:15         ` Jonas Markussen
  2015-10-29 22:53         ` Bendik Rønning Opstad
  0 siblings, 2 replies; 16+ messages in thread
From: Yuchung Cheng @ 2015-10-26 21:58 UTC (permalink / raw)
  To: Andreas Petlund
  Cc: Neal Cardwell, Bendik Rønning Opstad, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Jonathan Corbet, Eric Dumazet, Tom Herbert,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Mon, Oct 26, 2015 at 2:35 PM, Andreas Petlund <apetlund@simula.no> wrote:
>
>
> > On 26 Oct 2015, at 15:50, Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
> > <bro.devel@gmail.com> wrote:
> >> @@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> > ...
> >> +       case TCP_RDB:
> >> +               if (val < 0 || val > 1) {
> >> +                       err = -EINVAL;
> >> +               } else {
> >> +                       tp->rdb = val;
> >> +                       tp->nonagle = val;
> >
> > The semantics of the tp->nonagle bits are already a bit complex. My
> > sense is that having a setsockopt of TCP_RDB transparently modify the
> > nagle behavior is going to add more extra complexity and unanticipated
> > behavior than is warranted given the slight possible gain in
> > convenience to the app writer. What about a model where the
> > application user just needs to remember to call
> > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> > sensible? I see your nice tests at
> >
> >   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b
> >
> > are already doing that. And my sense is that likewise most
> > well-engineered "thin stream" apps will already be using
> > setsockopt(TCP_NODELAY). Is that workable?
>
> We have been discussing this a bit back and forth. Your suggestion would be the right thing to keep the nagle semantics less complex and to educate developers in the intrinsics of the transport.
>
> We ended up choosing to implicitly disable Nagle since:
> 1) it is incompatible with the logic of RDB.
> 2) leaving it up to the developer to read the documentation and notice the line saying that "failing to set TCP_NODELAY will void the RDB latency gain" increases the chance of misconfigurations leading to deployments with no effect.
>
> The hope was to help both the well-engineered thin-stream apps and the ones deployed by developers with less detailed knowledge of the transport.
but would RDB be voided if this developer turns on RDB then turns on
Nagle later?

>
> -Andreas
>


* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-26 21:58       ` Yuchung Cheng
@ 2015-10-27 19:15         ` Jonas Markussen
  2015-10-29 22:53         ` Bendik Rønning Opstad
  1 sibling, 0 replies; 16+ messages in thread
From: Jonas Markussen @ 2015-10-27 19:15 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Andreas Petlund, Neal Cardwell, Bendik Rønning Opstad,
	David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Tom Herbert, Paolo Abeni, Erik Kline,
	Hannes Frederic Sowa, Al Viro, Jiri Pirko, Alexander Duyck,
	Florian Westphal, Daniel Lee, Marcelo Ricardo Leitner,
	Daniel Borkmann, Willem de Bruijn, Linus Lüssing, linux-doc,
	LKML, Netdev, linux-api, Carsten Griwodz, Pål Halvorsen,
	Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On 26 Oct 2015, at 22:58, Yuchung Cheng <ycheng@google.com> wrote:
> but would RDB be voided if this developer turns on RDB then turns on
> Nagle later?

The short answer is "kind of".

My understanding is that Nagle will delay segments until they are
MSS-sized or until all previously sent segments "down the pipe" have
been acknowledged.

As RDB isn't able to bundle if the payload is more than MSS/2, only an
application that sends data less frequently than once per RTT would
still, in theory, benefit from RDB even if Nagle is on.

However, in my opinion this is a scenario where Nagle itself is void:

If you transmit less often than once per RTT, enabling Nagle makes no
difference.

If you transmit more often than once per RTT, enabling Nagle makes
RDB void.
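
As an illustration only (a rough sketch of the eligibility idea, not
the actual patch code; the function and names here are made up):

	/* RDB can only bundle the previous segment if the old and the
	 * new payload together still fit in a single MSS, so with
	 * equal-sized writes the payload must be at most MSS/2.
	 */
	static bool rdb_can_bundle(u32 prev_payload, u32 new_payload, u32 mss)
	{
		return prev_payload + new_payload <= mss;
	}

With Nagle on, small writes are merged while unacked data is in
flight, so segments grow toward a full MSS and fail this check.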

-Jonas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-26 21:58       ` Yuchung Cheng
  2015-10-27 19:15         ` Jonas Markussen
@ 2015-10-29 22:53         ` Bendik Rønning Opstad
  2015-11-02  9:18           ` David Laight
  1 sibling, 1 reply; 16+ messages in thread
From: Bendik Rønning Opstad @ 2015-10-29 22:53 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Andreas Petlund, Neal Cardwell, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Jonathan Corbet, Eric Dumazet, Tom Herbert,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

On Monday, October 26, 2015 02:58:03 PM Yuchung Cheng wrote:
> On Mon, Oct 26, 2015 at 2:35 PM, Andreas Petlund <apetlund@simula.no> wrote:
> > > On 26 Oct 2015, at 15:50, Neal Cardwell <ncardwell@google.com> wrote:
> > > 
> > > On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
> > > 
> > > <bro.devel@gmail.com> wrote:
> > >> @@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> > > ...
> > > 
> > >> +       case TCP_RDB:
> > >> +               if (val < 0 || val > 1) {
> > >> +                       err = -EINVAL;
> > >> +               } else {
> > >> +                       tp->rdb = val;
> > >> +                       tp->nonagle = val;
> > > 
> > > The semantics of the tp->nonagle bits are already a bit complex. My
> > > sense is that having a setsockopt of TCP_RDB transparently modify the
> > > nagle behavior is going to add more extra complexity and unanticipated
> > > behavior than is warranted given the slight possible gain in
> > > convenience to the app writer. What about a model where the
> > > application user just needs to remember to call
> > > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> > > sensible? I see your nice tests at
> > > 
> > >   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b
> > > are already doing that. And my sense is that likewise most
> > > well-engineered "thin stream" apps will already be using
> > > setsockopt(TCP_NODELAY). Is that workable?

This is definitely workable. I agree that it may not be an ideal solution to
have TCP_RDB disable Nagle; however, it would be useful to have a way to
easily enable RDB and disable Nagle in one step.

> > We have been discussing this a bit back and forth. Your suggestion would
> > be the right approach for keeping the Nagle semantics less complex and
> > for educating developers in the intricacies of the transport.
> > 
> > We ended up choosing to implicitly disable Nagle since
> > 1) it is incompatible with the logic of RDB, and
> > 2) leaving it up to the developer to read the documentation and take
> > note of the line saying that "failing to set TCP_NODELAY will void the
> > RDB latency gain" would increase the chance of misconfigurations
> > leading to deployments with no effect.
> > 
> > The hope was to help both the well-engineered thin-stream apps and the
> > ones deployed by developers with less detailed knowledge of the
> > transport.
> but would RDB be voided if this developer turns on RDB then turns on
> Nagle later?

It would (to a large degree), but I believe that's ok? The intention with also
disabling Nagle is not to remove control from the application writer, so if
TCP_RDB disables Nagle, they should not be prevented from explicitly enabling
Nagle after enabling RDB.

The idea is to make it as easy as possible for the application writer, and
since Nagle is on by default, it makes sense to change this behavior when the
application has indicated that it values low latencies.

Would a solution with multiple option values to TCP_RDB be acceptable? E.g.
0 = Disable
1 = Enable RDB
2 = Enable RDB and disable Nagle

If the sysctl tcp_rdb accepts the same values, setting the sysctl to 2 would
make it possible to use and test RDB (with Nagle off) for applications that
haven't explicitly disabled Nagle, which would make the sysctl even more useful.

Instead of having TCP_RDB modify Nagle, would it be better/acceptable to have a
separate socket option (e.g. TCP_THIN/TCP_THIN_LOW_LATENCY) that enables RDB and
disables Nagle? e.g.
0 = Use default system options?
1 = Enable RDB and disable Nagle

This would separate the modification of Nagle from the TCP_RDB socket option and
make it cleaner?

Such an option could also enable other latency-reducing options like
TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK:
2 = Enable RDB, TCP_THIN_LINEAR_TIMEOUTS, TCP_THIN_DUPACK, and disable Nagle
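
For illustration, a minimal userspace sketch of how an application might
use the multi-value TCP_RDB proposed above (note: the value-2 semantics
are only the proposal in this mail, TCP_RDB exists only with the RFC
patch applied, and the numeric value below is a placeholder):

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	#ifndef TCP_RDB
	#define TCP_RDB 30	/* placeholder; actual value set by the patch */
	#endif

	/* Proposed semantics: 0 = off, 1 = RDB, 2 = RDB + disable Nagle */
	static int enable_rdb_low_latency(int sock)
	{
		int val = 2;

		return setsockopt(sock, IPPROTO_TCP, TCP_RDB,
				  &val, sizeof(val));
	}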

Bendik


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-29 22:53         ` Bendik Rønning Opstad
@ 2015-11-02  9:18           ` David Laight
  0 siblings, 0 replies; 16+ messages in thread
From: David Laight @ 2015-11-02  9:18 UTC (permalink / raw)
  To: 'bro.devel+kernel@gmail.com', Yuchung Cheng
  Cc: Andreas Petlund, Neal Cardwell, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Jonathan Corbet, Eric Dumazet, Tom Herbert,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, LKML, Netdev, linux-api,
	Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
	Kristian Evensen, Kenneth Klette Jonassen

From: Bendik Rønning Opstad
> Sent: 29 October 2015 22:54
...
> > > > The semantics of the tp->nonagle bits are already a bit complex. My
> > > > sense is that having a setsockopt of TCP_RDB transparently modify the
> > > > nagle behavior is going to add more extra complexity and unanticipated
> > > > behavior than is warranted given the slight possible gain in
> > > > convenience to the app writer. What about a model where the
> > > > application user just needs to remember to call
> > > > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> > > > sensible? I see your nice tests at
> > > >
> > > >   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b
> > > > are already doing that. And my sense is that likewise most
> > > > well-engineered "thin stream" apps will already be using
> > > > setsockopt(TCP_NODELAY). Is that workable?
> 
> This is definitely workable. I agree that it may not be an ideal solution to
> have TCP_RDB disable Nagle, however, it would be useful with a way to easily
> enable RDB and disable Nagle.

If enabling RDB disables Nagle, then what happens when you turn RDB back off?

	David


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-10-23 20:50 ` [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
  2015-10-26 14:50   ` Neal Cardwell
@ 2015-11-02  9:37   ` David Laight
  2015-11-05  2:06     ` Bendik Rønning Opstad
  1 sibling, 1 reply; 16+ messages in thread
From: David Laight @ 2015-11-02  9:37 UTC (permalink / raw)
  To: 'Bendik Rønning Opstad',
	David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet
  Cc: Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad


From: Bendik Rønning Opstad
> Sent: 23 October 2015 21:50
> RDB is a mechanism that enables a TCP sender to bundle redundant
> (already sent) data with TCP packets containing new data. By bundling
> (retransmitting) already sent data with each TCP packet containing new
> data, the connection will be more resistant to sporadic packet loss
> which reduces the application layer latency significantly in congested
> scenarios.

What sort of traffic flows do you expect this to help?

An ssh (or similar) connection will get additional data to send,
but that sort of data flow needs Nagle in order to reduce the
number of packets sent.
OTOH it might benefit from including unacked data if the Nagle
timer expires.
Being able to set the Nagle timer on a per-connection basis
(or maybe using something based on the RTT instead of 2 secs)
might make packet loss less problematic.

Data flows that already have Nagle disabled (probably anything that
isn't command-response and isn't unidirectional bulk data) are
likely to generate a lot of packets within the RTT.
Resending unacked data will just eat into available network bandwidth
and could easily make any congestion worse.

I think that means you shouldn't resend data more than once, and/or
should make sure that the resent data isn't a significant overhead
on the packet being sent.

	David



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
  2015-11-02  9:37   ` David Laight
@ 2015-11-05  2:06     ` Bendik Rønning Opstad
  0 siblings, 0 replies; 16+ messages in thread
From: Bendik Rønning Opstad @ 2015-11-05  2:06 UTC (permalink / raw)
  To: David Laight
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jonathan Corbet,
	Eric Dumazet, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	Paolo Abeni, Erik Kline, Hannes Frederic Sowa, Al Viro,
	Jiri Pirko, Alexander Duyck, Florian Westphal, Daniel Lee,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Linus Lüssing, linux-doc, linux-kernel, netdev, linux-api,
	Andreas Petlund, Carsten Griwodz, Pål Halvorsen,
	Jonas Markussen, Kristian Evensen, Kenneth Klette Jonassen,
	Bendik Rønning Opstad

On Monday, November 02, 2015 09:37:54 AM David Laight wrote:
> From: Bendik Rønning Opstad
> > Sent: 23 October 2015 21:50
> > RDB is a mechanism that enables a TCP sender to bundle redundant
> > (already sent) data with TCP packets containing new data. By bundling
> > (retransmitting) already sent data with each TCP packet containing new
> > data, the connection will be more resistant to sporadic packet loss
> > which reduces the application layer latency significantly in congested
> > scenarios.
> 
> What sort of traffic flows do you expect this to help?

As mentioned in the cover letter, RDB is aimed at reducing the
latencies for "thin-stream" traffic often produced by
latency-sensitive applications. This blog post describes RDB and the
underlying motivation:
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp

Further information is available in the links referred to in the blog
post.

> An ssh (or similar) connection will get additional data to send,
> but that sort of data flow needs Nagle in order to reduce the
> number of packets sent.

Whether an application needs to reduce the number of packets sent
depends on whom you ask. If low latency is a high priority for the
application, it may need to increase the number of packets sent by
disabling Nagle, reducing the segments' sojourn times on the sender
side.

As for SSH clients, it seems OpenSSH disables Nagle for interactive
sessions.

> OTOH it might benefit from including unacked data if the Nagle
> timer expires.
> Being able to set the Nagle timer on a per-connection basis
> (or maybe using something based on the RTT instead of 2 secs)
> might make packet loss less problematic.

There is no timer for Nagle? The current (Minshall variant)
implementation restricts sending a small segment as long as the
previously transmitted packet was small and is not yet ACKed.
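
For reference, the check in the kernel (net/ipv4/tcp_output.c) looks
roughly like this; snd_sml marks the end of the most recently sent
small segment:

	static bool tcp_minshall_check(const struct tcp_sock *tp)
	{
		return after(tp->snd_sml, tp->snd_una) &&
			!after(tp->snd_sml, tp->snd_nxt);
	}

That is, a new small segment is held back only while a previously sent
small segment is still unacknowledged; no timer is involved.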

> Data flows that already have Nagle disabled (probably anything that
> isn't command-response and isn't unidirectional bulk data) are
> likely to generate a lot of packets within the RTT.

How many packets such applications need to transmit for optimal
latency varies to a great extent. Packets per RTT is not a very useful
metric in this regard, considering the strict dependency on the RTT.

This is why we propose a dynamic packets in flight limit (DPIFL) that
indirectly relies on the application write frequency, i.e. how often
the application performs write system calls. This limit is used to
ensure that only applications that write data less frequently than a
certain threshold may utilize RDB.
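
A rough sketch of the idea (the names and the exact computation here
are made up for illustration; the patch derives the limit from a
sysctl-configurable minimum inter-transmission time and the measured
RTT):

	/* Hypothetical: a flow is considered "thin" while its packets
	 * in flight stay below RTT / min_itt, i.e. the application
	 * writes infrequently enough relative to the RTT.
	 */
	static bool dpifl_is_thin(u32 packets_in_flight, u32 srtt_us,
				  u32 min_itt_us)
	{
		u32 limit = srtt_us / min_itt_us; /* e.g. 100ms/10ms = 10 */

		return packets_in_flight < limit;
	}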

> Resending unacked data will just eat into available network bandwidth
> and could easily make any congestion worse.
>
> I think that means you shouldn't resend data more than once, and/or
> should make sure that the resent data isn't a significant overhead
> on the packet being sent.

It is important to remember what type of traffic flows we are
discussing. The applications RDB aims to help produce
application-limited flows that transmit small amounts of data, both in
terms of payload per packet and packets per second.

Analysis of traces from latency-sensitive applications producing
traffic with thin-stream characteristics shows inter-transmission
times ranging from a few ms (typically 20-30 ms on average) to many
hundreds of ms.
(http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp/#thin_streams)

Increasing the amount of transmitted data will certainly contribute to
congestion to some degree, but it is not (necessarily) an unreasonable
trade-off considering the relatively small amounts of data such
applications transmit compared to greedy flows.

RDB does not cause more packets to be sent through the network, as it
uses available "free" space in packets already scheduled for
transmission. With a bundling limit of only one previous segment, the
payload bandwidth requirement is at most doubled; accounting for
headers, the total increase is smaller.
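
For example, assuming 120-byte application writes and 40 bytes of
TCP/IP headers per packet:

	regular TCP:  40 + 120       = 160 bytes on the wire
	RDB:          40 + 120 + 120 = 280 bytes on the wire

i.e. a 1.75x increase in total, even though the payload is doubled.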

Even after increasing the BW requirement of an application that
produces relatively little data, we still end up with a low BW
requirement. The suggested lower bound on the inter-transmission time
is 10 ms, meaning that an application that writes data more frequently
than every 10 ms (on average) will not be allowed to utilize RDB.

To what degree RDB affects competing traffic will of course depend on
the link capacity and the number of simultaneous flows utilizing RDB.
We have performed tests to assess how RDB affects competing traffic. In
one of the test scenarios, 10 RDB-enabled thin streams and 10 regular
TCP thin streams compete against 5 greedy TCP flows over a shared
bottleneck limited to 5Mbit/s. The results from this test show that by
only bundling one previous segment with each packet (segment size: 120
bytes), the effect on the competing thin-stream traffic is modest.
(http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp/#latency_test_with_cross_traffic).

Also relevant to the discussion is the paper "Reducing web latency:
the virtue of gentle aggression" (2013), and one of the presented
mechanisms (called Proactive), which applies redundancy by transmitting
every packet twice. While doubling the bandwidth requirements, their
measurements show a negligible effect on the baseline traffic because,
as they explain, the traffic utilizing the mechanism (Web service
traffic in their case) is only a small fraction of the total traffic
passing through their servers.

While RDB and the Proactive mechanism have slightly different
approaches, they aim at solving the same basic problem: the increased
latencies caused by the need for normal retransmissions. By
proactively (re)transmitting redundant data they are able to avoid the
need for normal retransmissions to a great extent, which reduces
application layer latency by alleviating head-of-line blocking on the
receiver.

An important property of RDB is that by only using packets already
scheduled for transmission, a limit is naturally imposed when severe
congestion occurs. As soon as loss is detected, resulting in a
reduction of the CWND (i.e. the flow becomes network limited), new
data from the application will be appended to the SKB in the output
queue containing the newest (unsent) data. Depending on the rate at
which the application produces data and the level of congestion (the
size of the CWND), the new data from the application will eventually
fill up the SKBs such that skb->len >= MSS. The result is that no
"free" space is available for bundling redundant data, effectively
disabling RDB and falling back to behavior identical to regular TCP.
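
Expressed as a condition (illustrative only; the helper and names are
made up, not the patch code):

	/* Bundling is possible only while the queued SKB still has
	 * "free" space below the MSS; under congestion the SKBs fill
	 * up, this check fails, and RDB turns itself off.
	 */
	static bool rdb_has_free_space(const struct sk_buff *skb,
				       u32 prev_payload, u32 mss)
	{
		return skb->len + prev_payload <= mss;
	}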

Bendik


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-11-05  2:06 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-23 20:50 [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB) Bendik Rønning Opstad
2015-10-23 20:50 ` [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism Bendik Rønning Opstad
2015-10-23 21:44   ` Eric Dumazet
2015-10-25  5:56     ` Bendik Rønning Opstad
2015-10-23 20:50 ` [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB) Bendik Rønning Opstad
2015-10-26 14:50   ` Neal Cardwell
2015-10-26 21:35     ` Andreas Petlund
2015-10-26 21:58       ` Yuchung Cheng
2015-10-27 19:15         ` Jonas Markussen
2015-10-29 22:53         ` Bendik Rønning Opstad
2015-11-02  9:18           ` David Laight
2015-11-02  9:37   ` David Laight
2015-11-05  2:06     ` Bendik Rønning Opstad
2015-10-24  6:11 ` [PATCH RFC net-next 0/2] tcp: " Yuchung Cheng
2015-10-24  8:00   ` Jonas Markussen
2015-10-24 12:57     ` Eric Dumazet
