From: Yuchung Cheng
Subject: Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Date: Mon, 14 Mar 2016 14:54:47 -0700
To: Bendik Rønning Opstad
Cc: "David S. Miller", netdev, Eric Dumazet, Neal Cardwell,
 Andreas Petlund, Carsten Griwodz, Pål Halvorsen, Jonas Markussen,
 Kristian Evensen, Kenneth Klette Jonassen

On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad wrote:
>
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
>
> Latency-sensitive applications or services, such as online games,
> remote control systems, and VoIP, produce traffic with thin-stream
> characteristics, characterized by small packets and relatively high
> inter-transmission times (ITT). When experiencing packet loss, such
> latency-sensitive applications are heavily penalized by the need to
> retransmit lost packets, which increases the latency by a minimum of
> one RTT for the lost packet. Packets coming after a lost packet are
> held back due to head-of-line blocking, causing increased delays for
> all data segments until the lost packet has been retransmitted.
>
> RDB enables a TCP sender to bundle redundant (already sent) data with
> TCP packets containing small segments of new data. By resending
> un-ACKed data from the output queue in packets with new data, RDB
> reduces the need to retransmit data segments on connections
> experiencing sporadic packet loss. By avoiding a retransmit, RDB
> evades the latency increase of at least one RTT for the lost packet,
> as well as alleviating head-of-line blocking for the packets following
> the lost packet. This makes the TCP connection more resistant to
> latency fluctuations, and reduces the application layer latency
> significantly in lossy environments.
>
> Main functionality added:
>
>   o When a packet is scheduled for transmission, RDB builds and
>     transmits a new SKB containing both the unsent data as well as
>     data of previously sent packets from the TCP output queue.
>
>   o RDB will only be used for streams classified as thin by the
>     function tcp_stream_is_thin_dpifl(). This enforces a lower bound
>     on the ITT for streams that may benefit from RDB, controlled by
>     the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.
>
>   o Loss detection of hidden loss events: When bundling redundant data
>     with each packet, packet loss can be hidden from the TCP engine due
>     to lack of dupACKs. This is because the loss is "repaired" by the
>     redundant data in the packet coming after the lost packet. Based on
>     incoming ACKs, such hidden loss events are detected, and CWR state
>     is entered.
>
> RDB can be enabled on a connection with the socket option TCP_RDB, or
> on all new connections by setting the sysctl variable
> net.ipv4.tcp_rdb=1
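
For readers who want to try this, a minimal application-side sketch of
the opt-in path. This is not part of the patch; it assumes the TCP_RDB
option value (29) from the uapi hunk below is exported to userspace:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_RDB
    #define TCP_RDB 29  /* from include/uapi/linux/tcp.h in this patch */
    #endif

    /* Opt a connected TCP socket in to Redundant Data Bundling. */
    static int enable_rdb(int sock)
    {
            int one = 1;

            return setsockopt(sock, IPPROTO_TCP, TCP_RDB, &one, sizeof(one));
    }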
>
> Cc: Andreas Petlund
> Cc: Carsten Griwodz
> Cc: Pål Halvorsen
> Cc: Jonas Markussen
> Cc: Kristian Evensen
> Cc: Kenneth Klette Jonassen
> Signed-off-by: Bendik Rønning Opstad
> ---
>  Documentation/networking/ip-sysctl.txt |  15 +++
>  include/linux/skbuff.h                 |   1 +
>  include/linux/tcp.h                    |   3 +-
>  include/net/tcp.h                      |  15 +++
>  include/uapi/linux/tcp.h               |   1 +
>  net/core/skbuff.c                      |   2 +-
>  net/ipv4/Makefile                      |   3 +-
>  net/ipv4/sysctl_net_ipv4.c             |  25 ++++
>  net/ipv4/tcp.c                         |  14 +-
>  net/ipv4/tcp_input.c                   |   3 +
>  net/ipv4/tcp_output.c                  |  48 ++++---
>  net/ipv4/tcp_rdb.c                     | 228 +++++++++++++++++++++++++++++
>  12 files changed, 335 insertions(+), 23 deletions(-)
>  create mode 100644 net/ipv4/tcp_rdb.c
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 6a92b15..8f3f3bf 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
>         calculated, which is used to classify whether a stream is thin.
>         Default: 10000
>
> +tcp_rdb - BOOLEAN
> +       Enable RDB for all new TCP connections.

Please describe RDB briefly, perhaps with a pointer to your paper.

I suggest having three levels of control:

  0: disable RDB completely
  1: enable individual thin-stream connections to use RDB via the
     TCP_RDB socket option
  2: enable RDB on all thin-stream connections by default

Currently the patch only provides modes 1 and 2, but there may be
cases where the administrator wants to disallow RDB entirely (e.g.,
broken middle-boxes).

> +       Default: 0
> +
> +tcp_rdb_max_bytes - INTEGER
> +       Enable restriction on how many bytes an RDB packet can contain.
> +       This is the total amount of payload including the new unsent data.
> +       Default: 0
> +
> +tcp_rdb_max_packets - INTEGER
> +       Enable restriction on how many previous packets in the output queue
> +       RDB may include data from. A value of 1 will restrict bundling to
> +       only the data from the last packet that was sent.
> +       Default: 1

Why two knobs for the redundancy level? It also seems better to let an
individual socket select its redundancy level (e.g., setsockopt
TCP_RDB=3 meaning at most 3 packets per bundle) than to rely on a
global setting. This requires more bits in tcp_sock, but 2-3 more
should suffice. (See the sketch below.)
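
Putting those two suggestions together, a sketch of what the setsockopt
path could look like. The helper name tcp_set_rdb_level(), the
tp->rdb_level field, and the TCP_RDB_LEVEL_MAX cap are all hypothetical,
not part of this patch:

    /* net.ipv4.tcp_rdb: 0 = RDB administratively disabled,
     *                   1 = per-socket opt-in via TCP_RDB,
     *                   2 = enabled by default for thin streams.
     */
    static int tcp_set_rdb_level(struct tcp_sock *tp, int val)
    {
            if (!sysctl_tcp_rdb)
                    return -EPERM;  /* disabled by the administrator */
            if (val < 0 || val > TCP_RDB_LEVEL_MAX)
                    return -EINVAL;
            tp->rdb_level = val;    /* bundle at most val previous pkts */
            return 0;
    }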
> +
>  tcp_limit_output_bytes - INTEGER
>         Controls TCP Small Queue limit per tcp socket.
>         TCP bulk sender tends to increase packets in flight until it
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 797cefb..0f2c9d1 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2927,6 +2927,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
>  void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
>  void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
>  int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
>  int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
>  int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
>  __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index bcbf51d..c84de15 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -207,9 +207,10 @@ struct tcp_sock {
>         } rack;
>         u16     advmss;         /* Advertised MSS                       */
>         u8      unused;
> -       u8      nonagle : 4,    /* Disable Nagle algorithm?             */
> +       u8      nonagle : 3,    /* Disable Nagle algorithm?             */
>                 thin_lto : 1,   /* Use linear timeouts for thin streams */
>                 thin_dupack : 1,/* Fast retransmit on first dupack      */
> +               rdb : 1,        /* Redundant Data Bundling enabled      */
>                 repair : 1,
>                 frto : 1;       /* F-RTO (RFC5682) activated in CA_Loss */
>         u8      repair_queue;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d38eae9..2d42f4a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -267,6 +267,9 @@ extern int sysctl_tcp_slow_start_after_idle;
>  extern int sysctl_tcp_thin_linear_timeouts;
>  extern int sysctl_tcp_thin_dupack;
>  extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
> +extern int sysctl_tcp_rdb;
> +extern int sysctl_tcp_rdb_max_bytes;
> +extern int sysctl_tcp_rdb_max_packets;
>  extern int sysctl_tcp_early_retrans;
>  extern int sysctl_tcp_limit_output_bytes;
>  extern int sysctl_tcp_challenge_ack_limit;
> @@ -539,6 +542,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
>  bool tcp_may_send_now(struct sock *sk);
>  int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
>  int tcp_retransmit_skb(struct sock *, struct sk_buff *);
> +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> +                    gfp_t gfp_mask);
>  void tcp_retransmit_timer(struct sock *sk);
>  void tcp_xmit_retransmit_queue(struct sock *);
>  void tcp_simple_retransmit(struct sock *);
> @@ -556,6 +561,7 @@ void tcp_send_ack(struct sock *sk);
>  void tcp_send_delayed_ack(struct sock *sk);
>  void tcp_send_loss_probe(struct sock *sk);
>  bool tcp_schedule_loss_probe(struct sock *sk);
> +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb);
>
>  /* tcp_input.c */
>  void tcp_resume_early_retransmit(struct sock *sk);
> @@ -565,6 +571,11 @@ void tcp_reset(struct sock *sk);
>  void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
>  void tcp_fin(struct sock *sk);
>
> +/* tcp_rdb.c */
> +void tcp_rdb_ack_event(struct sock *sk, u32 flags);
> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
> +                        unsigned int mss_now, gfp_t gfp_mask);
> +
>  /* tcp_timer.c */
>  void tcp_init_xmit_timers(struct sock *);
>  static inline void tcp_clear_xmit_timers(struct sock *sk)
> @@ -763,6 +774,7 @@ struct tcp_skb_cb {
>         union {
>                 struct {
>                         /* There is space for up to 20 bytes */
> +                       __u32 rdb_start_seq; /* Start seq of rdb data */
>                 } tx;   /* only used for outgoing skbs */
>                 union {
>                         struct inet_skb_parm    h4;
> @@ -1497,6 +1509,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
>         skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>
> +#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)            \
> +       skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> +
>  static inline struct sk_buff *tcp_send_head(const struct sock *sk)
>  {
>         return sk->sk_send_head;
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index fe95446..6799875 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -115,6 +115,7 @@ enum {
>  #define TCP_CC_INFO            26      /* Get Congestion Control (optional) info */
>  #define TCP_SAVE_SYN           27      /* Record SYN headers for new connections */
>  #define TCP_SAVED_SYN          28      /* Get SYN headers recorded for connection */
> +#define TCP_RDB                29      /* Enable Redundant Data Bundling mechanism */
>
>  struct tcp_repair_opt {
>         __u32   opt_code;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7af7ec6..50bc5b0 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1055,7 +1055,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
>                 skb->inner_mac_header += off;
>  }
>
> -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>  {
>         __copy_skb_header(new, old);
>
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index bfa1336..459048c 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -12,7 +12,8 @@ obj-y := route.o inetpeer.o protocol.o \
>              tcp_offload.o datagram.o raw.o udp.o udplite.o \
>              udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
>              fib_frontend.o fib_semantics.o fib_trie.o \
> -            inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
> +            inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
> +            tcp_rdb.o
>
>  obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
>  obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index f04320a..43b4390 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -581,6 +581,31 @@ static struct ctl_table ipv4_table[] = {
>                 .extra1         = &tcp_thin_dpifl_itt_lower_bound_min,
>         },
>         {
> +               .procname       = "tcp_rdb",
> +               .data           = &sysctl_tcp_rdb,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +               .extra2         = &one,
> +       },
> +       {
> +               .procname       = "tcp_rdb_max_bytes",
> +               .data           = &sysctl_tcp_rdb_max_bytes,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +       },
> +       {
> +               .procname       = "tcp_rdb_max_packets",
> +               .data           = &sysctl_tcp_rdb_max_packets,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +       },
> +       {
>                 .procname       = "tcp_early_retrans",
>                 .data           = &sysctl_tcp_early_retrans,
>                 .maxlen         = sizeof(int),
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 8421f3d..b53d4cb 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -288,6 +288,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
>
>  int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
>
> +int sysctl_tcp_rdb __read_mostly;
> +
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>
> @@ -407,6 +409,7 @@ void tcp_init_sock(struct sock *sk)
>         u64_stats_init(&tp->syncp);
>
>         tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
> +       tp->rdb = sysctl_tcp_rdb;
>         tcp_enable_early_retrans(tp);
>         tcp_assign_congestion_control(sk);
>
> @@ -2412,6 +2415,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>                 }
>                 break;
>
> +       case TCP_RDB:
> +               if (val < 0 || val > 1)
> +                       err = -EINVAL;
> +               else
> +                       tp->rdb = val;
> +               break;
> +
>         case TCP_REPAIR:
>                 if (!tcp_can_repair_sock(sk))
>                         err = -EPERM;
> @@ -2842,7 +2852,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>         case TCP_THIN_DUPACK:
>                 val = tp->thin_dupack;
>                 break;
> -
> +       case TCP_RDB:
> +               val = tp->rdb;
> +               break;
>         case TCP_REPAIR:
>                 val = tp->repair;
>                 break;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e6e65f7..7b52ce4 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3537,6 +3537,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
>
>         if (icsk->icsk_ca_ops->in_ack_event)
>                 icsk->icsk_ca_ops->in_ack_event(sk, flags);
> +
> +       if (unlikely(tcp_sk(sk)->rdb))
> +               tcp_rdb_ack_event(sk, flags);
>  }
>
>  /* Congestion control has updated the cwnd already. So if we're in
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 7d2c7a4..6f92fae 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -897,8 +897,8 @@ out:
>   * We are working here with either a clone of the original
>   * SKB, or a fresh unique copy made by the retransmit engine.
>   */
> -static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> -                           gfp_t gfp_mask)
> +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> +                    gfp_t gfp_mask)
>  {
>         const struct inet_connection_sock *icsk = inet_csk(sk);
>         struct inet_sock *inet;
> @@ -2110,9 +2110,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                         break;
>                 }
>
> -               if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> +               if (unlikely(tcp_sk(sk)->rdb)) {
> +                       if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
> +                               break;
> +               } else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
>                         break;
> -
> +               }
>  repair:
>                 /* Advance the send_head. This one is sent out.
>                  * This call will increment packets_out.
> @@ -2439,15 +2442,32 @@ u32 __tcp_select_window(struct sock *sk)
>         return window;
>  }
>
> +/**
> + * tcp_skb_append_data() - copy the linear data from an SKB to the end
> + *                         of another and update end sequence number
> + *                         and checksum
> + * @from_skb: the SKB to copy data from
> + * @to_skb: the SKB to copy data to
> + */
> +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
> +{
> +       skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
> +                                 from_skb->len);
> +       TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
> +
> +       if (from_skb->ip_summed == CHECKSUM_PARTIAL)
> +               to_skb->ip_summed = CHECKSUM_PARTIAL;
> +
> +       if (to_skb->ip_summed != CHECKSUM_PARTIAL)
> +               to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
> +                                             to_skb->len);
> +}
> +
>  /* Collapses two adjacent SKB's during retransmission. */
>  static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
> -       int skb_size, next_skb_size;
> -
> -       skb_size = skb->len;
> -       next_skb_size = next_skb->len;
>
>         BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
>
> @@ -2455,17 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>
>         tcp_unlink_write_queue(next_skb, sk);
>
> -       skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
> -                                 next_skb_size);
> -
> -       if (next_skb->ip_summed == CHECKSUM_PARTIAL)
> -               skb->ip_summed = CHECKSUM_PARTIAL;
> -
> -       if (skb->ip_summed != CHECKSUM_PARTIAL)
> -               skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
> -
> -       /* Update sequence range on original skb. */
> -       TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
> +       tcp_skb_append_data(next_skb, skb);
>
>         /* Merge over control information. This moves PSH/FIN etc. over */
>         TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
> diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
> new file mode 100644
> index 0000000..2b37957
> --- /dev/null
> +++ b/net/ipv4/tcp_rdb.c
> @@ -0,0 +1,228 @@
> +#include <linux/skbuff.h>
> +#include <net/tcp.h>
> +
> +int sysctl_tcp_rdb_max_bytes __read_mostly;
> +int sysctl_tcp_rdb_max_packets __read_mostly = 1;
> +
> +/**
> + * rdb_detect_loss() - perform RDB loss detection by analysing ACKs
> + * @sk: socket
> + *
> + * Traverse the output queue and check if the ACKed packet is an RDB
> + * packet and if the redundant data covers one or more un-ACKed SKBs.
> + * If the incoming ACK acknowledges multiple SKBs, we can presume
> + * packet loss has occurred.
> + *
> + * We can infer packet loss this way because we can expect one ACK per
> + * transmitted data packet, as delayed ACKs are disabled when a host
> + * receives packets where the sequence number is not the expected
> + * sequence number.
> + *
> + * Return: The number of packets that are presumed to be lost
> + */
> +static unsigned int rdb_detect_loss(struct sock *sk)
> +{
> +       struct sk_buff *skb, *tmp;
> +       struct tcp_skb_cb *scb;
> +       u32 seq_acked = tcp_sk(sk)->snd_una;
> +       unsigned int packets_lost = 0;
> +
> +       tcp_for_write_queue(skb, sk) {
> +               if (skb == tcp_send_head(sk))
> +                       break;
> +
> +               scb = TCP_SKB_CB(skb);
> +               /* The ACK acknowledges parts of the data in this SKB.
> +                * Can be caused by:
> +                * - TSO: We abort as RDB is not used on SKBs split across
> +                *        multiple packets on lower layers as these are greater
> +                *        than one MSS.
> +                * - Retrans collapse: We've had a retrans, so loss has already
> +                *                     been detected.
> +                */
> +               if (after(scb->end_seq, seq_acked))
> +                       break;
> +               else if (scb->end_seq != seq_acked)
> +                       continue;
> +
> +               /* We have found the ACKed packet */
> +
> +               /* This packet was sent with no redundant data, or no prior
> +                * un-ACKed SKBs are in the output queue, so break here.
> +                */
> +               if (scb->tx.rdb_start_seq == scb->seq ||
> +                   skb_queue_is_first(&sk->sk_write_queue, skb))
> +                       break;
> +               /* Find the number of prior SKBs whose data was bundled in this
> +                * (ACKed) SKB. We presume any redundant data covering previous
> +                * SKBs is due to loss. (An exception would be reordering.)
> +                */
> +               skb = skb->prev;
> +               tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
> +                       if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
> +                               break;
> +                       packets_lost++;

Since we only care whether or not there was packet loss, we can return
early here?
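
Something along these lines — a sketch of the tail of rdb_detect_loss()
with the function reduced to returning bool (the only caller just
truth-tests the result); it uses only constructs already in this patch:

                /* ... */
                skb = skb->prev;
                tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
                        if (before(TCP_SKB_CB(skb)->seq,
                                   scb->tx.rdb_start_seq))
                                break;
                        /* The redundant data covered at least one prior
                         * un-ACKed SKB: presume loss, no need to count.
                         */
                        return true;
                }
                break;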
> +               }
> +               break;
> +       }
> +       return packets_lost;
> +}
> +
> +/**
> + * tcp_rdb_ack_event() - initiate RDB loss detection
> + * @sk: socket
> + * @flags: flags
> + */
> +void tcp_rdb_ack_event(struct sock *sk, u32 flags)

The flags argument is not used; either use it or drop it from the
signature.

> +{
> +       if (rdb_detect_loss(sk))
> +               tcp_enter_cwr(sk);
> +}
> +
> +/**
> + * rdb_build_skb() - build a new RDB SKB and copy redundant + unsent
> + *                   data to the linear page buffer
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission in the output engine
> + * @first_skb: the first SKB in the output queue to be bundled
> + * @bytes_in_rdb_skb: the total number of data bytes for the new
> + *                    rdb_skb (NEW + Redundant)
> + * @gfp_mask: gfp_t allocation
> + *
> + * Return: A new SKB containing redundant data, or NULL if memory
> + *         allocation failed
> + */
> +static struct sk_buff *rdb_build_skb(const struct sock *sk,
> +                                    struct sk_buff *xmit_skb,
> +                                    struct sk_buff *first_skb,
> +                                    u32 bytes_in_rdb_skb,
> +                                    gfp_t gfp_mask)
> +{
> +       struct sk_buff *rdb_skb, *tmp_skb = first_skb;
> +
> +       rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
> +                                     (int)bytes_in_rdb_skb,
> +                                     gfp_mask, false);
> +       if (!rdb_skb)
> +               return NULL;
> +       copy_skb_header(rdb_skb, xmit_skb);
> +       rdb_skb->ip_summed = xmit_skb->ip_summed;
> +       TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
> +
> +       /* Start on first_skb and append payload from each SKB in the output
> +        * queue onto rdb_skb until we reach xmit_skb.
> +        */
> +       tcp_for_write_queue_from(tmp_skb, sk) {
> +               tcp_skb_append_data(tmp_skb, rdb_skb);
> +
> +               /* We reached xmit_skb, containing the unsent data */
> +               if (tmp_skb == xmit_skb)
> +                       break;
> +       }
> +       return rdb_skb;
> +}
> +
> +/**
> + * rdb_can_bundle_test() - test if redundant data can be bundled
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission by the output engine
> + * @max_payload: the maximum allowed payload bytes for the RDB SKB
> + * @bytes_in_rdb_skb: store the total number of payload bytes in the
> + *                    RDB SKB if bundling can be performed
> + *
> + * Traverse the output queue and check if any un-acked data may be
> + * bundled.
> + *
> + * Return: The first SKB to be in the bundle, or NULL if no bundling
> + */
> +static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
> +                                          struct sk_buff *xmit_skb,
> +                                          unsigned int max_payload,
> +                                          u32 *bytes_in_rdb_skb)
> +{
> +       struct sk_buff *first_to_bundle = NULL;
> +       struct sk_buff *tmp, *skb = xmit_skb->prev;
> +       u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */
> +       u32 total_payload = xmit_skb->len;
> +
> +       if (sysctl_tcp_rdb_max_bytes)
> +               max_payload = min_t(unsigned int, max_payload,
> +                                   sysctl_tcp_rdb_max_bytes);
> +
> +       /* We start at xmit_skb->prev, and go backwards */
> +       tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
> +               /* Including data from this SKB would exceed payload limit */
> +               if ((total_payload + skb->len) > max_payload)
> +                       break;
> +
> +               if (sysctl_tcp_rdb_max_packets &&
> +                   (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets))
> +                       break;
> +
> +               total_payload += skb->len;
> +               skbs_in_bundle_count++;
> +               first_to_bundle = skb;
> +       }
> +       *bytes_in_rdb_skb = total_payload;
> +       return first_to_bundle;
> +}
> +
> +/**
> + * tcp_transmit_rdb_skb() - try to create and send an RDB packet
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission by the output engine
> + * @mss_now: current mss value
> + * @gfp_mask: gfp_t allocation
> + *
> + * If an RDB packet could not be created and sent, transmit the
> + * original unmodified SKB (xmit_skb).
> + *
> + * Return: 0 if successfully sent packet, else error from
> + *         tcp_transmit_skb
> + */
> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
> +                        unsigned int mss_now, gfp_t gfp_mask)
> +{
> +       struct sk_buff *rdb_skb = NULL;
> +       struct sk_buff *first_to_bundle;
> +       u32 bytes_in_rdb_skb = 0;
> +
> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
> +
> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))

During loss recovery, the amount of data in flight fluctuates, and this
check is likely to pass even for non-thin-stream connections. Since the
loss has already occurred, RDB can only take advantage of limited
transmit, which it likely does not have (because it is a thin stream).
It might be worth also checking that the CA state is Open.
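
i.e., in addition to the thin-stream test, something like this — the
TCP_CA_Open check is the suggestion above, not part of the patch:

        if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)) ||
            inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
                goto xmit_default;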
> +               goto xmit_default;
> +
> +       /* No bundling if first in queue, or on FIN packet */
> +       if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
> +           (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))

It seems there would still be a benefit to bundling on packets up to
and including the FIN?

> +               goto xmit_default;
> +
> +       /* Find number of (previous) SKBs to get data from */
> +       first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
> +                                             &bytes_in_rdb_skb);
> +       if (!first_to_bundle)
> +               goto xmit_default;
> +
> +       /* Create an SKB that contains redundant data starting from
> +        * first_to_bundle.
> +        */
> +       rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle,
> +                               bytes_in_rdb_skb, gfp_mask);
> +       if (!rdb_skb)
> +               goto xmit_default;
> +
> +       /* Set skb_mstamp for the SKB in the output queue (xmit_skb) containing
> +        * the yet unsent data. Normally this would be done by
> +        * tcp_transmit_skb(), but as we pass in rdb_skb instead, xmit_skb's
> +        * timestamp will not be touched.
> +        */
> +       skb_mstamp_get(&xmit_skb->skb_mstamp);
> +       rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
> +       return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
> +
> +xmit_default:
> +       /* Transmit the unmodified SKB from output queue */
> +       return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
> +}
> --
> 1.9.1
>

Since RDB will cause DSACKs, and we blindly count DSACKs to perform
cwnd undo, how does RDB handle these false positives?
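
To make the concern concrete, a purely hypothetical helper — not part
of the patch. Since each RDB transmission records where its redundant
data starts (tx.rdb_start_seq), the ACK path could in principle
classify a DSACK that only covers re-bundled bytes as an expected
duplicate and keep it away from the undo accounting:

        /* Hypothetical: true if the DSACK range lies entirely within
         * the redundant bytes [tx.rdb_start_seq, seq) bundled in front
         * of this skb's new data, i.e. a duplicate RDB caused on
         * purpose rather than evidence of a spurious retransmission.
         */
        static bool dsack_matches_rdb_bundle(const struct sk_buff *skb,
                                             u32 dsack_start, u32 dsack_end)
        {
                const struct tcp_skb_cb *scb = TCP_SKB_CB(skb);

                return scb->tx.rdb_start_seq != scb->seq &&
                       !before(dsack_start, scb->tx.rdb_start_seq) &&
                       !after(dsack_end, scb->seq);
        }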