From: Nandita Dukkipati
Subject: Re: [PATCH] Proportional Rate Reduction for TCP.
Date: Fri, 19 Aug 2011 00:34:21 -0700
To: Ilpo Järvinen
Cc: "David S. Miller", Netdev, Tom Herbert, Yuchung Cheng, Matt Mathis

On Fri, Aug 12, 2011 at 4:36 AM, Ilpo Järvinen wrote:
>
> > This patch implements Proportional Rate Reduction (PRR) for TCP.
> > PRR is an algorithm that determines TCP's sending rate in fast
> > recovery. PRR avoids excessive window reductions and aims for
> > the actual congestion window size at the end of recovery to be as
> > close as possible to the window determined by the congestion control
> > algorithm. PRR also improves the accuracy of the amount of data sent
> > during loss recovery.
> >
> > The patch implements the recommended flavor of PRR called PRR-SSRB
> > (Proportional rate reduction with slow start reduction bound) and
> > replaces the existing rate halving algorithm. PRR improves upon the
> > existing Linux fast recovery under a number of conditions, including:
> >   1) burst losses, where the losses implicitly reduce the amount of
> >   outstanding data (pipe) below the ssthresh value selected by the
> >   congestion control algorithm, and
> >   2) losses near the end of short flows, where the application runs
> >   out of data to send.
> >
> > As an example, with the existing rate halving implementation a single
> > loss event can cause a connection carrying short Web transactions to
> > go into slow start mode after the recovery. This is because during
> > recovery Linux pulls the congestion window down to packets_in_flight+1
> > on every ACK. A short Web response often runs out of new data to send,
> > and its pipe reduces to zero by the end of recovery when all its packets
> > are drained from the network. Subsequent HTTP responses using the same
> > connection will have to slow start to raise cwnd to ssthresh. PRR, on
> > the other hand, aims for the cwnd to be as close as possible to ssthresh
> > by the end of recovery.
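To make the contrast concrete, here is a minimal sketch of the two
end-of-recovery policies (illustrative user-space C with made-up names,
not taken from the patch):

/* Rate halving as described above: every ACK in recovery clamps cwnd
 * to in-flight + 1, so a short flow whose pipe drains to zero leaves
 * recovery with cwnd = 1 and must slow start again. */
static unsigned int rate_halving_cwnd(unsigned int packets_in_flight)
{
	return packets_in_flight + 1;
}

/* PRR instead steers cwnd toward ssthresh by the end of recovery,
 * independently of how far the pipe has drained. */
static unsigned int prr_recovery_exit_cwnd(unsigned int snd_ssthresh)
{
	return snd_ssthresh;
}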
> >
> > A description of PRR and a discussion of its performance can be found at
> > the following links:
> > - IETF Draft:
> >     http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
> > - IETF Slides:
> >     http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
> >     http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
> > - Paper to appear in Internet Measurements Conference (IMC) 2011:
> >     Improving TCP Loss Recovery
> >     Nandita Dukkipati, Matt Mathis, Yuchung Cheng
> >
> > Signed-off-by: Nandita Dukkipati
> > ---
> >  include/linux/tcp.h   |    4 +++
> >  net/ipv4/tcp_input.c  |   63 +++++++++++++++++++++++++++++++++++++++++++----
> >  net/ipv4/tcp_output.c |    7 ++++-
> >  3 files changed, 67 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index 531ede8..dda4f2e1 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -379,6 +379,10 @@ struct tcp_sock {
> >       u32     snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
> >       u32     snd_cwnd_used;
> >       u32     snd_cwnd_stamp;
> > +     u32     prr_cwnd;       /* Congestion window at start of Recovery. */
> > +     u32     prr_delivered;  /* Number of newly delivered packets to
> > +                              * receiver in Recovery. */
> > +     u32     prr_out;        /* Total number of pkts sent during Recovery. */
> >
> >       u32     rcv_wnd;        /* Current receiver window              */
> >       u32     write_seq;      /* Tail(+1) of data held in tcp send buffer */
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index ea0d218..601eff0 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
> >  static inline void tcp_complete_cwr(struct sock *sk)
> >  {
> >       struct tcp_sock *tp = tcp_sk(sk);
> > -     /* Do not moderate cwnd if it's already undone in cwr or recovery */
> > -     if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
> > -             tp->snd_cwnd = tp->snd_ssthresh;
> > +
> > +     /* Do not moderate cwnd if it's already undone in cwr or recovery. */
> > +     if (tp->undo_marker) {
> > +
> > +             if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
> > +                     tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
> > +             else /* PRR */
> > +                     tp->snd_cwnd = tp->snd_ssthresh;
> >               tp->snd_cwnd_stamp = tcp_time_stamp;
> >       }
> >       tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
> > @@ -2950,6 +2955,39 @@ void tcp_simple_retransmit(struct sock *sk)
> >  }
> >  EXPORT_SYMBOL(tcp_simple_retransmit);
> >
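The completion logic in the hunk above can be summarized as follows (a
simplified user-space sketch with made-up names, not the kernel code): in
CWR the window is only moderated downward, while on the PRR path recovery
is expected to finish with cwnd near ssthresh, so it can be set outright.

#include <stdint.h>

static uint32_t complete_cwr_cwnd(int ca_state_is_cwr, uint32_t snd_cwnd,
				  uint32_t snd_ssthresh)
{
	if (ca_state_is_cwr)	/* CWR: only moderate downward */
		return snd_cwnd < snd_ssthresh ? snd_cwnd : snd_ssthresh;
	return snd_ssthresh;	/* PRR: recovery ends at ssthresh */
}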
> > +/* This function implements the PRR algorithm, specifically the PRR-SSRB
> > + * (proportional rate reduction with slow start reduction bound) as described in
> > + * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
> > + * It computes the number of packets to send (sndcnt) based on packets newly
> > + * delivered:
> > + *   1) If the number of packets in flight is larger than ssthresh, PRR
> > + *   spreads the cwnd reduction across a full RTT.
> > + *   2) If the number of packets in flight is lower than ssthresh (such as
> > + *   due to excess losses and/or application stalls), do not perform any
> > + *   further cwnd reductions, but instead slow start up to ssthresh.
> > + */
> > +static void tcp_update_cwnd_in_recovery(struct sock *sk, int pkts_delivered,
> > +                                       int fast_rexmit, int flag)
> > +{
> > +     struct tcp_sock *tp = tcp_sk(sk);
> > +     int sndcnt = 0;
> > +     int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
> > +
> > +     if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
> > +             if (WARN_ON(!tp->prr_cwnd))
> > +                     tp->prr_cwnd = 1;
>
> Thanks, this made me smile when I realized what kind of positive feedback
> loop it would cause in a heavily congested environment :-). ...Perhaps any
> value below 2 * tp->snd_ssthresh isn't that safe a safety relief valve here.
> ...Of course this condition is never hit with the current code base, so
> no real harm is done regardless of the value.
>
> > +             sndcnt = DIV_ROUND_UP(tp->prr_delivered * tp->snd_ssthresh,
> > +                                   tp->prr_cwnd) - tp->prr_out;
>
> What about very large windows ...overflow?

Addressed in patch v2.

> Due to not having a lower bound here, the resulting window can end up below
> snd_ssthresh in one step if prr_delivered suddenly jumps to a value that
> is above prr_cwnd?
>
> Also, snd_ssthresh and prr_cwnd are essentially constants over the
> whole recovery, and yet a divide is needed the whole time. ...And in
> practically all cases the proportion will be 0.5 (except for the
> off-by-one produced by odd numbers)?
>
> > +     } else {
> > +             sndcnt = min_t(int, delta,
> > +                            max_t(int, tp->prr_delivered - tp->prr_out,
> > +                                  pkts_delivered) + 1);
> > +     }
> > +
> > +     sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
> > +     tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
> > +}
> > +
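On the overflow point: one possible fix, shown here as a user-space sketch
of the same arithmetic (the actual v2 code may differ), is to widen the
prr_delivered * snd_ssthresh product to 64 bits. DIV_ROUND_UP(x, y) is
open-coded as (x + y - 1) / y, and the caller would set
snd_cwnd = in_flight + sndcnt as above.

#include <stdint.h>

static int prr_sndcnt(uint32_t prr_delivered, uint32_t snd_ssthresh,
		      uint32_t prr_cwnd, uint32_t prr_out,
		      uint32_t in_flight, int pkts_delivered,
		      int fast_rexmit)
{
	int sndcnt;

	if (in_flight > snd_ssthresh) {
		/* 64-bit product so very large windows cannot overflow. */
		uint64_t num = (uint64_t)prr_delivered * snd_ssthresh;

		sndcnt = (int)((num + prr_cwnd - 1) / prr_cwnd) -
			 (int)prr_out;
	} else {
		/* Slow start reduction bound: no further reduction, grow
		 * back toward ssthresh instead. */
		int delta = (int)snd_ssthresh - (int)in_flight;
		int inc = (int)(prr_delivered - prr_out) > pkts_delivered ?
			  (int)(prr_delivered - prr_out) : pkts_delivered;

		sndcnt = delta < inc + 1 ? delta : inc + 1;
	}

	/* Always allow the fast retransmit itself to go out. */
	return sndcnt > (fast_rexmit ? 1 : 0) ? sndcnt
					      : (fast_rexmit ? 1 : 0);
}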
> >  /* Process an event, which can update packets-in-flight not trivially.
> >   * Main goal of this function is to calculate new estimate for left_out,
> >   * taking into account both packets sitting in receiver's buffer and
> > @@ -2961,7 +2999,8 @@ EXPORT_SYMBOL(tcp_simple_retransmit);
> >   * It does _not_ decide what to send, it is made in function
> >   * tcp_xmit_retransmit_queue().
> >   */
> > -static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
> > +static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
> > +                                 int pkts_delivered, int flag)
> >  {
> >       struct inet_connection_sock *icsk = inet_csk(sk);
> >       struct tcp_sock *tp = tcp_sk(sk);
> > @@ -3111,13 +3150,17 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
> >
> >               tp->bytes_acked = 0;
> >               tp->snd_cwnd_cnt = 0;
> > +             tp->prr_cwnd = tp->snd_cwnd;
>
> prior_cwnd would make this variable about 1000 times easier to read
> without requiring some background.

Patch v2 changes this to prior_cwnd.

> While at it, I realized that there is some restore-window place in the
> generic code that assumes "prior_cwnd == 2 * tp->snd_ssthresh", which is
> not true for every single cc module.
>
> > +             tp->prr_delivered = 0;
> > +             tp->prr_out = 0;
> >               tcp_set_ca_state(sk, TCP_CA_Recovery);
> >               fast_rexmit = 1;
> >       }
> >
> >       if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
> >               tcp_update_scoreboard(sk, fast_rexmit);
> > -     tcp_cwnd_down(sk, flag);
> > +     tp->prr_delivered += pkts_delivered;
>
> ...Is there a bug in the calculation here if the recovery was triggered
> with _a cumulative ACK_ that had enough SACK blocks?
>
> > +     tcp_update_cwnd_in_recovery(sk, pkts_delivered, fast_rexmit, flag);
> >       tcp_xmit_retransmit_queue(sk);
> >  }
> >
> > @@ -3632,6 +3675,11 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
> >       u32 prior_in_flight;
> >       u32 prior_fackets;
> >       int prior_packets;
> > +     int prior_sacked = tp->sacked_out;
> > +     /* pkts_delivered is the number of packets newly cumulatively acked
> > +      * or sacked on this ACK.
> > +      */
> > +     int pkts_delivered = 0;
>
> If you need a comment like that on a variable declaration, maybe its name
> could be more obvious, pkts_acked_sacked?

Changed this to newly_acked_sacked as per your suggestion.

> >       int frto_cwnd = 0;
> >
> >       /* If the ack is older than previous acks
> > @@ -3703,6 +3751,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
> >       /* See if we can take anything off of the retransmit queue. */
> >       flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
> > +     pkts_delivered = (prior_packets - prior_sacked) -
> > +                      (tp->packets_out - tp->sacked_out);
> > +
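Putting the pieces together, a self-contained toy run of one recovery
episode (an assumed scenario, not from the patch: cwnd 20 at recovery
start, ssthresh 10, each ACK newly delivering two packets, no further
losses) illustrates how the accounting above plays out:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t prr_cwnd = 20, ssthresh = 10;	/* cwnd at recovery start */
	uint32_t prr_delivered = 0, prr_out = 0, in_flight = 20;

	for (int ack = 1; ack <= 10; ack++) {
		int delivered = 2;	/* newly acked/sacked on this ACK */
		int sndcnt;

		prr_delivered += delivered;
		in_flight -= delivered;

		if (in_flight > ssthresh) {
			/* proportional phase: ceil(delivered * ratio) - sent */
			sndcnt = (int)((prr_delivered * ssthresh +
					prr_cwnd - 1) / prr_cwnd) -
				 (int)prr_out;
		} else {
			/* slow start reduction bound near the end */
			int delta = (int)ssthresh - (int)in_flight;
			int inc = (int)(prr_delivered - prr_out) > delivered ?
				  (int)(prr_delivered - prr_out) : delivered;

			sndcnt = delta < inc + 1 ? delta : inc + 1;
		}
		if (sndcnt < 0)
			sndcnt = 0;

		prr_out += sndcnt;
		in_flight += sndcnt;
		printf("ack %2d: prr_delivered=%2u prr_out=%2u cwnd=%2u\n",
		       ack, (unsigned)prr_delivered, (unsigned)prr_out,
		       (unsigned)in_flight);
	}
	/* Ends with prr_out == 10 == ssthresh: roughly one packet sent
	 * for every two delivered, the halving spread over a full RTT. */
	return 0;
}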
> >       if (tp->frto_counter)
> >               frto_cwnd = tcp_process_frto(sk, flag);
> >       /* Guarantee sacktag reordering detection against wrap-arounds */
> > @@ -3715,7 +3766,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
> >                   tcp_may_raise_cwnd(sk, flag))
> >                       tcp_cong_avoid(sk, ack, prior_in_flight);
> >               tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
> > -                                   flag);
> > +                                   pkts_delivered, flag);
> >       } else {
> >               if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
> >                       tcp_cong_avoid(sk, ack, prior_in_flight);
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 882e0b0..ca50408 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1796,11 +1796,13 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> >               tcp_event_new_data_sent(sk, skb);
> >
> >               tcp_minshall_update(tp, mss_now, skb);
> > -             sent_pkts++;
> > +             sent_pkts += tcp_skb_pcount(skb);
> >
> >               if (push_one)
> >                       break;
> >       }
> > +     if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
> > +             tp->prr_out += sent_pkts;
>
> Is there some harm done if you get rid of the conditionality?
>
> >       if (likely(sent_pkts)) {
> >               tcp_cwnd_validate(sk);
> > @@ -2294,6 +2296,9 @@ begin_fwd:
> >                       return;
> >               NET_INC_STATS_BH(sock_net(sk), mib_idx);
> >
> > +             if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
> > +                     tp->prr_out += tcp_skb_pcount(skb);
> > +
>
> ...same here?

No harm is done if the condition is removed in either of these cases.
However, I'm letting the condition remain for now for ease of readability.

> >               if (skb == tcp_write_queue_head(sk))
> >                       inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> >                                                 inet_csk(sk)->icsk_rto,
>
> Besides the points above, I'm not convinced that we really need all the
> across-packet state that is present; some relevant values should be
> available before we do any ACK processing (in-flight, cwnd, and more
> importantly, the difference between them). ...But I haven't yet
> convinced myself of anything in particular.
>
> --
>  i.