* [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery()
@ 2012-05-02 18:46 Yuchung Cheng
2012-05-02 18:46 ` [PATCH v3 2/3] tcp: early retransmit Yuchung Cheng
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Yuchung Cheng @ 2012-05-02 18:46 UTC (permalink / raw)
To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng
This is a preparation patch that refactors the code for entering
recovery into a new function, tcp_enter_recovery(). It is needed to
implement the delayed fast retransmit in ER.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog since v1:
- swapped part 1 and part 2
ChangeLog since v2:
- removed RFC in commit message
net/ipv4/tcp_input.c | 61 +++++++++++++++++++++++++++----------------------
1 files changed, 34 insertions(+), 27 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c93b0cb..22df826 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3022,6 +3022,38 @@ static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
}
+static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ int mib_idx;
+
+ if (tcp_is_reno(tp))
+ mib_idx = LINUX_MIB_TCPRENORECOVERY;
+ else
+ mib_idx = LINUX_MIB_TCPSACKRECOVERY;
+
+ NET_INC_STATS_BH(sock_net(sk), mib_idx);
+
+ tp->high_seq = tp->snd_nxt;
+ tp->prior_ssthresh = 0;
+ tp->undo_marker = tp->snd_una;
+ tp->undo_retrans = tp->retrans_out;
+
+ if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) {
+ if (!ece_ack)
+ tp->prior_ssthresh = tcp_current_ssthresh(sk);
+ tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk);
+ TCP_ECN_queue_cwr(tp);
+ }
+
+ tp->bytes_acked = 0;
+ tp->snd_cwnd_cnt = 0;
+ tp->prior_cwnd = tp->snd_cwnd;
+ tp->prr_delivered = 0;
+ tp->prr_out = 0;
+ tcp_set_ca_state(sk, TCP_CA_Recovery);
+}
+
/* Process an event, which can update packets-in-flight not trivially.
* Main goal of this function is to calculate new estimate for left_out,
* taking into account both packets sitting in receiver's buffer and
@@ -3041,7 +3073,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
struct tcp_sock *tp = tcp_sk(sk);
int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
(tcp_fackets_out(tp) > tp->reordering));
- int fast_rexmit = 0, mib_idx;
+ int fast_rexmit = 0;
if (WARN_ON(!tp->packets_out && tp->sacked_out))
tp->sacked_out = 0;
@@ -3142,32 +3174,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
}
/* Otherwise enter Recovery state */
-
- if (tcp_is_reno(tp))
- mib_idx = LINUX_MIB_TCPRENORECOVERY;
- else
- mib_idx = LINUX_MIB_TCPSACKRECOVERY;
-
- NET_INC_STATS_BH(sock_net(sk), mib_idx);
-
- tp->high_seq = tp->snd_nxt;
- tp->prior_ssthresh = 0;
- tp->undo_marker = tp->snd_una;
- tp->undo_retrans = tp->retrans_out;
-
- if (icsk->icsk_ca_state < TCP_CA_CWR) {
- if (!(flag & FLAG_ECE))
- tp->prior_ssthresh = tcp_current_ssthresh(sk);
- tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
- TCP_ECN_queue_cwr(tp);
- }
-
- tp->bytes_acked = 0;
- tp->snd_cwnd_cnt = 0;
- tp->prior_cwnd = tp->snd_cwnd;
- tp->prr_delivered = 0;
- tp->prr_out = 0;
- tcp_set_ca_state(sk, TCP_CA_Recovery);
+ tcp_enter_recovery(sk, (flag & FLAG_ECE));
fast_rexmit = 1;
}
--
1.7.7.3
* [PATCH v3 2/3] tcp: early retransmit
2012-05-02 18:46 [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery() Yuchung Cheng
@ 2012-05-02 18:46 ` Yuchung Cheng
2012-05-02 19:14 ` Neal Cardwell
2012-05-02 18:46 ` [PATCH v3 3/3] tcp: early retransmit: delayed fast retransmit Yuchung Cheng
` (2 subsequent siblings)
3 siblings, 1 reply; 8+ messages in thread
From: Yuchung Cheng @ 2012-05-02 18:46 UTC (permalink / raw)
To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng
This patch implements RFC 5827 early retransmit (ER) for TCP.
It reduces the DUPACK threshold (dupthresh) when fewer than 4 packets
are outstanding, so that losses can be recovered by fast recovery
instead of a timeout.
While the algorithm is simple, small but frequent network reordering
makes this feature dangerous: the connection repeatedly enters
false recovery and degrades performance. Therefore we implement
a mitigation suggested in the appendix of the RFC that delays
entering fast recovery by a small interval, i.e., RTT/4. But when
the degree of network reordering is too large, i.e., 3 packets,
ER is disabled to avoid false fast recoveries for the rest of
the connection. The performance impact of ER is summarized in
section 6 of the paper "Proportional Rate Reduction for TCP",
IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
Note that Linux has a similar feature called THIN_DUPACK. The
difference is that THIN_DUPACK does not mitigate reordering and is
only used after slow start. Currently ER is disabled if THIN_DUPACK
is enabled. I would be happy to merge the THIN_DUPACK feature with
ER if people think it's a good idea.
ER is enabled by sysctl_tcp_early_retrans:
0: Disables ER
1: Reduces dupthresh to packets_out - 1 when outstanding packets < 4.
2: (Default) Reduces dupthresh as in mode 1. In addition, delays
entering fast recovery by RTT/4.
Note: mode 2 is implemented in the third part of this patch series.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog in v2:
- swapped part1 and part2
ChangeLog in v3:
- trivial text fixes in documentation.
Documentation/networking/ip-sysctl.txt | 14 ++++++++++++++
include/linux/tcp.h | 7 ++++---
include/net/tcp.h | 15 +++++++++++++++
net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
net/ipv4/tcp.c | 3 +++
net/ipv4/tcp_input.c | 13 +++++++++++++
net/ipv4/tcp_minisocks.c | 1 +
7 files changed, 60 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 9b569a2..34916e7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -190,6 +190,20 @@ tcp_cookie_size - INTEGER
tcp_dsack - BOOLEAN
Allows TCP to send "duplicate" SACKs.
+tcp_early_retrans - INTEGER
+ Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
+ for triggering fast retransmit when the amount of outstanding data is
+ small and when no previously unsent data can be transmitted (such
+ that limited transmit could be used).
+ Possible values:
+ 0 disables ER
+ 1 enables ER
+ 2 enables ER but delays fast recovery and fast retransmit
+ by a fourth of the RTT. This mitigates the connection
+ falsely entering recovery when the network has a small
+ degree of reordering (less than 3 packets).
+ Default: 2
+
tcp_ecn - INTEGER
Enable Explicit Congestion Notification (ECN) in TCP. ECN is only
used when both ends of the TCP flow support it. It is useful to
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 278af9e..7d08a79 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -365,12 +365,13 @@ struct tcp_sock {
u32 frto_highmark; /* snd_nxt when RTO occurred */
u16 advmss; /* Advertised MSS */
- u8 frto_counter; /* Number of new acks after RTO */
- u8 nonagle : 4,/* Disable Nagle algorithm? */
+ u16 nonagle : 4,/* Disable Nagle algorithm? */
thin_lto : 1,/* Use linear timeouts for thin streams */
thin_dupack : 1,/* Fast retransmit on first dupack */
repair : 1,
- unused : 1;
+ do_early_retrans: 1;/* Enable RFC5827 early-retransmit */
+
+ u8 frto_counter; /* Number of new acks after RTO */
u8 repair_queue;
/* RTT measurement */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0fb84de..685437a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -252,6 +252,7 @@ extern int sysctl_tcp_max_ssthresh;
extern int sysctl_tcp_cookie_size;
extern int sysctl_tcp_thin_linear_timeouts;
extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_early_retrans;
extern atomic_long_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
@@ -797,6 +798,20 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
tp->rx_opt.sack_ok |= TCP_FACK_ENABLED;
}
+/* TCP early-retransmit (ER) is similar to but more conservative than
+ * the thin-dupack feature. Enable ER only if thin-dupack is disabled.
+ */
+static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
+{
+ tp->do_early_retrans = sysctl_tcp_early_retrans &&
+ !sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
+}
+
+static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
+{
+ tp->do_early_retrans = 0;
+}
+
static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
{
return tp->sacked_out + tp->lost_out;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 33417f8..ef32956 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -27,6 +27,7 @@
#include <net/tcp_memcontrol.h>
static int zero;
+static int two = 2;
static int tcp_retr1_max = 255;
static int ip_local_port_range_min[] = { 1, 1 };
static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -677,6 +678,15 @@ static struct ctl_table ipv4_table[] = {
.proc_handler = proc_dointvec
},
{
+ .procname = "tcp_early_retrans",
+ .data = &sysctl_tcp_early_retrans,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &two,
+ },
+ {
.procname = "udp_mem",
.data = &sysctl_udp_mem,
.maxlen = sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9670af3..6802c89 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -395,6 +395,7 @@ void tcp_init_sock(struct sock *sk)
tp->mss_cache = TCP_MSS_DEFAULT;
tp->reordering = sysctl_tcp_reordering;
+ tcp_enable_early_retrans(tp);
icsk->icsk_ca_ops = &tcp_init_congestion_ops;
sk->sk_state = TCP_CLOSE;
@@ -2495,6 +2496,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
err = -EINVAL;
else
tp->thin_dupack = val;
+ if (tp->thin_dupack)
+ tcp_disable_early_retrans(tp);
break;
case TCP_REPAIR:
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 22df826..98c586d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -99,6 +99,7 @@ int sysctl_tcp_thin_dupack __read_mostly;
int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
int sysctl_tcp_abc __read_mostly;
+int sysctl_tcp_early_retrans __read_mostly = 2;
#define FLAG_DATA 0x01 /* Incoming frame contained data. */
#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
@@ -906,6 +907,7 @@ static void tcp_init_metrics(struct sock *sk)
if (dst_metric(dst, RTAX_REORDERING) &&
tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
tcp_disable_fack(tp);
+ tcp_disable_early_retrans(tp);
tp->reordering = dst_metric(dst, RTAX_REORDERING);
}
@@ -987,6 +989,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
tp->undo_marker ? tp->undo_retrans : 0);
#endif
tcp_disable_fack(tp);
+ tcp_disable_early_retrans(tp);
}
}
@@ -2492,6 +2495,16 @@ static int tcp_time_to_recover(struct sock *sk)
tcp_is_sack(tp) && !tcp_send_head(sk))
return 1;
+ /* Trick#6: TCP early retransmit, per RFC5827. To avoid spurious
+ * retransmissions due to small network reorderings, we implement
+ * Mitigation A.3 in the RFC and delay the retransmission for a short
+ * interval if appropriate.
+ */
+ if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
+ (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
+ !tcp_may_send_now(sk))
+ return 1;
+
return 0;
}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 3cabafb..6f6a918 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -482,6 +482,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
newtp->sacked_out = 0;
newtp->fackets_out = 0;
newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
+ tcp_enable_early_retrans(newtp);
/* So many TCP implementations out there (incorrectly) count the
* initial SYN frame in their delayed-ACK and congestion control
--
1.7.7.3
* [PATCH v3 3/3] tcp: early retransmit: delayed fast retransmit
2012-05-02 18:46 [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery() Yuchung Cheng
2012-05-02 18:46 ` [PATCH v3 2/3] tcp: early retransmit Yuchung Cheng
@ 2012-05-02 18:46 ` Yuchung Cheng
2012-05-02 19:34 ` Neal Cardwell
2012-05-02 18:49 ` [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery() Eric Dumazet
2012-05-02 19:03 ` Neal Cardwell
3 siblings, 1 reply; 8+ messages in thread
From: Yuchung Cheng @ 2012-05-02 18:46 UTC (permalink / raw)
To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng
This patch implements the advanced early retransmit
(sysctl_tcp_early_retrans == 2), which delays the fast retransmit by
an interval of RTT/4. We borrow the RTO timer to implement the delay.
If we receive another ACK or send a new packet, the timer is cancelled
and restored to the original RTO value, offset by the time elapsed.
When the delayed-ER timer fires, we enter fast recovery and perform
fast retransmit.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog in v2:
- Set sysctl_tcp_early_retrans default to 2
ChangeLog in v3:
- use a separate u8 for the early retrans state in tcp_sock
- disable ER if any reordering is detected
include/linux/tcp.h | 8 +++---
include/net/tcp.h | 3 ++
net/ipv4/tcp_input.c | 72 +++++++++++++++++++++++++++++++++++++++++++-----
net/ipv4/tcp_output.c | 5 +--
net/ipv4/tcp_timer.c | 5 +++
5 files changed, 78 insertions(+), 15 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7d08a79..f8e15b2 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -365,12 +365,12 @@ struct tcp_sock {
u32 frto_highmark; /* snd_nxt when RTO occurred */
u16 advmss; /* Advertised MSS */
- u16 nonagle : 4,/* Disable Nagle algorithm? */
+ u8 nonagle : 4,/* Disable Nagle algorithm? */
thin_lto : 1,/* Use linear timeouts for thin streams */
thin_dupack : 1,/* Fast retransmit on first dupack */
- repair : 1,
- do_early_retrans: 1;/* Enable RFC5827 early-retransmit */
-
+ repair : 1;
+ u8 do_early_retrans:1,/* Enable RFC5827 early retransmit (ER) */
+ early_retrans_delayed:1; /* Delayed ER timer installed */
u8 frto_counter; /* Number of new acks after RTO */
u8 repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 685437a..5283aa4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -500,6 +500,8 @@ extern void tcp_send_delayed_ack(struct sock *sk);
/* tcp_input.c */
extern void tcp_cwnd_application_limited(struct sock *sk);
+extern void tcp_resume_early_retransmit(struct sock *sk);
+extern void tcp_rearm_rto(struct sock *sk);
/* tcp_timer.c */
extern void tcp_init_xmit_timers(struct sock *);
@@ -805,6 +807,7 @@ static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
{
tp->do_early_retrans = sysctl_tcp_early_retrans &&
!sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
+ tp->early_retrans_delayed = 0;
}
static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 98c586d..9363a54 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -989,8 +989,9 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
tp->undo_marker ? tp->undo_retrans : 0);
#endif
tcp_disable_fack(tp);
- tcp_disable_early_retrans(tp);
}
+ if (metric > 0)
+ tcp_disable_early_retrans(tp);
}
/* This must be called before lost_out is incremented */
@@ -2342,6 +2343,27 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
}
+static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ unsigned long delay;
+
+ /* Delay early retransmit and entering fast recovery for
+ * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
+ * available, or RTO is scheduled to fire first.
+ */
+ if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
+ return false;
+
+ delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));
+ if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
+ return false;
+
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);
+ tp->early_retrans_delayed = 1;
+ return true;
+}
+
static inline int tcp_skb_timedout(const struct sock *sk,
const struct sk_buff *skb)
{
@@ -2449,7 +2471,7 @@ static inline int tcp_head_timedout(const struct sock *sk)
* Main question: may we further continue forward transmission
* with the same cwnd?
*/
-static int tcp_time_to_recover(struct sock *sk)
+static int tcp_time_to_recover(struct sock *sk, int flag)
{
struct tcp_sock *tp = tcp_sk(sk);
__u32 packets_out;
@@ -2503,7 +2525,7 @@ static int tcp_time_to_recover(struct sock *sk)
if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
(tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
!tcp_may_send_now(sk))
- return 1;
+ return !tcp_pause_early_retransmit(sk, flag);
return 0;
}
@@ -3170,7 +3192,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
if (icsk->icsk_ca_state <= TCP_CA_Disorder)
tcp_try_undo_dsack(sk);
- if (!tcp_time_to_recover(sk)) {
+ if (!tcp_time_to_recover(sk, flag)) {
tcp_try_to_open(sk, flag);
return;
}
@@ -3269,16 +3291,47 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
/* Restart timer after forward progress on connection.
* RFC2988 recommends to restart timer to now+rto.
*/
-static void tcp_rearm_rto(struct sock *sk)
+void tcp_rearm_rto(struct sock *sk)
{
- const struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
if (!tp->packets_out) {
inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
} else {
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
- inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+ u32 rto = inet_csk(sk)->icsk_rto;
+ /* Offset the time elapsed after installing regular RTO */
+ if (tp->early_retrans_delayed) {
+ struct sk_buff *skb = tcp_write_queue_head(sk);
+ const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
+ s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
+ /* delta may not be positive if the socket is locked
+ * when the delayed ER timer fires and is rescheduled.
+ */
+ if (delta > 0)
+ rto = delta;
+ }
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
+ TCP_RTO_MAX);
}
+ tp->early_retrans_delayed = 0;
+}
+
+/* This function is called when the delayed ER timer fires. TCP enters
+ * fast recovery and performs fast-retransmit.
+ */
+void tcp_resume_early_retransmit(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ tcp_rearm_rto(sk);
+
+ /* Stop if ER is disabled after the delayed ER timer is scheduled */
+ if (!tp->do_early_retrans)
+ return;
+
+ tcp_enter_recovery(sk, false);
+ tcp_update_scoreboard(sk, 1);
+ tcp_xmit_retransmit_queue(sk);
}
/* If we get here, the whole TSO packet has not been acked. */
@@ -3727,6 +3780,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if (after(ack, tp->snd_nxt))
goto invalid_ack;
+ if (tp->early_retrans_delayed)
+ tcp_rearm_rto(sk);
+
if (after(ack, prior_snd_una))
flag |= FLAG_SND_UNA_ADVANCED;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 834e89f..d947330 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -78,9 +78,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
tp->frto_counter = 3;
tp->packets_out += tcp_skb_pcount(skb);
- if (!prior_packets)
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
- inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+ if (!prior_packets || tp->early_retrans_delayed)
+ tcp_rearm_rto(sk);
}
/* SND.NXT, if window was not shrunk.
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 34d4a02..e911e6c 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -319,6 +319,11 @@ void tcp_retransmit_timer(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
+ if (tp->early_retrans_delayed) {
+ tcp_resume_early_retransmit(sk);
+ return;
+ }
+
if (!tp->packets_out)
goto out;
--
1.7.7.3
* Re: [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery()
2012-05-02 18:46 [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery() Yuchung Cheng
2012-05-02 18:46 ` [PATCH v3 2/3] tcp: early retransmit Yuchung Cheng
2012-05-02 18:46 ` [PATCH v3 3/3] tcp: early retransmit: delayed fast retransmit Yuchung Cheng
@ 2012-05-02 18:49 ` Eric Dumazet
2012-05-02 19:03 ` Neal Cardwell
3 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2012-05-02 18:49 UTC (permalink / raw)
To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, ncardwell, nanditad, netdev
On Wed, 2012-05-02 at 11:46 -0700, Yuchung Cheng wrote:
> This is a preparation patch that refactors the code for entering
> recovery into a new function, tcp_enter_recovery(). It is needed to
> implement the delayed fast retransmit in ER.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
> ChangeLog since v1:
> - swapped part 1 and part 2
> ChangeLog since v2:
> - removed RFC in commit message
>
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery()
2012-05-02 18:46 [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery() Yuchung Cheng
` (2 preceding siblings ...)
2012-05-02 18:49 ` [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery() Eric Dumazet
@ 2012-05-02 19:03 ` Neal Cardwell
3 siblings, 0 replies; 8+ messages in thread
From: Neal Cardwell @ 2012-05-02 19:03 UTC (permalink / raw)
To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, nanditad, netdev
On Wed, May 2, 2012 at 2:46 PM, Yuchung Cheng <ycheng@google.com> wrote:
> This is a preparation patch that refactors the code for entering
> recovery into a new function, tcp_enter_recovery(). It is needed to
> implement the delayed fast retransmit in ER.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
> ChangeLog since v1:
> - swapped part 1 and part 2
> ChangeLog since v2:
> - removed RFC in commit message
Acked-by: Neal Cardwell <ncardwell@google.com>
neal
* Re: [PATCH v3 2/3] tcp: early retransmit
2012-05-02 18:46 ` [PATCH v3 2/3] tcp: early retransmit Yuchung Cheng
@ 2012-05-02 19:14 ` Neal Cardwell
0 siblings, 0 replies; 8+ messages in thread
From: Neal Cardwell @ 2012-05-02 19:14 UTC (permalink / raw)
To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, nanditad, netdev
On Wed, May 2, 2012 at 2:46 PM, Yuchung Cheng <ycheng@google.com> wrote:
I see this version of the patch decided against the suggestions I
made on Monday (repasted below). Do you have a sec to share why you
wanted to stick with the existing behavior?
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -365,12 +365,13 @@ struct tcp_sock {
>
> u32 frto_highmark; /* snd_nxt when RTO occurred */
> u16 advmss; /* Advertised MSS */
> - u8 frto_counter; /* Number of new acks after RTO */
> - u8 nonagle : 4,/* Disable Nagle algorithm? */
> + u16 nonagle : 4,/* Disable Nagle algorithm? */
> thin_lto : 1,/* Use linear timeouts for thin streams */
> thin_dupack : 1,/* Fast retransmit on first dupack */
> repair : 1,
> - unused : 1;
> + do_early_retrans: 1;/* Enable RFC5827 early-retransmit */
> +
> + u8 frto_counter; /* Number of new acks after RTO */
> u8 repair_queue;
To keep the change minimal and reduce the risk of mysterious
performance regressions from cache effects, I'd suggest keeping the
frto_counter and nonagle u8 bytes as u8 bytes in their current
location, and add a new u8 for the two ER bits. Same amount of space
as the scheme in the patch, just less shuffling.
> @@ -987,6 +989,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
> tp->undo_marker ? tp->undo_retrans : 0);
> #endif
> tcp_disable_fack(tp);
> + tcp_disable_early_retrans(tp);
> }
> }
I think we should stick with the behavior where we disable early
retransmit any time tcp_update_reordering() is called with a non-zero
reordering metric. This is what we've tested and measured, and my
sense is that we could risk a significant number of spurious ER
firings if instead we relax this so that only reordering >3 causes us
to disable ER. I know the delayed ER should help avoid spurious ER
firings when there is a small degree of reordering, but my guess would
be that the max(RTT/4, 2ms) is perhaps not big enough if we're
allowing delayed ER for connections that have already witnessed small
degrees of reordering. So until we have more experimental data, I'd
recommend sticking with:
if (metric > 0)
tcp_disable_early_retrans(tp);
neal
* Re: [PATCH v3 3/3] tcp: early retransmit: delayed fast retransmit
2012-05-02 18:46 ` [PATCH v3 3/3] tcp: early retransmit: delayed fast retransmit Yuchung Cheng
@ 2012-05-02 19:34 ` Neal Cardwell
2012-05-02 21:17 ` Yuchung Cheng
0 siblings, 1 reply; 8+ messages in thread
From: Neal Cardwell @ 2012-05-02 19:34 UTC (permalink / raw)
To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, nanditad, netdev
On Wed, May 2, 2012 at 2:46 PM, Yuchung Cheng <ycheng@google.com> wrote:
> This patch implements the advanced early retransmit
> (sysctl_tcp_early_retrans == 2), which delays the fast retransmit by
> an interval of RTT/4. We borrow the RTO timer to implement the delay.
> If we receive another ACK or send a new packet, the timer is cancelled
> and restored to the original RTO value, offset by the time elapsed.
> When the delayed-ER timer fires, we enter fast recovery and perform
> fast retransmit.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
> ChangeLog in v2:
> - Set sysctl_tcp_early_retrans default to 2
> ChangeLog in v3:
> - use a separate u8 for the early retrans state in tcp_sock
> - disable ER if any reordering is detected
After reading patch 3 of the series, I see that patch 3 incorporates
most of those suggestions from Monday. I think it would be quite a bit
cleaner to just have patch 2 of the series put the
tcp_disable_early_retrans() call and tcp_sock fields in the ultimately
desired place, rather than having patch 2 put them somewhere and patch
3 move them, but maybe that's just me.
When all the patches in the series are applied, the one issue I still
see is that frto_counter is in a new place, a bit further away from
frto_highmark than it used to be before the patch series. I think it
would be good to keep frto_counter in its original location, back up
nearer to frto_highmark.
neal
* Re: [PATCH v3 3/3] tcp: early retransmit: delayed fast retransmit
2012-05-02 19:34 ` Neal Cardwell
@ 2012-05-02 21:17 ` Yuchung Cheng
0 siblings, 0 replies; 8+ messages in thread
From: Yuchung Cheng @ 2012-05-02 21:17 UTC (permalink / raw)
To: Neal Cardwell; +Cc: davem, ilpo.jarvinen, nanditad, netdev
On Wed, May 2, 2012 at 12:34 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, May 2, 2012 at 2:46 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> This patch implements the advanced early retransmit
>> (sysctl_tcp_early_retrans == 2), which delays the fast retransmit by
>> an interval of RTT/4. We borrow the RTO timer to implement the delay.
>> If we receive another ACK or send a new packet, the timer is cancelled
>> and restored to the original RTO value, offset by the time elapsed.
>> When the delayed-ER timer fires, we enter fast recovery and perform
>> fast retransmit.
>>
>> Signed-off-by: Yuchung Cheng <ycheng@google.com>
>> ---
>> ChangeLog in v2:
>> - Set sysctl_tcp_early_retrans default to 2
>> ChangeLog in v3:
>> - use a separate u8 for the early retrans state in tcp_sock
>> - disable ER if any reordering is detected
>
> After reading patch 3 of the series, I see that patch 3 incorporates
> most of those suggestions from Monday. I think it would be quite a bit
> cleaner to just have patch 2 of the series put the
> tcp_disable_early_retrans() call and tcp_sock fields in the ultimately
> desired place, rather than having patch 2 put them somewhere and patch
> 3 move them, but maybe that's just me.
>
> When all the patches in the series are applied, the one issue I still
> see is that frto_counter is in a new place, a bit further away from
> frto_highmark than it used to be before the patch series. I think it
> would be good to keep frto_counter in its original location, back up
> nearer to frto_highmark.
Sorry for repeatedly misplacing the changes across the patch parts.
Will fix ASAP.
>
> neal