* [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
@ 2016-10-11  0:18 Lawrence Brakmo
  2016-10-11 19:49 ` Yuchung Cheng
From: Lawrence Brakmo @ 2016-10-11  0:18 UTC (permalink / raw)
  To: netdev; +Cc: Kernel Team, Eric Dumazet, Yuchung Cheng, Neal Cardwell

The purpose of this patch is to help balance flows across paths. A new
sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
A probability is used in order to control how many flows are moved
during a congestion event and prevent the congested path from becoming
underutilized (which could occur if too many flows leave the current
path). Txhash changes may be delayed in order to decrease the likelihood
that they trigger retransmits due to too much reordering.
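
Concretely, the once-per-recovery decision boils down to the sketch
below (a simplified version of the tcp_output.c hunk in this patch,
not the code verbatim; in_flight_low() is shorthand for the
(packets_out - sacked_out) < tp->reordering test):

    /* prandom_u32_max(100) returns a value in [0, 100), so prob=20
     * moves roughly 20% of the flows that enter recovery.
     */
    if (prandom_u32_max(100) < sysctl_tcp_retrans_txhash_prob) {
            if (in_flight_low(tp))          /* safe w.r.t. reordering */
                    sk_set_txhash(sk);      /* pick a new flowlabel now */
            else
                    tp->txhash_want = 1;    /* defer until it is safe */
    }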

Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
txhash changes. The idea is to decrease the likelihood of going back
to a broken path. That is, we don't want flow balancing to trigger
changes to broken paths. The drawback is that flow balancing does
not work as well. If the sysctl is greater than 0, then we always
do flow balancing, even after RTOs.
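
For example, to enable the feature with the prob=20,mode=1
configuration used in the experiments below (the sysctls are
registered under net.ipv4, per the sysctl_net_ipv4.c hunk):

    sysctl -w net.ipv4.tcp_retrans_txhash_prob=20
    sysctl -w net.ipv4.tcp_retrans_txhash_mode=1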

Tested with packetdrill tests (for correctness) and performance
experiments with 2 and 3 paths. Performance experiments looked at
aggregate goodput and fairness. For each run, we looked at the ratio of
the goodputs for the fastest and slowest flows. These were averaged for
all the runs. A fairness of 1 means all flows had the same goodput, a
fairness of 2 means the fastest flow was twice as fast as the slowest
flow.
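
For example, if the fastest flow achieved 600 Mbit/sec and the slowest
400 Mbit/sec, the fairness for that run would be 600/400 = 1.5.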

The setup for the performance experiments was 4 or 5 servers in a rack,
10G links. I tested various probabilities, but 20 seemed to have the
best tradeoff for my setup (small RTTs).

                      --- node1 -----
    sender --- switch --- node2 ----- switch ---- receiver
                      --- node3 -----

Scenario 1: One sender sends to one receiver through 2 routes (node1 or
node2). The output from node1 and node2 is 1G (1gbit/sec). With only 2
flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
1.9G with flow balancing due to 2 flows ending up in one link and either
not moving or taking some time to move. Fairness was 1 in all cases.
For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2
for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
flow balancing increased fairness.

Scenario 2: One sender to one receiver, through 3 routes (node1,
node2, node3). With 6 or 16 flows the goodput was the same for all, but
fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
case fairness out of 10 runs was 2.2, 1.8 and 1.4 respectively. That is,
prob=20,mode=1 improved average and worst case fairness.

Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of
the packets. With 7 flows, goodput was the same 1.1G, but fairness was
1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then
balancing, which does more re-routes, is less fair.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 Documentation/networking/ip-sysctl.txt | 15 +++++++++++++++
 include/linux/tcp.h                    |  4 +++-
 include/net/tcp.h                      |  2 ++
 net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++++
 net/ipv4/tcp_input.c                   | 10 ++++++++++
 net/ipv4/tcp_output.c                  | 23 ++++++++++++++++++++++-
 net/ipv4/tcp_timer.c                   |  4 ++++
 7 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 3db8c67..87a984c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -472,6 +472,21 @@ tcp_max_reordering - INTEGER
 	if paths are using per packet load balancing (like bonding rr mode)
 	Default: 300
 
+tcp_retrans_txhash_mode - INTEGER
+	If zero, disable txhash recalculation due to non-RTO retransmissions
+	after an RTO. The idea is that broken paths will trigger an RTO and
+	we don't want to go back to that path due to standard retransmissions
+	(flow balancing). The drawback is that balancing is less robust.
+	If greater than zero, can always (probabilistically) recalculate
+	txhash after non-RTO retransmissions.
+
+tcp_retrans_txhash_prob - INTEGER
+	Probability [0 to 100] that we will recalculate txhash when a
+	packet is resent not due to RTO (for RTO txhash is always recalculated).
+	The recalculation of the txhash may be delayed to decrease the
+	likelihood that reordering will trigger retransmissions.
+	The purpose is to help balance the flows among the possible paths.
+
 tcp_retrans_collapse - BOOLEAN
 	Bug-to-bug compatibility with some broken printers.
 	On retransmit try to send bigger packets to work around bugs in
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a17ae7b..e0e3b7d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -214,7 +214,9 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	rate_app_limited:1,  /* rate_{delivered,interval_us} limited? */
-		unused:7;
+		txhash_rto:1,	/* If set, don't do flow balancing	*/
+		txhash_want:1,	/* We want to change txhash when safe	*/
+		unused:5;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f83b7f2..3abd304 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -271,6 +271,8 @@ extern int sysctl_tcp_autocorking;
 extern int sysctl_tcp_invalid_ratelimit;
 extern int sysctl_tcp_pacing_ss_ratio;
 extern int sysctl_tcp_pacing_ca_ratio;
+extern int sysctl_tcp_retrans_txhash_prob;
+extern int sysctl_tcp_retrans_txhash_mode;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1cb67de..00d6f26 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -28,6 +28,7 @@
 static int zero;
 static int one = 1;
 static int four = 4;
+static int hundred = 100;
 static int thousand = 1000;
 static int gso_max_segs = GSO_MAX_SEGS;
 static int tcp_retr1_max = 255;
@@ -624,6 +625,23 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec_ms_jiffies,
 	},
 	{
+		.procname	= "tcp_retrans_txhash_prob",
+		.data		= &sysctl_tcp_retrans_txhash_prob,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &hundred,
+	},
+	{
+		.procname	= "tcp_retrans_txhash_mode",
+		.data		= &sysctl_tcp_retrans_txhash_mode,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "icmp_msgs_per_sec",
 		.data		= &sysctl_icmp_msgs_per_sec,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a27b9c0..fed5366 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -101,6 +101,9 @@ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_early_retrans __read_mostly = 3;
 int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
 
+int sysctl_tcp_retrans_txhash_prob __read_mostly;
+int sysctl_tcp_retrans_txhash_mode __read_mostly;
+
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
 #define FLAG_DATA_ACKED		0x04 /* This ACK acknowledged new data.		*/
@@ -3674,6 +3677,13 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, &acked,
 				    &sack_state, &now);
 
+	/* Check if we should set txhash (would not cause reordering) */
+	if (tp->txhash_want &&
+	    (tp->packets_out - tp->sacked_out) < tp->reordering) {
+		sk_set_txhash(sk);
+		tp->txhash_want = 0;
+	}
+
 	if (tcp_ack_is_dubious(sk, flag)) {
 		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
 		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 896e9df..58490ac 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2738,9 +2738,30 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 		tp->retrans_out += tcp_skb_pcount(skb);
 
 		/* Save stamp of the first retransmit. */
-		if (!tp->retrans_stamp)
+		if (!tp->retrans_stamp) {
 			tp->retrans_stamp = tcp_skb_timestamp(skb);
 
+			/* Determine if we should reset hash, only done once
+			 * per recovery
+			 */
+			if ((!tp->txhash_rto ||
+			     sysctl_tcp_retrans_txhash_mode > 0) &&
+			    sk->sk_txhash &&
+			    (prandom_u32_max(100) <
+			     sysctl_tcp_retrans_txhash_prob)) {
+				/* If not too much reordering, or RTT is
+				 * small enough that we don't care about
+				 * reordering, then change it now.
+				 * Else, wait until it is safe.
+				 */
+				if ((tp->packets_out - tp->sacked_out) <
+				    tp->reordering)
+					sk_set_txhash(sk);
+				else
+					tp->txhash_want = 1;
+			}
+		}
+
 	} else if (err != -EBUSY) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
 	}
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 3ea1cf8..e66baad 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -186,6 +186,8 @@ static int tcp_write_timeout(struct sock *sk)
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits) {
+			tp->txhash_rto = 1;
+			tp->txhash_want = 0;
 			dst_negative_advice(sk);
 			if (tp->syn_fastopen || tp->syn_data)
 				tcp_fastopen_cache_set(sk, 0, NULL, true, 0);
@@ -218,6 +220,8 @@ static int tcp_write_timeout(struct sock *sk)
 		} else {
 			sk_rethink_txhash(sk);
 		}
+		tp->txhash_rto = 1;
+		tp->txhash_want = 0;
 
 		retry_until = net->ipv4.sysctl_tcp_retries2;
 		if (sock_flag(sk, SOCK_DEAD)) {
-- 
2.9.3


* Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
  2016-10-11  0:18 [PATCH net-next] tcp: Change txhash on some non-RTO retransmits Lawrence Brakmo
@ 2016-10-11 19:49 ` Yuchung Cheng
  2016-10-11 21:08   ` Lawrence Brakmo
From: Yuchung Cheng @ 2016-10-11 19:49 UTC (permalink / raw)
  To: Lawrence Brakmo; +Cc: netdev, Kernel Team, Eric Dumazet, Neal Cardwell

On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
>
> The purpose of this patch is to help balance flows across paths. A new
> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
> A probability is used in order to control how many flows are moved
> during a congestion event and prevent the congested path from becoming
> underutilized (which could occur if too many flows leave the current
> path). Txhash changes may be delayed in order to decrease the likelihood
> that they trigger retransmits due to too much reordering.
>
> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
> txhash changes. The idea is to decrease the likelihood of going back
> to a broken path. That is, we don't want flow balancing to trigger
> changes to broken paths. The drawback is that flow balancing does
> not work as well. If the sysctl is greater than 0, then we always
> do flow balancing, even after RTOs.
>
> Tested with packetdrill tests (for correctness) and performance
> experiments with 2 and 3 paths. Performance experiments looked at
> aggregate goodput and fairness. For each run, we looked at the ratio of
> the goodputs for the fastest and slowest flows. These were averaged for
> all the runs. A fairness of 1 means all flows had the same goodput, a
> fairness of 2 means the fastest flow was twice as fast as the slowest
> flow.
>
> The setup for the performance experiments was 4 or 5 servers in a rack,
> 10G links. I tested various probabilities, but 20 seemed to have the
> best tradeoff for my setup (small RTTs).
>
>                       --- node1 -----
>     sender --- switch --- node2 ----- switch ---- receiver
>                       --- node3 -----
>
> Scenario 1: One sender sends to one receiver through 2 routes (node1 or
> node2). The output from node1 and node2 is 1G (1gbit/sec). With only 2
> flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
> 1.9G with flow balancing due to 2 flows ending up in one link and either
> not moving or taking some time to move. Fairness was 1 in all cases.
> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2
> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
> flow balancing increased fairness.
>
> Scenario 2: One sender to one receiver, through 3 routes (node1,
> node2, node3). With 6 or 16 flows the goodput was the same for all, but
> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
> case fairness out of 10 runs was 2.2, 1.8 and 1.4 respectively. That is,
> prob=20,mode=1 improved average and worst case fairness.
I am wondering if we can build a better API with the routing layer to
implement this type of feature, instead of letting the tx_rehashing
logic creep and scatter through TCP. For example, we call
dst_negative_advice on TCP write timeouts.

On the patch itself, it seems aggressive to (attempt to) rehash every
post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to
identify post-RTO retransmissions directly.
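
For instance, the post-RTO case could be detected with something like

    if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss)
            /* post-RTO retransmission: skip flow balancing */

instead of carrying a dedicated txhash_rto bit (just a sketch of the
suggestion, untested).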

Is this an implementation of Flow Bender?
http://dl.acm.org/citation.cfm?id=2674985

>
> Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of
> the packets. With 7 flows, goodput was the same 1.1G, but fairness was
> 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then
> balancing, which does more re-routes, is less fair.
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> ---
>  Documentation/networking/ip-sysctl.txt | 15 +++++++++++++++
>  include/linux/tcp.h                    |  4 +++-
>  include/net/tcp.h                      |  2 ++
>  net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++++
>  net/ipv4/tcp_input.c                   | 10 ++++++++++
>  net/ipv4/tcp_output.c                  | 23 ++++++++++++++++++++++-
>  net/ipv4/tcp_timer.c                   |  4 ++++
>  7 files changed, 74 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 3db8c67..87a984c 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -472,6 +472,21 @@ tcp_max_reordering - INTEGER
>         if paths are using per packet load balancing (like bonding rr mode)
>         Default: 300
>
> +tcp_retrans_txhash_mode - INTEGER
> +       If zero, disable txhash recalculation due to non-RTO retransmissions
> +       after an RTO. The idea is that broken paths will trigger an RTO and
> +       we don't want to go back to that path due to standard retransmissions
> +       (flow balancing). The drawback is that balancing is less robust.
> +       If greater than zero, can always (probabilistically) recalculate
> +       txhash after non-RTO retransmissions.
> +
> +tcp_retrans_txhash_prob - INTEGER
> +       Probability [0 to 100] that we will recalculate txhash when a
> +       packet is resent not due to RTO (for RTO txhash is always recalculated).
> +       The recalculation of the txhash may be delayed to decrease the
> +       likelihood that reordering will trigger retransmissions.
> +       The purpose is to help balance the flows among the possible paths.
> +
>  tcp_retrans_collapse - BOOLEAN
>         Bug-to-bug compatibility with some broken printers.
>         On retransmit try to send bigger packets to work around bugs in
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index a17ae7b..e0e3b7d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -214,7 +214,9 @@ struct tcp_sock {
>         } rack;
>         u16     advmss;         /* Advertised MSS                       */
>         u8      rate_app_limited:1,  /* rate_{delivered,interval_us} limited? */
> -               unused:7;
> +               txhash_rto:1,   /* If set, don't do flow balancing      */
> +               txhash_want:1,  /* We want to change txhash when safe   */
> +               unused:5;
>         u8      nonagle     : 4,/* Disable Nagle algorithm?             */
>                 thin_lto    : 1,/* Use linear timeouts for thin streams */
>                 thin_dupack : 1,/* Fast retransmit on first dupack      */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index f83b7f2..3abd304 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -271,6 +271,8 @@ extern int sysctl_tcp_autocorking;
>  extern int sysctl_tcp_invalid_ratelimit;
>  extern int sysctl_tcp_pacing_ss_ratio;
>  extern int sysctl_tcp_pacing_ca_ratio;
> +extern int sysctl_tcp_retrans_txhash_prob;
> +extern int sysctl_tcp_retrans_txhash_mode;
>
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 1cb67de..00d6f26 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -28,6 +28,7 @@
>  static int zero;
>  static int one = 1;
>  static int four = 4;
> +static int hundred = 100;
>  static int thousand = 1000;
>  static int gso_max_segs = GSO_MAX_SEGS;
>  static int tcp_retr1_max = 255;
> @@ -624,6 +625,23 @@ static struct ctl_table ipv4_table[] = {
>                 .proc_handler   = proc_dointvec_ms_jiffies,
>         },
>         {
> +               .procname       = "tcp_retrans_txhash_prob",
> +               .data           = &sysctl_tcp_retrans_txhash_prob,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +               .extra2         = &hundred,
> +       },
> +       {
> +               .procname       = "tcp_retrans_txhash_mode",
> +               .data           = &sysctl_tcp_retrans_txhash_mode,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +       },
> +       {
>                 .procname       = "icmp_msgs_per_sec",
>                 .data           = &sysctl_icmp_msgs_per_sec,
>                 .maxlen         = sizeof(int),
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a27b9c0..fed5366 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -101,6 +101,9 @@ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>  int sysctl_tcp_early_retrans __read_mostly = 3;
>  int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
>
> +int sysctl_tcp_retrans_txhash_prob __read_mostly;
> +int sysctl_tcp_retrans_txhash_mode __read_mostly;
> +
>  #define FLAG_DATA              0x01 /* Incoming frame contained data.          */
>  #define FLAG_WIN_UPDATE                0x02 /* Incoming ACK was a window update.       */
>  #define FLAG_DATA_ACKED                0x04 /* This ACK acknowledged new data.         */
> @@ -3674,6 +3677,13 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>         flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, &acked,
>                                     &sack_state, &now);
>
> +       /* Check if we should set txhash (would not cause reordering) */
> +       if (tp->txhash_want &&
> +           (tp->packets_out - tp->sacked_out) < tp->reordering) {
> +               sk_set_txhash(sk);
> +               tp->txhash_want = 0;
> +       }
> +
>         if (tcp_ack_is_dubious(sk, flag)) {
>                 is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
>                 tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 896e9df..58490ac 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2738,9 +2738,30 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>                 tp->retrans_out += tcp_skb_pcount(skb);
>
>                 /* Save stamp of the first retransmit. */
> -               if (!tp->retrans_stamp)
> +               if (!tp->retrans_stamp) {
>                         tp->retrans_stamp = tcp_skb_timestamp(skb);
>
> +                       /* Determine if we should reset hash, only done once
> +                        * per recovery
> +                        */
> +                       if ((!tp->txhash_rto ||
> +                            sysctl_tcp_retrans_txhash_mode > 0) &&
> +                           sk->sk_txhash &&
> +                           (prandom_u32_max(100) <
> +                            sysctl_tcp_retrans_txhash_prob)) {
> +                               /* If not too much reordering, or RTT is
> +                                * small enough that we don't care about
> +                                * reordering, then change it now.
> +                                * Else, wait until it is safe.
> +                                */
> +                               if ((tp->packets_out - tp->sacked_out) <
> +                                   tp->reordering)
I can't parse this logic ... suppose reordering is 100 (not uncommon
today due to the last packet being delivered slightly earlier than the
rest) and cwnd == packets_out =~ 200; we only rehash once half of the
packets are sacked, so we are still rehashing even when reordering is
heavy?

Also, where do we check that RTT is small?

> +                                       sk_set_txhash(sk);
> +                               else
> +                                       tp->txhash_want = 1;
> +                       }
> +               }
> +
>         } else if (err != -EBUSY) {
>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
>         }
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 3ea1cf8..e66baad 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -186,6 +186,8 @@ static int tcp_write_timeout(struct sock *sk)
>
>         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>                 if (icsk->icsk_retransmits) {
> +                       tp->txhash_rto = 1;
> +                       tp->txhash_want = 0;
>                         dst_negative_advice(sk);
>                         if (tp->syn_fastopen || tp->syn_data)
>                                 tcp_fastopen_cache_set(sk, 0, NULL, true, 0);
> @@ -218,6 +220,8 @@ static int tcp_write_timeout(struct sock *sk)
>                 } else {
>                         sk_rethink_txhash(sk);
>                 }
> +               tp->txhash_rto = 1;
> +               tp->txhash_want = 0;
>
>                 retry_until = net->ipv4.sysctl_tcp_retries2;
>                 if (sock_flag(sk, SOCK_DEAD)) {
> --
> 2.9.3
>


* Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
  2016-10-11 19:49 ` Yuchung Cheng
@ 2016-10-11 21:08   ` Lawrence Brakmo
  2016-10-12  1:01     ` Yuchung Cheng
From: Lawrence Brakmo @ 2016-10-11 21:08 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Kernel Team, Eric Dumazet, Neal Cardwell

Yuchung, thank you for your comments. Responses inline.

On 10/11/16, 12:49 PM, "Yuchung Cheng" <ycheng@google.com> wrote:

>On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
>>
>> The purpose of this patch is to help balance flows across paths. A new
>> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
>> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
>> A probability is used in order to control how many flows are moved
>> during a congestion event and prevent the congested path from becoming
>> underutilized (which could occur if too many flows leave the current
>> path). Txhash changes may be delayed in order to decrease the likelihood
>> that they trigger retransmits due to too much reordering.
>>
>> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
>> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
>> txhash changes. The idea is to decrease the likelihood of going back
>> to a broken path. That is, we don't want flow balancing to trigger
>> changes to broken paths. The drawback is that flow balancing does
>> not work as well. If the sysctl is greater than 0, then we always
>> do flow balancing, even after RTOs.
>>
>> Tested with packetdrill tests (for correctness) and performance
>> experiments with 2 and 3 paths. Performance experiments looked at
>> aggregate goodput and fairness. For each run, we looked at the ratio of
>> the goodputs for the fastest and slowest flows. These were averaged for
>> all the runs. A fairness of 1 means all flows had the same goodput, a
>> fairness of 2 means the fastest flow was twice as fast as the slowest
>> flow.
>>
>> The setup for the performance experiments was 4 or 5 servers in a rack,
>> 10G links. I tested various probabilities, but 20 seemed to have the
>> best tradeoff for my setup (small RTTs).
>>
>>                       --- node1 -----
>>     sender --- switch --- node2 ----- switch ---- receiver
>>                       --- node3 -----
>>
>> Scenario 1: One sender sends to one receiver through 2 routes (node1 or
>> node2). The output from node1 and node2 is 1G (1gbit/sec). With only 2
>> flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
>> 1.9G with flow balancing due to 2 flows ending up in one link and either
>> not moving or taking some time to move. Fairness was 1 in all cases.
>> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2
>> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
>> flow balancing increased fairness.
>>
>> Scenario 2: One sender to one receiver, through 3 routes (node1,
>> node2, node3). With 6 or 16 flows the goodput was the same for all, but
>> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
>> case fairness out of 10 runs was 2.2, 1.8 and 1.4 respectively. That is,
>> prob=20,mode=1 improved average and worst case fairness.
>I am wondering if we can build a better API with the routing layer to
>implement this type of feature, instead of letting the tx_rehashing
>logic creep and scatter through TCP. For example, we call
>dst_negative_advice on TCP write timeouts.

Not sure. The route is not necessarily bad; it may be temporarily
congested, or all routes may be congested. If all we want to do is
change the txhash (unlike dst_negative_advice), then calling a
tx_rehashing function may be the appropriate call.
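
For reference, the existing helpers in include/net/sock.h are tiny;
roughly (quoting from memory, the current tree may differ slightly):

    static inline void sk_set_txhash(struct sock *sk)
    {
            sk->sk_txhash = net_tx_rndhash();  /* new random flow hash */
    }

    static inline void sk_rethink_txhash(struct sock *sk)
    {
            if (sk->sk_txhash)
                    sk_set_txhash(sk);
    }

so a dedicated rehashing hook would mostly be a matter of deciding
where it gets called from.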
 
>
>On the patch itself, it seems aggressive to (attempt to) rehash every
>post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to
>identify post-RTO retransmissions directly.

Thanks, I will add the test.

>
>Is this an implementation of Flow Bender?
>http://dl.acm.org/citation.cfm?id=2674985

Part of flow bender, although there are also some similarities to flowlet
switching.

>
>>
>> Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of
>> the packets. With 7 flows, goodput was the same 1.1G, but fairness was
>> 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then
>> balancing, which does more re-routes, is less fair.
>>
>> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
>> ---
>>  Documentation/networking/ip-sysctl.txt | 15 +++++++++++++++
>>  include/linux/tcp.h                    |  4 +++-
>>  include/net/tcp.h                      |  2 ++
>>  net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++++
>>  net/ipv4/tcp_input.c                   | 10 ++++++++++
>>  net/ipv4/tcp_output.c                  | 23 ++++++++++++++++++++++-
>>  net/ipv4/tcp_timer.c                   |  4 ++++
>>  7 files changed, 74 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.txt
>>b/Documentation/networking/ip-sysctl.txt
>> index 3db8c67..87a984c 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -472,6 +472,21 @@ tcp_max_reordering - INTEGER
>>         if paths are using per packet load balancing (like bonding rr
>>mode)
>>         Default: 300
>>
>> +tcp_retrans_txhash_mode - INTEGER
>> +       If zero, disable txhash recalculation due to non-RTO
>>retransmissions
>> +       after an RTO. The idea is that broken paths will trigger an RTO
>>and
>> +       we don't want to go back to that path due to standard
>>retransmissions
>> +       (flow balancing). The drawback is that balancing is less robust.
>> +       If greater than zero, can always (probabilistically) recalculate
>> +       txhash after non-RTO retransmissions.
>> +
>> +tcp_retrans_txhash_prob - INTEGER
>> +       Probability [0 to 100] that we will recalculate txhash when a
>> +       packet is resent not due to RTO (for RTO txhash is always
>>recalculated).
>> +       The recalculation of the txhash may be delayed to decrease the
>> +       likelihood that reordering will trigger retransmissons.
>> +       The purpose is to help balance the flows among the possible
>>paths.
>> +
>>  tcp_retrans_collapse - BOOLEAN
>>         Bug-to-bug compatibility with some broken printers.
>>         On retransmit try to send bigger packets to work around bugs in
>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> index a17ae7b..e0e3b7d 100644
>> --- a/include/linux/tcp.h
>> +++ b/include/linux/tcp.h
>> @@ -214,7 +214,9 @@ struct tcp_sock {
>>         } rack;
>>         u16     advmss;         /* Advertised MSS
>>*/
>>         u8      rate_app_limited:1,  /* rate_{delivered,interval_us}
>>limited? */
>> -               unused:7;
>> +               txhash_rto:1,   /* If set, don't do flow balancing
>>*/
>> +               txhash_want:1,  /* We want to change txhash when safe
>>*/
>> +               unused:5;
>>         u8      nonagle     : 4,/* Disable Nagle algorithm?
>>*/
>>                 thin_lto    : 1,/* Use linear timeouts for thin streams
>>*/
>>                 thin_dupack : 1,/* Fast retransmit on first dupack
>>*/
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index f83b7f2..3abd304 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -271,6 +271,8 @@ extern int sysctl_tcp_autocorking;
>>  extern int sysctl_tcp_invalid_ratelimit;
>>  extern int sysctl_tcp_pacing_ss_ratio;
>>  extern int sysctl_tcp_pacing_ca_ratio;
>> +extern int sysctl_tcp_retrans_txhash_prob;
>> +extern int sysctl_tcp_retrans_txhash_mode;
>>
>>  extern atomic_long_t tcp_memory_allocated;
>>  extern struct percpu_counter tcp_sockets_allocated;
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index 1cb67de..00d6f26 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -28,6 +28,7 @@
>>  static int zero;
>>  static int one = 1;
>>  static int four = 4;
>> +static int hundred = 100;
>>  static int thousand = 1000;
>>  static int gso_max_segs = GSO_MAX_SEGS;
>>  static int tcp_retr1_max = 255;
>> @@ -624,6 +625,23 @@ static struct ctl_table ipv4_table[] = {
>>                 .proc_handler   = proc_dointvec_ms_jiffies,
>>         },
>>         {
>> +               .procname       = "tcp_retrans_txhash_prob",
>> +               .data           = &sysctl_tcp_retrans_txhash_prob,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec_minmax,
>> +               .extra1         = &zero,
>> +               .extra2         = &hundred,
>> +       },
>> +       {
>> +               .procname       = "tcp_retrans_txhash_mode",
>> +               .data           = &sysctl_tcp_retrans_txhash_mode,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec_minmax,
>> +               .extra1         = &zero,
>> +       },
>> +       {
>>                 .procname       = "icmp_msgs_per_sec",
>>                 .data           = &sysctl_icmp_msgs_per_sec,
>>                 .maxlen         = sizeof(int),
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index a27b9c0..fed5366 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -101,6 +101,9 @@ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>>  int sysctl_tcp_early_retrans __read_mostly = 3;
>>  int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
>>
>> +int sysctl_tcp_retrans_txhash_prob __read_mostly;
>> +int sysctl_tcp_retrans_txhash_mode __read_mostly;
>> +
>>  #define FLAG_DATA              0x01 /* Incoming frame contained data.
>>        */
>>  #define FLAG_WIN_UPDATE                0x02 /* Incoming ACK was a
>>window update.       */
>>  #define FLAG_DATA_ACKED                0x04 /* This ACK acknowledged
>>new data.         */
>> @@ -3674,6 +3677,13 @@ static int tcp_ack(struct sock *sk, const struct
>>sk_buff *skb, int flag)
>>         flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una,
>>&acked,
>>                                     &sack_state, &now);
>>
>> +       /* Check if we should set txhash (would not cause reordering) */
>> +       if (tp->txhash_want &&
>> +           (tp->packets_out - tp->sacked_out) < tp->reordering) {
>> +               sk_set_txhash(sk);
>> +               tp->txhash_want = 0;
>> +       }
>> +
>>         if (tcp_ack_is_dubious(sk, flag)) {
>>                 is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED |
>>FLAG_NOT_DUP));
>>                 tcp_fastretrans_alert(sk, acked, is_dupack, &flag,
>>&rexmit);
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index 896e9df..58490ac 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -2738,9 +2738,30 @@ int tcp_retransmit_skb(struct sock *sk, struct
>>sk_buff *skb, int segs)
>>                 tp->retrans_out += tcp_skb_pcount(skb);
>>
>>                 /* Save stamp of the first retransmit. */
>> -               if (!tp->retrans_stamp)
>> +               if (!tp->retrans_stamp) {
>>                         tp->retrans_stamp = tcp_skb_timestamp(skb);
>>
>> +                       /* Determine if we should reset hash, only done
>>once
>> +                        * per recovery
>> +                        */
>> +                       if ((!tp->txhash_rto ||
>> +                            sysctl_tcp_retrans_txhash_mode > 0) &&
>> +                           sk->sk_txhash &&
>> +                           (prandom_u32_max(100) <
>> +                            sysctl_tcp_retrans_txhash_prob)) {
>> +                               /* If not too much reordering, or RTT is
>> +                                * small enough that we don't care about
>> +                                * reordering, then change it now.
>> +                                * Else, wait until it is safe.
>> +                                */
>> +                               if ((tp->packets_out - tp->sacked_out) <
>> +                                   tp->reordering)
>I can't parse this logic ... suppose reordering is 100 (not uncommon
>today due to the last packet being delivered slightly earlier than the
>rest) and cwnd == packets_out =~ 200; we only rehash once half of the
>packets are sacked, so we are still rehashing even when reordering is
>heavy?

In your scenario, there would be no re-hashing until sacked_out is 101
((packets_out - sacked_out) < 100). This code would mark txhash_want.
Then when an ACK is received and the conditional is true, txhash
would be changed.
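
Spelling out the arithmetic for that scenario (packets_out = 200,
reordering = 100, and the test is (packets_out - sacked_out) <
reordering):

    sacked_out = 100:  200 - 100 = 100  -> not < 100, keep waiting
    sacked_out = 101:  200 - 101 =  99  -> < 100, txhash is changed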

Now, the test does not prevent retransmissions due to reordering in all
cases, but hopefully in most. I will also add the test you recommended,
checking for CA_Loss, to prevent too much re-hashing.

>
>Also, where do we check that RTT is small?

The RTT comment is left over from a previous version; I will remove it.

>
>> +                                       sk_set_txhash(sk);
>> +                               else
>> +                                       tp->txhash_want = 1;
>> +                       }
>> +               }
>> +
>>         } else if (err != -EBUSY) {
>>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
>>         }
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 3ea1cf8..e66baad 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -186,6 +186,8 @@ static int tcp_write_timeout(struct sock *sk)
>>
>>         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>                 if (icsk->icsk_retransmits) {
>> +                       tp->txhash_rto = 1;
>> +                       tp->txhash_want = 0;
>>                         dst_negative_advice(sk);
>>                         if (tp->syn_fastopen || tp->syn_data)
>>                                 tcp_fastopen_cache_set(sk, 0, NULL,
>>true, 0);
>> @@ -218,6 +220,8 @@ static int tcp_write_timeout(struct sock *sk)
>>                 } else {
>>                         sk_rethink_txhash(sk);
>>                 }
>> +               tp->txhash_rto = 1;
>> +               tp->txhash_want = 0;
>>
>>                 retry_until = net->ipv4.sysctl_tcp_retries2;
>>                 if (sock_flag(sk, SOCK_DEAD)) {
>> --
>> 2.9.3
>>


* Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
  2016-10-11 21:08   ` Lawrence Brakmo
@ 2016-10-12  1:01     ` Yuchung Cheng
  2016-10-12  3:56       ` Yuchung Cheng
From: Yuchung Cheng @ 2016-10-12  1:01 UTC (permalink / raw)
  To: Lawrence Brakmo; +Cc: netdev, Kernel Team, Eric Dumazet, Neal Cardwell

On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
> Yuchung, thank you for your comments. Responses inline.
>
> On 10/11/16, 12:49 PM, "Yuchung Cheng" <ycheng@google.com> wrote:
>
>>On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
>>>
>>> The purpose of this patch is to help balance flows across paths. A new
>>> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
>>> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
>>> A probability is used in order to control how many flows are moved
>>> during a congestion event and prevent the congested path from becoming
>>> underutilized (which could occur if too many flows leave the current
>>> path). Txhash changes may be delayed in order to decrease the likelihood
>>> that they trigger retransmits due to too much reordering.
>>>
>>> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
>>> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
>>> txhash changes. The idea is to decrease the likelihood of going back
>>> to a broken path. That is, we don't want flow balancing to trigger
>>> changes to broken paths. The drawback is that flow balancing does
>>> not work as well. If the sysctl is greater than 0, then we always
>>> do flow balancing, even after RTOs.
>>>
>>> Tested with packetdrill tests (for correctness) and performance
>>> experiments with 2 and 3 paths. Performance experiments looked at
>>> aggregate goodput and fairness. For each run, we looked at the ratio of
>>> the goodputs for the fastest and slowest flows. These were averaged for
>>> all the runs. A fairness of 1 means all flows had the same goodput, a
>>> fairness of 2 means the fastest flow was twice as fast as the slowest
>>> flow.
>>>
>>> The setup for the performance experiments was 4 or 5 servers in a rack,
>>> 10G links. I tested various probabilities, but 20 seemed to have the
>>> best tradeoff for my setup (small RTTs).
>>>
>>>                       --- node1 -----
>>>     sender --- switch --- node2 ----- switch ---- receiver
>>>                       --- node3 -----
>>>
>>> Scenario 1: One sender sends to one receiver through 2 routes (node1 or
>>> node2). The output from node1 and node2 is 1G (1gbit/sec). With only 2
>>> flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
>>> 1.9G with flow balancing due to 2 flows ending up in one link and either
>>> not moving or taking some time to move. Fairness was 1 in all cases.
>>> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2
>>> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
>>> flow balancing increased fairness.
>>>
>>> Scenario 2: One sender to one receiver, through 3 routes (node1,
>>> node2, node3). With 6 or 16 flows the goodput was the same for all, but
>>> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
>>> case fairness out of 10 runs was 2.2, 1.8 and 1.4 respectively. That is,
>>> prob=20,mode=1 improved average and worst case fairness.
>>I am wondering if we can build a better API with the routing layer to
>>implement this type of feature, instead of letting the tx_rehashing
>>logic creep and scatter through TCP. For example, we call
>>dst_negative_advice on TCP write timeouts.
>
>> Not sure. The route is not necessarily bad; it may be temporarily
>> congested, or all routes may be congested. If all we want to do is
>> change the txhash (unlike dst_negative_advice), then calling a
>> tx_rehashing function may be the appropriate call.
>
>>
>>On the patch itself, it seems aggressive to (attempt to) rehash every
>>post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to
>>identify post-RTO retransmissions directly.
>
> Thanks, I will add the test.
>
>>
>>Is this an implementation of Flow Bender?
>>http://dl.acm.org/citation.cfm?id=2674985
>
> Part of flow bender, although there are also some similarities to flowlet
> switching.
>
>>
>>>
>>> Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of
>>> the packets. With 7 flows, goodput was the same 1.1G, but fairness was
>>> 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then
>>> balancing, which does more re-routes, is less fair.
>>>
>>> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
>>> ---
>>>  Documentation/networking/ip-sysctl.txt | 15 +++++++++++++++
>>>  include/linux/tcp.h                    |  4 +++-
>>>  include/net/tcp.h                      |  2 ++
>>>  net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++++
>>>  net/ipv4/tcp_input.c                   | 10 ++++++++++
>>>  net/ipv4/tcp_output.c                  | 23 ++++++++++++++++++++++-
>>>  net/ipv4/tcp_timer.c                   |  4 ++++
>>>  7 files changed, 74 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/networking/ip-sysctl.txt
>>>b/Documentation/networking/ip-sysctl.txt
>>> index 3db8c67..87a984c 100644
>>> --- a/Documentation/networking/ip-sysctl.txt
>>> +++ b/Documentation/networking/ip-sysctl.txt
>>> @@ -472,6 +472,21 @@ tcp_max_reordering - INTEGER
>>>         if paths are using per packet load balancing (like bonding rr
>>>mode)
>>>         Default: 300
>>>
>>> +tcp_retrans_txhash_mode - INTEGER
>>> +       If zero, disable txhash recalculation due to non-RTO
>>>retransmissions
>>> +       after an RTO. The idea is that broken paths will trigger an RTO
>>>and
>>> +       we don't want to go back to that path due to standard
>>>retransmissions
>>> +       (flow balancing). The drawback is that balancing is less robust.
>>> +       If greater than zero, can always (probabilistically) recalculate
>>> +       txhash after non-RTO retransmissions.
>>> +
>>> +tcp_retrans_txhash_prob - INTEGER
>>> +       Probability [0 to 100] that we will recalculate txhash when a
>>> +       packet is resent not due to RTO (for RTO txhash is always
>>>recalculated).
>>> +       The recalculation of the txhash may be delayed to decrease the
>>> +       likelihood that reordering will trigger retransmissions.
>>> +       The purpose is to help balance the flows among the possible
>>>paths.
>>> +
>>>  tcp_retrans_collapse - BOOLEAN
>>>         Bug-to-bug compatibility with some broken printers.
>>>         On retransmit try to send bigger packets to work around bugs in
>>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>>> index a17ae7b..e0e3b7d 100644
>>> --- a/include/linux/tcp.h
>>> +++ b/include/linux/tcp.h
>>> @@ -214,7 +214,9 @@ struct tcp_sock {
>>>         } rack;
>>>         u16     advmss;         /* Advertised MSS
>>>*/
>>>         u8      rate_app_limited:1,  /* rate_{delivered,interval_us}
>>>limited? */
>>> -               unused:7;
>>> +               txhash_rto:1,   /* If set, don't do flow balancing
>>>*/
>>> +               txhash_want:1,  /* We want to change txhash when safe
>>>*/
>>> +               unused:5;
>>>         u8      nonagle     : 4,/* Disable Nagle algorithm?
>>>*/
>>>                 thin_lto    : 1,/* Use linear timeouts for thin streams
>>>*/
>>>                 thin_dupack : 1,/* Fast retransmit on first dupack
>>>*/
>>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>>> index f83b7f2..3abd304 100644
>>> --- a/include/net/tcp.h
>>> +++ b/include/net/tcp.h
>>> @@ -271,6 +271,8 @@ extern int sysctl_tcp_autocorking;
>>>  extern int sysctl_tcp_invalid_ratelimit;
>>>  extern int sysctl_tcp_pacing_ss_ratio;
>>>  extern int sysctl_tcp_pacing_ca_ratio;
>>> +extern int sysctl_tcp_retrans_txhash_prob;
>>> +extern int sysctl_tcp_retrans_txhash_mode;
>>>
>>>  extern atomic_long_t tcp_memory_allocated;
>>>  extern struct percpu_counter tcp_sockets_allocated;
>>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>>> index 1cb67de..00d6f26 100644
>>> --- a/net/ipv4/sysctl_net_ipv4.c
>>> +++ b/net/ipv4/sysctl_net_ipv4.c
>>> @@ -28,6 +28,7 @@
>>>  static int zero;
>>>  static int one = 1;
>>>  static int four = 4;
>>> +static int hundred = 100;
>>>  static int thousand = 1000;
>>>  static int gso_max_segs = GSO_MAX_SEGS;
>>>  static int tcp_retr1_max = 255;
>>> @@ -624,6 +625,23 @@ static struct ctl_table ipv4_table[] = {
>>>                 .proc_handler   = proc_dointvec_ms_jiffies,
>>>         },
>>>         {
>>> +               .procname       = "tcp_retrans_txhash_prob",
>>> +               .data           = &sysctl_tcp_retrans_txhash_prob,
>>> +               .maxlen         = sizeof(int),
>>> +               .mode           = 0644,
>>> +               .proc_handler   = proc_dointvec_minmax,
>>> +               .extra1         = &zero,
>>> +               .extra2         = &hundred,
>>> +       },
>>> +       {
>>> +               .procname       = "tcp_retrans_txhash_mode",
>>> +               .data           = &sysctl_tcp_retrans_txhash_mode,
>>> +               .maxlen         = sizeof(int),
>>> +               .mode           = 0644,
>>> +               .proc_handler   = proc_dointvec_minmax,
>>> +               .extra1         = &zero,
>>> +       },
>>> +       {
>>>                 .procname       = "icmp_msgs_per_sec",
>>>                 .data           = &sysctl_icmp_msgs_per_sec,
>>>                 .maxlen         = sizeof(int),
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index a27b9c0..fed5366 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -101,6 +101,9 @@ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>>>  int sysctl_tcp_early_retrans __read_mostly = 3;
>>>  int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
>>>
>>> +int sysctl_tcp_retrans_txhash_prob __read_mostly;
>>> +int sysctl_tcp_retrans_txhash_mode __read_mostly;
>>> +
>>>  #define FLAG_DATA              0x01 /* Incoming frame contained data.
>>>        */
>>>  #define FLAG_WIN_UPDATE                0x02 /* Incoming ACK was a
>>>window update.       */
>>>  #define FLAG_DATA_ACKED                0x04 /* This ACK acknowledged
>>>new data.         */
>>> @@ -3674,6 +3677,13 @@ static int tcp_ack(struct sock *sk, const struct
>>>sk_buff *skb, int flag)
>>>         flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una,
>>>&acked,
>>>                                     &sack_state, &now);
>>>
>>> +       /* Check if we should set txhash (would not cause reordering) */
>>> +       if (tp->txhash_want &&
>>> +           (tp->packets_out - tp->sacked_out) < tp->reordering) {
>>> +               sk_set_txhash(sk);
>>> +               tp->txhash_want = 0;
>>> +       }
>>> +
>>>         if (tcp_ack_is_dubious(sk, flag)) {
>>>                 is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED |
>>>FLAG_NOT_DUP));
>>>                 tcp_fastretrans_alert(sk, acked, is_dupack, &flag,
>>>&rexmit);
>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> index 896e9df..58490ac 100644
>>> --- a/net/ipv4/tcp_output.c
>>> +++ b/net/ipv4/tcp_output.c
>>> @@ -2738,9 +2738,30 @@ int tcp_retransmit_skb(struct sock *sk, struct
>>>sk_buff *skb, int segs)
>>>                 tp->retrans_out += tcp_skb_pcount(skb);
>>>
>>>                 /* Save stamp of the first retransmit. */
>>> -               if (!tp->retrans_stamp)
>>> +               if (!tp->retrans_stamp) {
>>>                         tp->retrans_stamp = tcp_skb_timestamp(skb);
>>>
>>> +                       /* Determine if we should reset hash, only done
>>>once
>>> +                        * per recovery
>>> +                        */
>>> +                       if ((!tp->txhash_rto ||
>>> +                            sysctl_tcp_retrans_txhash_mode > 0) &&
>>> +                           sk->sk_txhash &&
>>> +                           (prandom_u32_max(100) <
>>> +                            sysctl_tcp_retrans_txhash_prob)) {
>>> +                               /* If not too much reordering, or RTT is
>>> +                                * small enough that we don't care about
>>> +                                * reordering, then change it now.
>>> +                                * Else, wait until it is safe.
>>> +                                */
>>> +                               if ((tp->packets_out - tp->sacked_out) <
>>> +                                   tp->reordering)
>>I can't parse this logic ... suppose reordering is 100 (not uncommon
>>today due to the last packet being delivered slightly earlier than the
>>rest) and cwnd == packets_out =~ 200; we only rehash once half of the
>>packets are sacked, so we are still rehashing even when reordering is
>>heavy?
>
> In your scenario, there would be no re-hashing until sacked_out is 101
> ((packets_out - sacked_out) < 100). This code would mark txhash_want.
> Then when an ACK is received and the conditional is true, txhash
> would be changed.
>
> Now, the test does not prevent retransmissions due to reordering in all
> cases, but hopefully in most. I will also add the test you recommended,
> checking for CA_Loss, to prevent too much re-hashing.
If the whole point is to rehash at most once between recovery events,
why do we need this complicated change at the per-packet level in
tcp_retransmit_skb()? There are functions that start and end recovery
(tcp_enter_recovery, tcp_end_cwnd_reduction). Our retransmission logic
is already very complicated.

I still don't understand the incentive of starting the rehashing
halfway through retransmitting, depending on the sacking and reordering
status, for temporary congestion as you've mentioned.

Is this feature unique to intra-DC connections?

>
>>
>>Also, where do we check that RTT is small?
>
> The RTT comment is left over from a previous version; I will remove it.
>
>>
>>> +                                       sk_set_txhash(sk);
>>> +                               else
>>> +                                       tp->txhash_want = 1;
>>> +                       }
>>> +               }
>>> +
>>>         } else if (err != -EBUSY) {
>>>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
>>>         }
>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>> index 3ea1cf8..e66baad 100644
>>> --- a/net/ipv4/tcp_timer.c
>>> +++ b/net/ipv4/tcp_timer.c
>>> @@ -186,6 +186,8 @@ static int tcp_write_timeout(struct sock *sk)
>>>
>>>         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>>                 if (icsk->icsk_retransmits) {
>>> +                       tp->txhash_rto = 1;
>>> +                       tp->txhash_want = 0;
>>>                         dst_negative_advice(sk);
>>>                         if (tp->syn_fastopen || tp->syn_data)
>>>                                 tcp_fastopen_cache_set(sk, 0, NULL,
>>>true, 0);
>>> @@ -218,6 +220,8 @@ static int tcp_write_timeout(struct sock *sk)
>>>                 } else {
>>>                         sk_rethink_txhash(sk);
>>>                 }
>>> +               tp->txhash_rto = 1;
>>> +               tp->txhash_want = 0;
>>>
>>>                 retry_until = net->ipv4.sysctl_tcp_retries2;
>>>                 if (sock_flag(sk, SOCK_DEAD)) {
>>> --
>>> 2.9.3
>>>
>


* Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
  2016-10-12  1:01     ` Yuchung Cheng
@ 2016-10-12  3:56       ` Yuchung Cheng
  2016-10-12 15:22         ` Eric Dumazet
  2016-10-18  3:35         ` Lawrence Brakmo
From: Yuchung Cheng @ 2016-10-12  3:56 UTC (permalink / raw)
  To: Lawrence Brakmo; +Cc: netdev, Kernel Team, Eric Dumazet, Neal Cardwell

On Tue, Oct 11, 2016 at 6:01 PM, Yuchung Cheng <ycheng@google.com> wrote:
> On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
>> Yuchung, thank you for your comments. Responses inline.
>>
>> On 10/11/16, 12:49 PM, "Yuchung Cheng" <ycheng@google.com> wrote:
>>
>>>On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
>>>>
>>>> The purpose of this patch is to help balance flows across paths. A new
>>>> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
>>>> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
>>>> A probability is used in order to control how many flows are moved
>>>> during a congestion event and prevent the congested path from becoming
>>>> underutilized (which could occur if too many flows leave the current
>>>> path). Txhash changes may be delayed in order to decrease the likelihood
>>>> that they trigger retransmits due to too much reordering.
>>>>
>>>> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
>>>> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
>>>> txhash changes. The idea is to decrease the likelihood of going back
>>>> to a broken path. That is, we don't want flow balancing to trigger
>>>> changes to broken paths. The drawback is that flow balancing does
>>>> not work as well. If the sysctl is greater than 0, then we always
>>>> do flow balancing, even after RTOs.
>>>>
>>>> Tested with packetdrill tests (for correctness) and performance
>>>> experiments with 2 and 3 paths. Performance experiments looked at
>>>> aggregate goodput and fairness. For each run, we looked at the ratio of
>>>> the goodputs for the fastest and slowest flows. These were averaged for
>>>> all the runs. A fairness of 1 means all flows had the same goodput, a
>>>> fairness of 2 means the fastest flow was twice as fast as the slowest
>>>> flow.
>>>>
>>>> The setup for the performance experiments was 4 or 5 servers in a rack,
>>>> 10G links. I tested various probabilities, but 20 seemed to have the
>>>> best tradeoff for my setup (small RTTs).
>>>>
>>>>                       --- node1 -----
>>>>     sender --- switch --- node2 ----- switch ---- receiver
>>>>                       --- node3 -----
>>>>
>>>> Scenario 1: One sender sends to one receiver through 2 routes (node1 or
>>>> node2). The output from node1 and node2 is 1G (1gbit/sec). With only 2
>>>> flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
>>>> 1.9G with flow balancing due to 2 flows ending up in one link and either
>>>> not moving or taking some time to move. Fairness was 1 in all cases.
>>>> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2
>>>> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
>>>> flow balancing increased fairness.
>>>>
>>>> Scenario 2: One sender to one receiver, through 3 routes (node1, node2,
>>>> node3). With 6 or 16 flows the goodput was the same for all, but
>>>> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
>>>> case fairness out of 10 runs was 2.2, 1.8 and 1.4 respectively. That is,
>>>> prob=20,mode=1 improved average and worst case fairness.
>>>I am wondering if we can build a better API with the routing layer to
>>>implement this type of feature, instead of scattering the tx_rehashing
>>>logic across TCP. For example, we call dst_negative_advice on TCP
>>>write timeouts.
>>
>> Not sure. The route is not necessarily bad; it may be temporarily congested,
>> or they may all be congested. If all we want to do is change the txhash
>> (unlike dst_negative_advice), then calling a tx_rehashing function may
>> be the appropriate call.
>>
>>>
>>>On the patch itself, it seems aggressive to (attempt to) rehash every
>>>post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to
>>>identify post-RTO retransmission directly.
>>
>> Thanks, I will add the test.
>>
>>>
>>>is this an implementation of Flow Bender?
>>>http://dl.acm.org/citation.cfm?id=2674985
>>
>> Part of Flow Bender, although there are also some similarities to flowlet
>> switching.
>>
>>>
>>>>
>>>> Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of
>>>> the packets. With 7 flows, goodput was the same 1.1G, but fairness was
>>>> 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then
>>>> balancing, which does more re-routes, is less fair.
>>>>
>>>> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
>>>> ---
>>>>  Documentation/networking/ip-sysctl.txt | 15 +++++++++++++++
>>>>  include/linux/tcp.h                    |  4 +++-
>>>>  include/net/tcp.h                      |  2 ++
>>>>  net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++++
>>>>  net/ipv4/tcp_input.c                   | 10 ++++++++++
>>>>  net/ipv4/tcp_output.c                  | 23 ++++++++++++++++++++++-
>>>>  net/ipv4/tcp_timer.c                   |  4 ++++
>>>>  7 files changed, 74 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/networking/ip-sysctl.txt
>>>>b/Documentation/networking/ip-sysctl.txt
>>>> index 3db8c67..87a984c 100644
>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>> @@ -472,6 +472,21 @@ tcp_max_reordering - INTEGER
>>>>         if paths are using per packet load balancing (like bonding rr
>>>>mode)
>>>>         Default: 300
>>>>
>>>> +tcp_retrans_txhash_mode - INTEGER
>>>> +       If zero, disable txhash recalculation due to non-RTO
>>>>retransmissions
>>>> +       after an RTO. The idea is that broken paths will trigger an RTO
>>>>and
>>>> +       we don't want to go back to that path due to standard
>>>>retransmissions
>>>> +       (flow balancing). The drawback is that balancing is less robust.
>>>> +       If greater than zero, can always (probabilistically) recalculate
>>>> +       txhash after non-RTO retransmissions.
>>>> +
>>>> +tcp_retrans_txhash_prob - INTEGER
>>>> +       Probability [0 to 100] that we will recalculate txhash when a
>>>> +       packet is resent not due to RTO (for RTO txhash is always
>>>>recalculated).
>>>> +       The recalculation of the txhash may be delayed to decrease the
>>>> +       likelihood that reordering will trigger retransmissions.
>>>> +       The purpose is to help balance the flows among the possible
>>>>paths.
>>>> +
>>>>  tcp_retrans_collapse - BOOLEAN
>>>>         Bug-to-bug compatibility with some broken printers.
>>>>         On retransmit try to send bigger packets to work around bugs in
>>>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>>>> index a17ae7b..e0e3b7d 100644
>>>> --- a/include/linux/tcp.h
>>>> +++ b/include/linux/tcp.h
>>>> @@ -214,7 +214,9 @@ struct tcp_sock {
>>>>         } rack;
>>>>         u16     advmss;         /* Advertised MSS
>>>>*/
>>>>         u8      rate_app_limited:1,  /* rate_{delivered,interval_us}
>>>>limited? */
>>>> -               unused:7;
>>>> +               txhash_rto:1,   /* If set, don't do flow balancing
>>>>*/
>>>> +               txhash_want:1,  /* We want to change txhash when safe
>>>>*/
>>>> +               unused:5;
>>>>         u8      nonagle     : 4,/* Disable Nagle algorithm?
>>>>*/
>>>>                 thin_lto    : 1,/* Use linear timeouts for thin streams
>>>>*/
>>>>                 thin_dupack : 1,/* Fast retransmit on first dupack
>>>>*/
>>>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>>>> index f83b7f2..3abd304 100644
>>>> --- a/include/net/tcp.h
>>>> +++ b/include/net/tcp.h
>>>> @@ -271,6 +271,8 @@ extern int sysctl_tcp_autocorking;
>>>>  extern int sysctl_tcp_invalid_ratelimit;
>>>>  extern int sysctl_tcp_pacing_ss_ratio;
>>>>  extern int sysctl_tcp_pacing_ca_ratio;
>>>> +extern int sysctl_tcp_retrans_txhash_prob;
>>>> +extern int sysctl_tcp_retrans_txhash_mode;
>>>>
>>>>  extern atomic_long_t tcp_memory_allocated;
>>>>  extern struct percpu_counter tcp_sockets_allocated;
>>>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>>>> index 1cb67de..00d6f26 100644
>>>> --- a/net/ipv4/sysctl_net_ipv4.c
>>>> +++ b/net/ipv4/sysctl_net_ipv4.c
>>>> @@ -28,6 +28,7 @@
>>>>  static int zero;
>>>>  static int one = 1;
>>>>  static int four = 4;
>>>> +static int hundred = 100;
>>>>  static int thousand = 1000;
>>>>  static int gso_max_segs = GSO_MAX_SEGS;
>>>>  static int tcp_retr1_max = 255;
>>>> @@ -624,6 +625,23 @@ static struct ctl_table ipv4_table[] = {
>>>>                 .proc_handler   = proc_dointvec_ms_jiffies,
>>>>         },
>>>>         {
>>>> +               .procname       = "tcp_retrans_txhash_prob",
>>>> +               .data           = &sysctl_tcp_retrans_txhash_prob,
>>>> +               .maxlen         = sizeof(int),
>>>> +               .mode           = 0644,
>>>> +               .proc_handler   = proc_dointvec_minmax,
>>>> +               .extra1         = &zero,
>>>> +               .extra2         = &hundred,
>>>> +       },
>>>> +       {
>>>> +               .procname       = "tcp_retrans_txhash_mode",
>>>> +               .data           = &sysctl_tcp_retrans_txhash_mode,
>>>> +               .maxlen         = sizeof(int),
>>>> +               .mode           = 0644,
>>>> +               .proc_handler   = proc_dointvec_minmax,
>>>> +               .extra1         = &zero,
>>>> +       },
>>>> +       {
>>>>                 .procname       = "icmp_msgs_per_sec",
>>>>                 .data           = &sysctl_icmp_msgs_per_sec,
>>>>                 .maxlen         = sizeof(int),
>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>> index a27b9c0..fed5366 100644
>>>> --- a/net/ipv4/tcp_input.c
>>>> +++ b/net/ipv4/tcp_input.c
>>>> @@ -101,6 +101,9 @@ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>>>>  int sysctl_tcp_early_retrans __read_mostly = 3;
>>>>  int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
>>>>
>>>> +int sysctl_tcp_retrans_txhash_prob __read_mostly;
>>>> +int sysctl_tcp_retrans_txhash_mode __read_mostly;
>>>> +
>>>>  #define FLAG_DATA              0x01 /* Incoming frame contained data.
>>>>        */
>>>>  #define FLAG_WIN_UPDATE                0x02 /* Incoming ACK was a
>>>>window update.       */
>>>>  #define FLAG_DATA_ACKED                0x04 /* This ACK acknowledged
>>>>new data.         */
>>>> @@ -3674,6 +3677,13 @@ static int tcp_ack(struct sock *sk, const struct
>>>>sk_buff *skb, int flag)
>>>>         flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una,
>>>>&acked,
>>>>                                     &sack_state, &now);
>>>>
>>>> +       /* Check if we should set txhash (would not cause reordering) */
>>>> +       if (tp->txhash_want &&
>>>> +           (tp->packets_out - tp->sacked_out) < tp->reordering) {
>>>> +               sk_set_txhash(sk);
>>>> +               tp->txhash_want = 0;
>>>> +       }
>>>> +
>>>>         if (tcp_ack_is_dubious(sk, flag)) {
>>>>                 is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED |
>>>>FLAG_NOT_DUP));
>>>>                 tcp_fastretrans_alert(sk, acked, is_dupack, &flag,
>>>>&rexmit);
>>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>>> index 896e9df..58490ac 100644
>>>> --- a/net/ipv4/tcp_output.c
>>>> +++ b/net/ipv4/tcp_output.c
>>>> @@ -2738,9 +2738,30 @@ int tcp_retransmit_skb(struct sock *sk, struct
>>>>sk_buff *skb, int segs)
>>>>                 tp->retrans_out += tcp_skb_pcount(skb);
>>>>
>>>>                 /* Save stamp of the first retransmit. */
>>>> -               if (!tp->retrans_stamp)
>>>> +               if (!tp->retrans_stamp) {
>>>>                         tp->retrans_stamp = tcp_skb_timestamp(skb);
>>>>
>>>> +                       /* Determine if we should reset hash, only done
>>>>once
>>>> +                        * per recovery
>>>> +                        */
>>>> +                       if ((!tp->txhash_rto ||
>>>> +                            sysctl_tcp_retrans_txhash_mode > 0) &&
>>>> +                           sk->sk_txhash &&
>>>> +                           (prandom_u32_max(100) <
>>>> +                            sysctl_tcp_retrans_txhash_prob)) {
>>>> +                               /* If not too much reordering, or RTT is
>>>> +                                * small enough that we don't care about
>>>> +                                * reordering, then change it now.
>>>> +                                * Else, wait until it is safe.
>>>> +                                */
>>>> +                               if ((tp->packets_out - tp->sacked_out) <
>>>> +                                   tp->reordering)
>>>I don't parse this logic ... suppose reordering is 100 (not uncommon
>>>today due to the last packet being delivered slightly earlier than the
>>>rest), and cwnd == packets_out =~ 200, we won't rehash until half
>>>of the packets are sacked, so we are still rehashing even when
>>>reordering is heavy?
>>
>> In your scenario, there would be no re-hashing until sacked_out is 101
>> ((packets_out - sacked_out) < 100). This code would mark txhash_want.
>> Then when an ACK is received and the conditional is true, txhash
>> would be changed.
>>
>> Now, the test does not prevent retransmissions due to reordering in all
>> cases, but hopefully in most. I will also add the test you recommended,
>> checking for CA_Loss, to prevent too much re-hashing.
> If the whole point is to rehash at most once between recovery events,
> why do we need this complicated change at the per-packet level in
> tcp_retransmit_skb()? There are functions that start and end recovery
> (tcp_enter_recovery, tcp_end_cwnd_reduction). Our retransmission logic
> is already very complicated.
>
> I still don't understand the incentive of starting the rehashing
> halfway through retransmitting depending on the sacking and reordering
> status, for temporary congestion as you've mentioned.
>
> is this feature unique for intra-DC connections?
I thought more about this patch on my way home and have more
questions: why do we exclude RTO retransmissions specifically? Also,
when we rehash, we'll introduce reordering either in recovery or after
recovery, as some TCP CCs like BBR would continue sending regardless,
so starting in tcp_ack() with tp->txhash_want does not really prevent
causing more reordering.

So if we want to rehash upon a recovery event (fast or timeout), then
we can make a helper function, e.g. tcp_retry_alt_path(), and just call
it in tcp_enter_loss() and tcp_enter_recovery() (which are called once
per recovery). Since it'll cause reordering anyway, I doubt we need to
check tp->reordering at all. If packets already traverse different
paths, rehashing can make it better or worse. If packets weren't
traversing different paths, the rehashing itself can introduce
reordering. So in the end there is not much benefit to making the
decision based on current inflight and reordering, IMO...
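
Roughly what I have in mind (just a sketch; the call sites and keeping
only the probability gate are assumptions, not a tested patch):

    /* Called once per recovery event (fast recovery or RTO), e.g. from
     * tcp_enter_recovery() and tcp_enter_loss(), instead of per-skb
     * logic in tcp_retransmit_skb().
     */
    static void tcp_retry_alt_path(struct sock *sk)
    {
            /* Move only a fraction of flows, so a congested path does
             * not empty out all at once.
             */
            if (sk->sk_txhash &&
                prandom_u32_max(100) < sysctl_tcp_retrans_txhash_prob)
                    sk_set_txhash(sk);  /* pick a new txhash/flow label */
    }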


>
>>
>>>
>>>also where do we check RTT is small?
>>
>>> The RTT comment is left over from a previous version; I will remove it.
>>
>>>
>>>> +                                       sk_set_txhash(sk);
>>>> +                               else
>>>> +                                       tp->txhash_want = 1;
>>>> +                       }
>>>> +               }
>>>> +
>>>>         } else if (err != -EBUSY) {
>>>>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
>>>>         }
>>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>>> index 3ea1cf8..e66baad 100644
>>>> --- a/net/ipv4/tcp_timer.c
>>>> +++ b/net/ipv4/tcp_timer.c
>>>> @@ -186,6 +186,8 @@ static int tcp_write_timeout(struct sock *sk)
>>>>
>>>>         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>>>                 if (icsk->icsk_retransmits) {
>>>> +                       tp->txhash_rto = 1;
>>>> +                       tp->txhash_want = 0;
>>>>                         dst_negative_advice(sk);
>>>>                         if (tp->syn_fastopen || tp->syn_data)
>>>>                                 tcp_fastopen_cache_set(sk, 0, NULL,
>>>>true, 0);
>>>> @@ -218,6 +220,8 @@ static int tcp_write_timeout(struct sock *sk)
>>>>                 } else {
>>>>                         sk_rethink_txhash(sk);
>>>>                 }
>>>> +               tp->txhash_rto = 1;
>>>> +               tp->txhash_want = 0;
>>>>
>>>>                 retry_until = net->ipv4.sysctl_tcp_retries2;
>>>>                 if (sock_flag(sk, SOCK_DEAD)) {
>>>> --
>>>> 2.9.3
>>>>
>>


* Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
  2016-10-12  3:56       ` Yuchung Cheng
@ 2016-10-12 15:22         ` Eric Dumazet
  2016-10-18  3:35         ` Lawrence Brakmo
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2016-10-12 15:22 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Lawrence Brakmo, netdev, Kernel Team, Neal Cardwell

On Tue, 2016-10-11 at 20:56 -0700, Yuchung Cheng wrote:

> I thought more about this patch on my way home and have more
> questions: why do we exclude RTO retransmissions specifically? Also,
> when we rehash, we'll introduce reordering either in recovery or after
> recovery, as some TCP CCs like BBR would continue sending regardless,
> so starting in tcp_ack() with tp->txhash_want does not really prevent
> causing more reordering.

Note that changing txhash during a non-RTO retransmit is going to break
pacing on a bonding setup, since the change in txhash will likely select
a different slave, where MQ+FQ are the qdiscs in place.
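
For reference, the coupling is roughly this (simplified from
tcp_transmit_skb()/skb_tx_hash() of this era; details elided):

    skb_set_hash_from_sk(skb, sk);  /* skb->hash = sk->sk_txhash */
    queue_index = reciprocal_scale(skb_get_hash(skb),
                                   dev->real_num_tx_queues);
    /* A new sk_txhash therefore lands the flow on a different tx
     * queue, i.e. a different MQ child and FQ instance, losing the
     * pacing state built up for the flow on the old one.
     */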


* Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
  2016-10-12  3:56       ` Yuchung Cheng
  2016-10-12 15:22         ` Eric Dumazet
@ 2016-10-18  3:35         ` Lawrence Brakmo
  2016-10-18  4:28           ` Tom Herbert
  1 sibling, 1 reply; 8+ messages in thread
From: Lawrence Brakmo @ 2016-10-18  3:35 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Kernel Team, Eric Dumazet, Neal Cardwell

Yuchung and Eric, thank you for your comments.

It looks like I need to think more about this patch. I was trying
to reduce the likelihood of reordering (which seems even more
important based on Eric's comment on pacing), but it seems like
the only way to prevent reordering is to only re-hash after an RTO
or when there are no packets in flight (which may not occur).
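
For the record, the simpler gate I have in mind would be something like
this (sketch only, not tested):

    /* Rehash only when nothing is in flight; with packets_out == 0 a
     * new path cannot reorder packets still in the network.
     */
    if (!tp->packets_out)
            sk_set_txhash(sk);
    else
            tp->txhash_want = 1;    /* retry when in-flight drains */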


On 10/11/16, 8:56 PM, "Yuchung Cheng" <ycheng@google.com> wrote:

>[...]
>I thought more about this patch on my way home and have more
>questions: why do we exclude RTO retransmissions specifically? Also,
>when we rehash, we'll introduce reordering either in recovery or after
>recovery, as some TCP CCs like BBR would continue sending regardless,
>so starting in tcp_ack() with tp->txhash_want does not really prevent
>causing more reordering.
>
>So if we want to rehash upon a recovery event (fast or timeout), then
>we can make a helper function, e.g. tcp_retry_alt_path(), and just call
>it in tcp_enter_loss() and tcp_enter_recovery() (which are called once
>per recovery). Since it'll cause reordering anyway, I doubt we need to
>check tp->reordering at all. If packets already traverse different
>paths, rehashing can make it better or worse. If packets weren't
>traversing different paths, the rehashing itself can introduce
>reordering. So in the end there is not much benefit to making the
>decision based on current inflight and reordering, IMO...


* Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
  2016-10-18  3:35         ` Lawrence Brakmo
@ 2016-10-18  4:28           ` Tom Herbert
  0 siblings, 0 replies; 8+ messages in thread
From: Tom Herbert @ 2016-10-18  4:28 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: Yuchung Cheng, netdev, Kernel Team, Eric Dumazet, Neal Cardwell

On Mon, Oct 17, 2016 at 8:35 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
> Yuchung and Eric, thank you for your comments.
>
> It looks like I need to think more about this patch. I was trying
> to reduce the likelihood of reordering (which seems even more
> important based on Eric's comment on pacing), but it seems like
> the only way to prevent reordering is to only re-hash after an RTO
> or when there are no packets in flight (which may not occur).
>
Sounds like that should be the same condition as when we set ooo_okay?
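
IIRC tcp_transmit_skb() sets it from the write-memory accounting, i.e.
nothing the socket sent is still queued in the qdisc/device below us
(from memory, so treat as approximate):

    /* ooo_okay: safe to pick another tx queue, no reordering risk */
    skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);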

>[...]



Thread overview: 8+ messages
2016-10-11  0:18 [PATCH net-next] tcp: Change txhash on some non-RTO retransmits Lawrence Brakmo
2016-10-11 19:49 ` Yuchung Cheng
2016-10-11 21:08   ` Lawrence Brakmo
2016-10-12  1:01     ` Yuchung Cheng
2016-10-12  3:56       ` Yuchung Cheng
2016-10-12 15:22         ` Eric Dumazet
2016-10-18  3:35         ` Lawrence Brakmo
2016-10-18  4:28           ` Tom Herbert
