All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
@ 2017-07-31 22:19 Shaohua Li
  2017-07-31 22:19 ` [PATCH V4 net 1/2] net: remove unnecessary rotation Shaohua Li
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Shaohua Li @ 2017-07-31 22:19 UTC (permalink / raw)
  To: netdev, davem; +Cc: Shaohua Li

From: Shaohua Li <shli@fb.com>

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
(0xd827f) are different. This causes our router doesn't correctly close tcp
connection. The patches try to fix the issue.

Thanks,
Shaohua

Shaohua Li (2):
  net: remove unnecessary rotation
  net: fix tcp reset packet flowlabel for ipv6

 include/net/ipv6.h       | 15 ++++-----------
 net/ipv4/tcp_minisocks.c |  8 +++++++-
 net/ipv6/ip6_gre.c       |  2 +-
 net/ipv6/ip6_output.c    |  4 ++--
 net/ipv6/ip6_tunnel.c    |  2 +-
 net/ipv6/tcp_ipv6.c      | 18 +++++++++++++++++-
 6 files changed, 32 insertions(+), 17 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH V4 net 1/2] net: remove unnecessary rotation
  2017-07-31 22:19 [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet Shaohua Li
@ 2017-07-31 22:19 ` Shaohua Li
  2017-07-31 22:19 ` [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6 Shaohua Li
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Shaohua Li @ 2017-07-31 22:19 UTC (permalink / raw)
  To: netdev, davem; +Cc: Shaohua Li

From: Shaohua Li <shli@fb.com>

According to David Miller, the rotation doesn't really help avoid
security problem, so delte it.

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/net/ipv6.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6eac5cf..7548367 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -790,12 +790,6 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
 
 	hash = skb_get_hash_flowi6(skb, fl6);
 
-	/* Since this is being sent on the wire obfuscate hash a bit
-	 * to minimize possbility that any useful information to an
-	 * attacker is leaked. Only lower 20 bits are relevant.
-	 */
-	rol32(hash, 16);
-
 	flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
 
 	if (net->ipv6.sysctl.flowlabel_state_ranges)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6
  2017-07-31 22:19 [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet Shaohua Li
  2017-07-31 22:19 ` [PATCH V4 net 1/2] net: remove unnecessary rotation Shaohua Li
@ 2017-07-31 22:19 ` Shaohua Li
  2017-07-31 22:35   ` Cong Wang
  2017-08-09 14:59 ` [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet Shaohua Li
  2017-08-09 16:40 ` Tom Herbert
  3 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-07-31 22:19 UTC (permalink / raw)
  To: netdev, davem; +Cc: Shaohua Li, Eric Dumazet, Florent Fourcot, Cong Wang

From: Shaohua Li <shli@fb.com>

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The tcp reset packet has a different flowlabel, which causes our router
doesn't correctly close tcp connection. The reason is the normal packet
gets the skb->hash from sk->sk_txhash, which is generated randomly.
ip6_make_flowlabel then uses the hash to create a flowlabel. The reset
packet doesn't get assigned a hash, so the flowlabel is calculated with
flowi6.

Since user can't change timewait sock flowlabel, we create a flowlabel
for timewait socket with the random generated hash (sk->sk_txhash), then
use it in reset packet. In this way, the reset packet will have the same
flowlabel as normal packets.

This also fixes the flowlabel issue for reset packet if user configures
flowlabel, which is ignored previously.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florent Fourcot <flo@fourcot.fr>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/net/ipv6.h       |  9 ++++-----
 net/ipv4/tcp_minisocks.c |  8 +++++++-
 net/ipv6/ip6_gre.c       |  2 +-
 net/ipv6/ip6_output.c    |  4 ++--
 net/ipv6/ip6_tunnel.c    |  2 +-
 net/ipv6/tcp_ipv6.c      | 18 +++++++++++++++++-
 6 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 7548367..f8713fd 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -773,10 +773,8 @@ static inline void iph_to_flow_copy_v6addrs(struct flow_keys *flow,
 
 static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
 					__be32 flowlabel, bool autolabel,
-					struct flowi6 *fl6)
+					struct flowi6 *fl6, u32 hash)
 {
-	u32 hash;
-
 	/* @flowlabel may include more than a flow label, eg, the traffic class.
 	 * Here we want only the flow label value.
 	 */
@@ -788,7 +786,8 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
 	     net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
 		return flowlabel;
 
-	hash = skb_get_hash_flowi6(skb, fl6);
+	if (skb)
+		hash = skb_get_hash_flowi6(skb, fl6);
 
 	flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
 
@@ -814,7 +813,7 @@ static inline int ip6_default_np_autolabel(struct net *net)
 static inline void ip6_set_txhash(struct sock *sk) { }
 static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
 					__be32 flowlabel, bool autolabel,
-					struct flowi6 *fl6)
+					struct flowi6 *fl6, u32 hash)
 {
 	return flowlabel;
 }
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 0ff83c1..ac41b25 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -276,11 +276,17 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 #if IS_ENABLED(CONFIG_IPV6)
 		if (tw->tw_family == PF_INET6) {
 			struct ipv6_pinfo *np = inet6_sk(sk);
+			__be32 flowlabel;
 
 			tw->tw_v6_daddr = sk->sk_v6_daddr;
 			tw->tw_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
 			tw->tw_tclass = np->tclass;
-			tw->tw_flowlabel = be32_to_cpu(np->flow_label & IPV6_FLOWLABEL_MASK);
+			flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+			if (flowlabel == 0)
+				flowlabel = ip6_make_flowlabel(
+					sock_net(sk), NULL, 0, np->autoflowlabel,
+					NULL, sk->sk_txhash);
+			tw->tw_flowlabel = be32_to_cpu(flowlabel);
 			tw->tw_ipv6only = sk->sk_ipv6only;
 		}
 #endif
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 67ff2aa..36db44a 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -948,7 +948,7 @@ static int ip6gre_header(struct sk_buff *skb, struct net_device *dev,
 	ip6_flow_hdr(ipv6h, 0,
 		     ip6_make_flowlabel(dev_net(dev), skb,
 					t->fl.u.ip6.flowlabel, true,
-					&t->fl.u.ip6));
+					&t->fl.u.ip6, 0));
 	ipv6h->hop_limit = t->parms.hop_limit;
 	ipv6h->nexthdr = NEXTHDR_GRE;
 	ipv6h->saddr = t->parms.laddr;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 162efba..cefd249 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -230,7 +230,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 		hlimit = ip6_dst_hoplimit(dst);
 
 	ip6_flow_hdr(hdr, tclass, ip6_make_flowlabel(net, skb, fl6->flowlabel,
-						     np->autoflowlabel, fl6));
+						     np->autoflowlabel, fl6, 0));
 
 	hdr->payload_len = htons(seg_len);
 	hdr->nexthdr = proto;
@@ -1702,7 +1702,7 @@ struct sk_buff *__ip6_make_skb(struct sock *sk,
 
 	ip6_flow_hdr(hdr, v6_cork->tclass,
 		     ip6_make_flowlabel(net, skb, fl6->flowlabel,
-					np->autoflowlabel, fl6));
+					np->autoflowlabel, fl6, 0));
 	hdr->hop_limit = v6_cork->hop_limit;
 	hdr->nexthdr = proto;
 	hdr->saddr = fl6->saddr;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 3a0ba2a..9c5d129 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1201,7 +1201,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 	skb_reset_network_header(skb);
 	ipv6h = ipv6_hdr(skb);
 	ip6_flow_hdr(ipv6h, dsfield,
-		     ip6_make_flowlabel(net, skb, fl6->flowlabel, true, fl6));
+		     ip6_make_flowlabel(net, skb, fl6->flowlabel, true, fl6, 0));
 	ipv6h->hop_limit = hop_limit;
 	ipv6h->nexthdr = proto;
 	ipv6h->saddr = fl6->saddr;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2521690..bb47b6c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -891,6 +891,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 	struct sock *sk1 = NULL;
 #endif
 	int oif;
+	u8 tclass = 0;
+	__be32 flowlabel = 0;
 
 	if (th->rst)
 		return;
@@ -939,7 +941,21 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 			  (th->doff << 2);
 
 	oif = sk ? sk->sk_bound_dev_if : 0;
-	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
+	if (sk) {
+		if (sk_fullsock(sk)) {
+			struct ipv6_pinfo *np = inet6_sk(sk);
+
+			tclass = np->tclass;
+			flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+		} else {
+			struct inet_timewait_sock *tw = inet_twsk(sk);
+
+			tclass = tw->tw_tclass;
+			flowlabel = cpu_to_be32(tw->tw_flowlabel);
+		}
+	}
+	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1,
+		tclass, flowlabel);
 
 #ifdef CONFIG_TCP_MD5SIG
 out:
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6
  2017-07-31 22:19 ` [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6 Shaohua Li
@ 2017-07-31 22:35   ` Cong Wang
  2017-07-31 23:00     ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2017-07-31 22:35 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Linux Kernel Network Developers, David Miller, Shaohua Li,
	Eric Dumazet, Florent Fourcot

On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
>  static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
>                                         __be32 flowlabel, bool autolabel,
> -                                       struct flowi6 *fl6)
> +                                       struct flowi6 *fl6, u32 hash)
>  {
> -       u32 hash;
> -
>         /* @flowlabel may include more than a flow label, eg, the traffic class.
>          * Here we want only the flow label value.
>          */
> @@ -788,7 +786,8 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
>              net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
>                 return flowlabel;
>
> -       hash = skb_get_hash_flowi6(skb, fl6);
> +       if (skb)
> +               hash = skb_get_hash_flowi6(skb, fl6);


Why not just move skb_get_hash_flowi6() to its caller?
This check is not necessary. If you don't want to touch
existing callers, you can just introduce a wrapper:


static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
                                        __be32 flowlabel, bool autolabel,
                                        struct flowi6 *fl6)
{
  u32 hash = skb_get_hash_flowi6(skb, fl6);
  return __ip6_make_flowlabel(net, flowlabel, autolabel, hash);
}

And your code can just call:

__ip6_make_flowlabel(net, flowlabel, autolabel, sk->sk_txhash);

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6
  2017-07-31 22:35   ` Cong Wang
@ 2017-07-31 23:00     ` Shaohua Li
  2017-08-01 21:17       ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-07-31 23:00 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Shaohua Li,
	Eric Dumazet, Florent Fourcot

On Mon, Jul 31, 2017 at 03:35:02PM -0700, Cong Wang wrote:
> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
> >  static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> >                                         __be32 flowlabel, bool autolabel,
> > -                                       struct flowi6 *fl6)
> > +                                       struct flowi6 *fl6, u32 hash)
> >  {
> > -       u32 hash;
> > -
> >         /* @flowlabel may include more than a flow label, eg, the traffic class.
> >          * Here we want only the flow label value.
> >          */
> > @@ -788,7 +786,8 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> >              net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
> >                 return flowlabel;
> >
> > -       hash = skb_get_hash_flowi6(skb, fl6);
> > +       if (skb)
> > +               hash = skb_get_hash_flowi6(skb, fl6);
> 
> 
> Why not just move skb_get_hash_flowi6() to its caller?
> This check is not necessary. If you don't want to touch
> existing callers, you can just introduce a wrapper:
> 
> 
> static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
>                                         __be32 flowlabel, bool autolabel,
>                                         struct flowi6 *fl6)
> {
>   u32 hash = skb_get_hash_flowi6(skb, fl6);
>   return __ip6_make_flowlabel(net, flowlabel, autolabel, hash);
> }

this will always call skb_get_hash_flowi6 for the fast path even auto flowlabel
is disabled. I thought we should avoid this.

> 
> And your code can just call:
> 
> __ip6_make_flowlabel(net, flowlabel, autolabel, sk->sk_txhash);

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6
  2017-07-31 23:00     ` Shaohua Li
@ 2017-08-01 21:17       ` Cong Wang
  2017-08-01 21:42         ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2017-08-01 21:17 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Linux Kernel Network Developers, David Miller, Shaohua Li,
	Eric Dumazet, Florent Fourcot

On Mon, Jul 31, 2017 at 4:00 PM, Shaohua Li <shli@kernel.org> wrote:
> On Mon, Jul 31, 2017 at 03:35:02PM -0700, Cong Wang wrote:
>> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
>> >  static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
>> >                                         __be32 flowlabel, bool autolabel,
>> > -                                       struct flowi6 *fl6)
>> > +                                       struct flowi6 *fl6, u32 hash)
>> >  {
>> > -       u32 hash;
>> > -
>> >         /* @flowlabel may include more than a flow label, eg, the traffic class.
>> >          * Here we want only the flow label value.
>> >          */
>> > @@ -788,7 +786,8 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
>> >              net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
>> >                 return flowlabel;
>> >
>> > -       hash = skb_get_hash_flowi6(skb, fl6);
>> > +       if (skb)
>> > +               hash = skb_get_hash_flowi6(skb, fl6);
>>
>>
>> Why not just move skb_get_hash_flowi6() to its caller?
>> This check is not necessary. If you don't want to touch
>> existing callers, you can just introduce a wrapper:
>>
>>
>> static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
>>                                         __be32 flowlabel, bool autolabel,
>>                                         struct flowi6 *fl6)
>> {
>>   u32 hash = skb_get_hash_flowi6(skb, fl6);
>>   return __ip6_make_flowlabel(net, flowlabel, autolabel, hash);
>> }
>
> this will always call skb_get_hash_flowi6 for the fast path even auto flowlabel
> is disabled. I thought we should avoid this.

Yeah, but you can move the check out too,
something like:

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6eac5cf8f1e6..18ffa824c00a 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -771,31 +771,22 @@ static inline void
iph_to_flow_copy_v6addrs(struct flow_keys *flow,

 #define IP6_DEFAULT_AUTO_FLOW_LABELS   IP6_AUTO_FLOW_LABEL_OPTOUT

-static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
-                                       __be32 flowlabel, bool autolabel,
-                                       struct flowi6 *fl6)
+static inline bool ip6_need_flowlabel(struct net *net, __be32
flowlabel, bool autolabel)
 {
-       u32 hash;
-
        /* @flowlabel may include more than a flow label, eg, the traffic class.
         * Here we want only the flow label value.
         */
-       flowlabel &= IPV6_FLOWLABEL_MASK;
-
-       if (flowlabel ||
+       if ((flowlabel & IPV6_FLOWLABEL_MASK) ||
            net->ipv6.sysctl.auto_flowlabels == IP6_AUTO_FLOW_LABEL_OFF ||
            (!autolabel &&
             net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
-               return flowlabel;
-
-       hash = skb_get_hash_flowi6(skb, fl6);
+               return false;

-       /* Since this is being sent on the wire obfuscate hash a bit
-        * to minimize possbility that any useful information to an
-        * attacker is leaked. Only lower 20 bits are relevant.
-        */
-       rol32(hash, 16);
+       return true;
+}

+static inline __be32 __ip6_make_flowlabel(struct net *net, __be32
flowlabel, u32 hash)
+{
        flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;

        if (net->ipv6.sysctl.flowlabel_state_ranges)
@@ -804,6 +795,19 @@ static inline __be32 ip6_make_flowlabel(struct
net *net, struct sk_buff *skb,
        return flowlabel;
 }

+static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
+                                       __be32 flowlabel, bool autolabel,
+                                       struct flowi6 *fl6)
+{
+       u32 hash;
+
+       if (!ip6_need_flowlabel(net, flowlabel, autolabel))
+               return flowlabel & IPV6_FLOWLABEL_MASK;
+
+       hash = skb_get_hash_flowi6(skb, fl6);
+       return __ip6_make_flowlabel(net, flowlabel, hash);
+}
+
 static inline int ip6_default_np_autolabel(struct net *net)
 {
        switch (net->ipv6.sysctl.auto_flowlabels) {

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6
  2017-08-01 21:17       ` Cong Wang
@ 2017-08-01 21:42         ` Shaohua Li
  2017-08-03  1:33           ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-08-01 21:42 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Shaohua Li,
	Eric Dumazet, Florent Fourcot

On Tue, Aug 01, 2017 at 02:17:58PM -0700, Cong Wang wrote:
> On Mon, Jul 31, 2017 at 4:00 PM, Shaohua Li <shli@kernel.org> wrote:
> > On Mon, Jul 31, 2017 at 03:35:02PM -0700, Cong Wang wrote:
> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
> >> >  static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> >> >                                         __be32 flowlabel, bool autolabel,
> >> > -                                       struct flowi6 *fl6)
> >> > +                                       struct flowi6 *fl6, u32 hash)
> >> >  {
> >> > -       u32 hash;
> >> > -
> >> >         /* @flowlabel may include more than a flow label, eg, the traffic class.
> >> >          * Here we want only the flow label value.
> >> >          */
> >> > @@ -788,7 +786,8 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> >> >              net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
> >> >                 return flowlabel;
> >> >
> >> > -       hash = skb_get_hash_flowi6(skb, fl6);
> >> > +       if (skb)
> >> > +               hash = skb_get_hash_flowi6(skb, fl6);
> >>
> >>
> >> Why not just move skb_get_hash_flowi6() to its caller?
> >> This check is not necessary. If you don't want to touch
> >> existing callers, you can just introduce a wrapper:
> >>
> >>
> >> static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> >>                                         __be32 flowlabel, bool autolabel,
> >>                                         struct flowi6 *fl6)
> >> {
> >>   u32 hash = skb_get_hash_flowi6(skb, fl6);
> >>   return __ip6_make_flowlabel(net, flowlabel, autolabel, hash);
> >> }
> >
> > this will always call skb_get_hash_flowi6 for the fast path even auto flowlabel
> > is disabled. I thought we should avoid this.
> 
> Yeah, but you can move the check out too,
> something like:

Is this really better? I don't see any point. I'd use my original patch other
than this one. that said, there are just several lines of code, brutally
'abstract' them into a function doesn't make the code better.
 
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index 6eac5cf8f1e6..18ffa824c00a 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -771,31 +771,22 @@ static inline void
> iph_to_flow_copy_v6addrs(struct flow_keys *flow,
> 
>  #define IP6_DEFAULT_AUTO_FLOW_LABELS   IP6_AUTO_FLOW_LABEL_OPTOUT
> 
> -static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> -                                       __be32 flowlabel, bool autolabel,
> -                                       struct flowi6 *fl6)
> +static inline bool ip6_need_flowlabel(struct net *net, __be32
> flowlabel, bool autolabel)
>  {
> -       u32 hash;
> -
>         /* @flowlabel may include more than a flow label, eg, the traffic class.
>          * Here we want only the flow label value.
>          */
> -       flowlabel &= IPV6_FLOWLABEL_MASK;
> -
> -       if (flowlabel ||
> +       if ((flowlabel & IPV6_FLOWLABEL_MASK) ||
>             net->ipv6.sysctl.auto_flowlabels == IP6_AUTO_FLOW_LABEL_OFF ||
>             (!autolabel &&
>              net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
> -               return flowlabel;
> -
> -       hash = skb_get_hash_flowi6(skb, fl6);
> +               return false;
> 
> -       /* Since this is being sent on the wire obfuscate hash a bit
> -        * to minimize possbility that any useful information to an
> -        * attacker is leaked. Only lower 20 bits are relevant.
> -        */
> -       rol32(hash, 16);
> +       return true;
> +}
> 
> +static inline __be32 __ip6_make_flowlabel(struct net *net, __be32
> flowlabel, u32 hash)
> +{
>         flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
> 
>         if (net->ipv6.sysctl.flowlabel_state_ranges)
> @@ -804,6 +795,19 @@ static inline __be32 ip6_make_flowlabel(struct
> net *net, struct sk_buff *skb,
>         return flowlabel;
>  }
> 
> +static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> +                                       __be32 flowlabel, bool autolabel,
> +                                       struct flowi6 *fl6)
> +{
> +       u32 hash;
> +
> +       if (!ip6_need_flowlabel(net, flowlabel, autolabel))
> +               return flowlabel & IPV6_FLOWLABEL_MASK;
> +
> +       hash = skb_get_hash_flowi6(skb, fl6);
> +       return __ip6_make_flowlabel(net, flowlabel, hash);
> +}
> +
>  static inline int ip6_default_np_autolabel(struct net *net)
>  {
>         switch (net->ipv6.sysctl.auto_flowlabels) {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6
  2017-08-01 21:42         ` Shaohua Li
@ 2017-08-03  1:33           ` Cong Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Cong Wang @ 2017-08-03  1:33 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Linux Kernel Network Developers, David Miller, Shaohua Li,
	Eric Dumazet, Florent Fourcot

On Tue, Aug 1, 2017 at 2:42 PM, Shaohua Li <shli@kernel.org> wrote:
> Is this really better? I don't see any point. I'd use my original patch other
> than this one. that said, there are just several lines of code, brutally
> 'abstract' them into a function doesn't make the code better.

The current ip6_make_flowlabel() is already a mess, with your
change you just make it worse to read. I just want to make it suck
less.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-07-31 22:19 [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet Shaohua Li
  2017-07-31 22:19 ` [PATCH V4 net 1/2] net: remove unnecessary rotation Shaohua Li
  2017-07-31 22:19 ` [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6 Shaohua Li
@ 2017-08-09 14:59 ` Shaohua Li
  2017-08-09 17:55   ` David Miller
  2017-08-09 16:40 ` Tom Herbert
  3 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-08-09 14:59 UTC (permalink / raw)
  To: netdev, davem; +Cc: Shaohua Li

On Mon, Jul 31, 2017 at 03:19:21PM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Please see below tcpdump output:
> 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
> 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
> 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
> 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
> 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
> 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
> 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
> 
> The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
> (0xd827f) are different. This causes our router doesn't correctly close tcp
> connection. The patches try to fix the issue.

Hi Dave,

Could you please look at the patches?

Thanks,
Shaohua



> 
> Thanks,
> Shaohua
> 
> Shaohua Li (2):
>   net: remove unnecessary rotation
>   net: fix tcp reset packet flowlabel for ipv6
> 
>  include/net/ipv6.h       | 15 ++++-----------
>  net/ipv4/tcp_minisocks.c |  8 +++++++-
>  net/ipv6/ip6_gre.c       |  2 +-
>  net/ipv6/ip6_output.c    |  4 ++--
>  net/ipv6/ip6_tunnel.c    |  2 +-
>  net/ipv6/tcp_ipv6.c      | 18 +++++++++++++++++-
>  6 files changed, 32 insertions(+), 17 deletions(-)
> 
> -- 
> 2.9.3
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-07-31 22:19 [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet Shaohua Li
                   ` (2 preceding siblings ...)
  2017-08-09 14:59 ` [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet Shaohua Li
@ 2017-08-09 16:40 ` Tom Herbert
  2017-08-10 16:30   ` Shaohua Li
  3 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2017-08-09 16:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Linux Kernel Network Developers, David S. Miller, Shaohua Li

On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
> From: Shaohua Li <shli@fb.com>
>
> Please see below tcpdump output:
> 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
> 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
> 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
> 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
> 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
> 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
> 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
>
> The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
> (0xd827f) are different. This causes our router doesn't correctly close tcp
> connection. The patches try to fix the issue.
>
Shaohua,

Can you give some more detail about what the router doesn't close the
TCP connection means? I'm guessing the problem is either: 1) the
router is maintaining connection state that includes the flow label in
a connection tuple. 2) some router in the path is maintaining
connection state, but when the flow label changes the flow's packet
are routed through a different router that doesn't have a state for
the flow it drops the packet. #1 should be easily fix in the router,
flow labels cannot be used as state. #2 is the known problem that
stateful firewalls have killed our ability to use multihoming.

Another consideration is that sk_txhash is also used in routing
decisions by the local host (flow label is normally derived from
txhash). If you want to ensure that connections are routed
consistently for timewait state you might need sk_txhash saved also.

Tom

> Thanks,
> Shaohua
>
> Shaohua Li (2):
>   net: remove unnecessary rotation
>   net: fix tcp reset packet flowlabel for ipv6
>
>  include/net/ipv6.h       | 15 ++++-----------
>  net/ipv4/tcp_minisocks.c |  8 +++++++-
>  net/ipv6/ip6_gre.c       |  2 +-
>  net/ipv6/ip6_output.c    |  4 ++--
>  net/ipv6/ip6_tunnel.c    |  2 +-
>  net/ipv6/tcp_ipv6.c      | 18 +++++++++++++++++-
>  6 files changed, 32 insertions(+), 17 deletions(-)
>
> --
> 2.9.3
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-09 14:59 ` [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet Shaohua Li
@ 2017-08-09 17:55   ` David Miller
  0 siblings, 0 replies; 32+ messages in thread
From: David Miller @ 2017-08-09 17:55 UTC (permalink / raw)
  To: shli; +Cc: netdev, shli

From: Shaohua Li <shli@kernel.org>
Date: Wed, 9 Aug 2017 07:59:53 -0700

> Could you please look at the patches?

If you actually looked in patchwork, the state of your patches is
"changes requested".

This means that you were given feedback asking to change something
about your patches, which means that a new submission is expected.

I think Cong Wang's function movement request is legitimate, so
please do it.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-09 16:40 ` Tom Herbert
@ 2017-08-10 16:30   ` Shaohua Li
  2017-08-10 18:30     ` Tom Herbert
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-08-10 16:30 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers, David S. Miller, Shaohua Li

On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
> > From: Shaohua Li <shli@fb.com>
> >
> > Please see below tcpdump output:
> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
> >
> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
> > (0xd827f) are different. This causes our router doesn't correctly close tcp
> > connection. The patches try to fix the issue.
> >
> Shaohua,
> 
> Can you give some more detail about what the router doesn't close the
> TCP connection means? I'm guessing the problem is either: 1) the
> router is maintaining connection state that includes the flow label in
> a connection tuple. 2) some router in the path is maintaining
> connection state, but when the flow label changes the flow's packet
> are routed through a different router that doesn't have a state for
> the flow it drops the packet. #1 should be easily fix in the router,
> flow labels cannot be used as state. #2 is the known problem that
> stateful firewalls have killed our ability to use multihoming.

The #2 is exactly the problem we saw.
 
> Another consideration is that sk_txhash is also used in routing
> decisions by the local host (flow label is normally derived from
> txhash). If you want to ensure that connections are routed
> consistently for timewait state you might need sk_txhash saved also.

As far as I understood, we don't use sk_txhash for routing selection. The code
does routing selection with flowlabel user configured, at that time we don't
derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
does routing selection first and then uses ip6_make_flowlabel to build packet
data where we derive flowlabel from skb->hash.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-10 16:30   ` Shaohua Li
@ 2017-08-10 18:30     ` Tom Herbert
  2017-08-10 19:13       ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2017-08-10 18:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Linux Kernel Network Developers, David S. Miller, Shaohua Li

On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
> On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
>> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
>> > From: Shaohua Li <shli@fb.com>
>> >
>> > Please see below tcpdump output:
>> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
>> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
>> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
>> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
>> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
>> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
>> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
>> >
>> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
>> > (0xd827f) are different. This causes our router doesn't correctly close tcp
>> > connection. The patches try to fix the issue.
>> >
>> Shaohua,
>>
>> Can you give some more detail about what the router doesn't close the
>> TCP connection means? I'm guessing the problem is either: 1) the
>> router is maintaining connection state that includes the flow label in
>> a connection tuple. 2) some router in the path is maintaining
>> connection state, but when the flow label changes the flow's packet
>> are routed through a different router that doesn't have a state for
>> the flow it drops the packet. #1 should be easily fix in the router,
>> flow labels cannot be used as state. #2 is the known problem that
>> stateful firewalls have killed our ability to use multihoming.
>
> The #2 is exactly the problem we saw.
>
>> Another consideration is that sk_txhash is also used in routing
>> decisions by the local host (flow label is normally derived from
>> txhash). If you want to ensure that connections are routed
>> consistently for timewait state you might need sk_txhash saved also.
>
> As far as I understood, we don't use sk_txhash for routing selection. The code
> does routing selection with flowlabel user configured, at that time we don't
> derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
> does routing selection first and then uses ip6_make_flowlabel to build packet
> data where we derive flowlabel from skb->hash.
>
That is assuming one particular use case. Generally, if you want to
ensure all packets for a flow take the same path you'll need tx_hash
and make it persistent (disable flow bender). For instance, if you
were doing UDP encapsulation like in VXLAN the UDP source port
selection is unaffected by saved flow label for the lifetime of the
flow. So we would still hit #2 in that case and the stateful device
doesn't see whole flow. It might be just as easy to move tx_hash in
skc_common so that it's available in TW state for this purpose. Then
when moving to TW state just copy the tx_hash.

Tom

> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-10 18:30     ` Tom Herbert
@ 2017-08-10 19:13       ` Shaohua Li
  2017-08-12  1:00         ` Tom Herbert
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-08-10 19:13 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers, David S. Miller, Shaohua Li

On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
> >> > From: Shaohua Li <shli@fb.com>
> >> >
> >> > Please see below tcpdump output:
> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
> >> >
> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
> >> > (0xd827f) are different. This causes our router doesn't correctly close tcp
> >> > connection. The patches try to fix the issue.
> >> >
> >> Shaohua,
> >>
> >> Can you give some more detail about what the router doesn't close the
> >> TCP connection means? I'm guessing the problem is either: 1) the
> >> router is maintaining connection state that includes the flow label in
> >> a connection tuple. 2) some router in the path is maintaining
> >> connection state, but when the flow label changes the flow's packet
> >> are routed through a different router that doesn't have a state for
> >> the flow it drops the packet. #1 should be easily fix in the router,
> >> flow labels cannot be used as state. #2 is the known problem that
> >> stateful firewalls have killed our ability to use multihoming.
> >
> > The #2 is exactly the problem we saw.
> >
> >> Another consideration is that sk_txhash is also used in routing
> >> decisions by the local host (flow label is normally derived from
> >> txhash). If you want to ensure that connections are routed
> >> consistently for timewait state you might need sk_txhash saved also.
> >
> > As far as I understood, we don't use sk_txhash for routing selection. The code
> > does routing selection with flowlabel user configured, at that time we don't
> > derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
> > does routing selection first and then uses ip6_make_flowlabel to build packet
> > data where we derive flowlabel from skb->hash.
> >
> That is assuming one particular use case. Generally, if you want to
> ensure all packets for a flow take the same path you'll need tx_hash
> and make it persistent (disable flow bender). For instance, if you
> were doing UDP encapsulation like in VXLAN the UDP source port
> selection is unaffected by saved flow label for the lifetime of the
> flow. So we would still hit #2 in that case and the stateful device
> doesn't see whole flow. It might be just as easy to move tx_hash in
> skc_common so that it's available in TW state for this purpose. Then
> when moving to TW state just copy the tx_hash.

Hi Tom,

My original implementation is to add a tx_hash in tw sock, we then copy sock's
tx_hash to the tw tx_hash. This does makes things simplier. One concern from
Eric is this will increase the size of tw sock. If we move tx_hash to
skc_common, all sock size will increase, is this acceptable?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-10 19:13       ` Shaohua Li
@ 2017-08-12  1:00         ` Tom Herbert
  2017-08-15  2:52           ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2017-08-12  1:00 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Linux Kernel Network Developers, David S. Miller, Shaohua Li

On Thu, Aug 10, 2017 at 12:13 PM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
>> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
>> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
>> >> > From: Shaohua Li <shli@fb.com>
>> >> >
>> >> > Please see below tcpdump output:
>> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
>> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
>> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
>> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
>> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
>> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
>> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
>> >> >
>> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
>> >> > (0xd827f) are different. This causes our router doesn't correctly close tcp
>> >> > connection. The patches try to fix the issue.
>> >> >
>> >> Shaohua,
>> >>
>> >> Can you give some more detail about what the router doesn't close the
>> >> TCP connection means? I'm guessing the problem is either: 1) the
>> >> router is maintaining connection state that includes the flow label in
>> >> a connection tuple. 2) some router in the path is maintaining
>> >> connection state, but when the flow label changes the flow's packet
>> >> are routed through a different router that doesn't have a state for
>> >> the flow it drops the packet. #1 should be easily fix in the router,
>> >> flow labels cannot be used as state. #2 is the known problem that
>> >> stateful firewalls have killed our ability to use multihoming.
>> >
>> > The #2 is exactly the problem we saw.
>> >
>> >> Another consideration is that sk_txhash is also used in routing
>> >> decisions by the local host (flow label is normally derived from
>> >> txhash). If you want to ensure that connections are routed
>> >> consistently for timewait state you might need sk_txhash saved also.
>> >
>> > As far as I understood, we don't use sk_txhash for routing selection. The code
>> > does routing selection with flowlabel user configured, at that time we don't
>> > derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
>> > does routing selection first and then uses ip6_make_flowlabel to build packet
>> > data where we derive flowlabel from skb->hash.
>> >
>> That is assuming one particular use case. Generally, if you want to
>> ensure all packets for a flow take the same path you'll need tx_hash
>> and make it persistent (disable flow bender). For instance, if you
>> were doing UDP encapsulation like in VXLAN the UDP source port
>> selection is unaffected by saved flow label for the lifetime of the
>> flow. So we would still hit #2 in that case and the stateful device
>> doesn't see whole flow. It might be just as easy to move tx_hash in
>> skc_common so that it's available in TW state for this purpose. Then
>> when moving to TW state just copy the tx_hash.
>
> Hi Tom,
>
> My original implementation is to add a tx_hash in tw sock, we then copy sock's
> tx_hash to the tw tx_hash. This does makes things simplier. One concern from
> Eric is this will increase the size of tw sock. If we move tx_hash to
> skc_common, all sock size will increase, is this acceptable?

I think that can only be measured by how critical it is to
persistently route all packets the same exact way for every
connection. Page one of the IP book clearly states that IP packets can
be dropped, duplicated, or received out of order. Received OOO implies
that packet for the same flow are allowed to take different paths. The
requirement that packets for the same flow must always take the same
path through the network was created by stateful middleboxes-- it's
not inherent in the architecture of IP networking. Unfortunately,
we're seeing this become more and more of a problem as more devices
are multi-homed (like smart phones) and these network requirement
cripple our ability to take advantage of features like that.

Personally, I wish the middleboxes fix the problem they created, but I
suppose we need to be pragmatic at least in the short term.

Tom

> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-12  1:00         ` Tom Herbert
@ 2017-08-15  2:52           ` Shaohua Li
  2017-08-15 14:08             ` Tom Herbert
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-08-15  2:52 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers, David S. Miller, Shaohua Li

On Fri, Aug 11, 2017 at 06:00:20PM -0700, Tom Herbert wrote:
> On Thu, Aug 10, 2017 at 12:13 PM, Shaohua Li <shli@kernel.org> wrote:
> > On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
> >> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
> >> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
> >> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
> >> >> > From: Shaohua Li <shli@fb.com>
> >> >> >
> >> >> > Please see below tcpdump output:
> >> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
> >> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
> >> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
> >> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
> >> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
> >> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
> >> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
> >> >> >
> >> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
> >> >> > (0xd827f) are different. This causes our router doesn't correctly close tcp
> >> >> > connection. The patches try to fix the issue.
> >> >> >
> >> >> Shaohua,
> >> >>
> >> >> Can you give some more detail about what the router doesn't close the
> >> >> TCP connection means? I'm guessing the problem is either: 1) the
> >> >> router is maintaining connection state that includes the flow label in
> >> >> a connection tuple. 2) some router in the path is maintaining
> >> >> connection state, but when the flow label changes the flow's packet
> >> >> are routed through a different router that doesn't have a state for
> >> >> the flow it drops the packet. #1 should be easily fix in the router,
> >> >> flow labels cannot be used as state. #2 is the known problem that
> >> >> stateful firewalls have killed our ability to use multihoming.
> >> >
> >> > The #2 is exactly the problem we saw.
> >> >
> >> >> Another consideration is that sk_txhash is also used in routing
> >> >> decisions by the local host (flow label is normally derived from
> >> >> txhash). If you want to ensure that connections are routed
> >> >> consistently for timewait state you might need sk_txhash saved also.
> >> >
> >> > As far as I understood, we don't use sk_txhash for routing selection. The code
> >> > does routing selection with flowlabel user configured, at that time we don't
> >> > derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
> >> > does routing selection first and then uses ip6_make_flowlabel to build packet
> >> > data where we derive flowlabel from skb->hash.
> >> >
> >> That is assuming one particular use case. Generally, if you want to
> >> ensure all packets for a flow take the same path you'll need tx_hash
> >> and make it persistent (disable flow bender). For instance, if you
> >> were doing UDP encapsulation like in VXLAN the UDP source port
> >> selection is unaffected by saved flow label for the lifetime of the
> >> flow. So we would still hit #2 in that case and the stateful device
> >> doesn't see whole flow. It might be just as easy to move tx_hash in
> >> skc_common so that it's available in TW state for this purpose. Then
> >> when moving to TW state just copy the tx_hash.
> >
> > Hi Tom,
> >
> > My original implementation is to add a tx_hash in tw sock, we then copy sock's
> > tx_hash to the tw tx_hash. This does makes things simplier. One concern from
> > Eric is this will increase the size of tw sock. If we move tx_hash to
> > skc_common, all sock size will increase, is this acceptable?
> 
> I think that can only be measured by how critical it is to
> persistently route all packets the same exact way for every
> connection. Page one of the IP book clearly states that IP packets can
> be dropped, duplicated, or received out of order. Received OOO implies
> that packet for the same flow are allowed to take different paths. The
> requirement that packets for the same flow must always take the same
> path through the network was created by stateful middleboxes-- it's
> not inherent in the architecture of IP networking. Unfortunately,
> we're seeing this become more and more of a problem as more devices
> are multi-homed (like smart phones) and these network requirement
> cripple our ability to take advantage of features like that.
> 
> Personally, I wish the middleboxes fix the problem they created, but I
> suppose we need to be pragmatic at least in the short term.

Hmm, I still hesitate to add a new field in skc_common. Fixing current problem
looks propriate in current stage. I'd defer fixing the generic issue till it's
necessary.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-15  2:52           ` Shaohua Li
@ 2017-08-15 14:08             ` Tom Herbert
  2017-08-15 22:42               ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2017-08-15 14:08 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Linux Kernel Network Developers, David S. Miller, Shaohua Li

On Mon, Aug 14, 2017 at 7:52 PM, Shaohua Li <shli@kernel.org> wrote:
> On Fri, Aug 11, 2017 at 06:00:20PM -0700, Tom Herbert wrote:
>> On Thu, Aug 10, 2017 at 12:13 PM, Shaohua Li <shli@kernel.org> wrote:
>> > On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
>> >> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
>> >> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
>> >> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
>> >> >> > From: Shaohua Li <shli@fb.com>
>> >> >> >
>> >> >> > Please see below tcpdump output:
>> >> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
>> >> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
>> >> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> >> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
>> >> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> >> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
>> >> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> >> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
>> >> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> >> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
>> >> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
>> >> >> >
>> >> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
>> >> >> > (0xd827f) are different. This causes our router doesn't correctly close tcp
>> >> >> > connection. The patches try to fix the issue.
>> >> >> >
>> >> >> Shaohua,
>> >> >>
>> >> >> Can you give some more detail about what the router doesn't close the
>> >> >> TCP connection means? I'm guessing the problem is either: 1) the
>> >> >> router is maintaining connection state that includes the flow label in
>> >> >> a connection tuple. 2) some router in the path is maintaining
>> >> >> connection state, but when the flow label changes the flow's packet
>> >> >> are routed through a different router that doesn't have a state for
>> >> >> the flow it drops the packet. #1 should be easily fix in the router,
>> >> >> flow labels cannot be used as state. #2 is the known problem that
>> >> >> stateful firewalls have killed our ability to use multihoming.
>> >> >
>> >> > The #2 is exactly the problem we saw.
>> >> >
>> >> >> Another consideration is that sk_txhash is also used in routing
>> >> >> decisions by the local host (flow label is normally derived from
>> >> >> txhash). If you want to ensure that connections are routed
>> >> >> consistently for timewait state you might need sk_txhash saved also.
>> >> >
>> >> > As far as I understood, we don't use sk_txhash for routing selection. The code
>> >> > does routing selection with flowlabel user configured, at that time we don't
>> >> > derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
>> >> > does routing selection first and then uses ip6_make_flowlabel to build packet
>> >> > data where we derive flowlabel from skb->hash.
>> >> >
>> >> That is assuming one particular use case. Generally, if you want to
>> >> ensure all packets for a flow take the same path you'll need tx_hash
>> >> and make it persistent (disable flow bender). For instance, if you
>> >> were doing UDP encapsulation like in VXLAN the UDP source port
>> >> selection is unaffected by saved flow label for the lifetime of the
>> >> flow. So we would still hit #2 in that case and the stateful device
>> >> doesn't see whole flow. It might be just as easy to move tx_hash in
>> >> skc_common so that it's available in TW state for this purpose. Then
>> >> when moving to TW state just copy the tx_hash.
>> >
>> > Hi Tom,
>> >
>> > My original implementation is to add a tx_hash in tw sock, we then copy sock's
>> > tx_hash to the tw tx_hash. This does makes things simplier. One concern from
>> > Eric is this will increase the size of tw sock. If we move tx_hash to
>> > skc_common, all sock size will increase, is this acceptable?
>>
>> I think that can only be measured by how critical it is to
>> persistently route all packets the same exact way for every
>> connection. Page one of the IP book clearly states that IP packets can
>> be dropped, duplicated, or received out of order. Received OOO implies
>> that packet for the same flow are allowed to take different paths. The
>> requirement that packets for the same flow must always take the same
>> path through the network was created by stateful middleboxes-- it's
>> not inherent in the architecture of IP networking. Unfortunately,
>> we're seeing this become more and more of a problem as more devices
>> are multi-homed (like smart phones) and these network requirement
>> cripple our ability to take advantage of features like that.
>>
>> Personally, I wish the middleboxes fix the problem they created, but I
>> suppose we need to be pragmatic at least in the short term.
>
> Hmm, I still hesitate to add a new field in skc_common. Fixing current problem
> looks propriate in current stage. I'd defer fixing the generic issue till it's
> necessary.
>
Shaohua,

An alternative would be to not initialize sk_txhash, but instead defer
hash computation to use flow dissector in the TX path when the hash is
needed (to get flow label, src port for UDP encap, route for
multipath, etc.). At the first hash computation in TX path  the result
in sk_txhash. In TW state there is no socket so flow dissector is
always used but that should yield the same hash. No extra fields would
be needed and additional cost is negligible.

Tom

> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-15 14:08             ` Tom Herbert
@ 2017-08-15 22:42               ` Shaohua Li
  2017-08-16  0:15                 ` Tom Herbert
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-08-15 22:42 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller, Martin KaFai Lau

On Tue, Aug 15, 2017 at 07:08:31AM -0700, Tom Herbert wrote:
> On Mon, Aug 14, 2017 at 7:52 PM, Shaohua Li <shli@kernel.org> wrote:
> > On Fri, Aug 11, 2017 at 06:00:20PM -0700, Tom Herbert wrote:
> >> On Thu, Aug 10, 2017 at 12:13 PM, Shaohua Li <shli@kernel.org> wrote:
> >> > On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
> >> >> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
> >> >> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
> >> >> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
> >> >> >> > From: Shaohua Li <shli@fb.com>
> >> >> >> >
> >> >> >> > Please see below tcpdump output:
> >> >> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
> >> >> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
> >> >> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> >> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
> >> >> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> >> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
> >> >> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> >> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
> >> >> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> >> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
> >> >> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
> >> >> >> >
> >> >> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
> >> >> >> > (0xd827f) are different. This causes our router doesn't correctly close tcp
> >> >> >> > connection. The patches try to fix the issue.
> >> >> >> >
> >> >> >> Shaohua,
> >> >> >>
> >> >> >> Can you give some more detail about what the router doesn't close the
> >> >> >> TCP connection means? I'm guessing the problem is either: 1) the
> >> >> >> router is maintaining connection state that includes the flow label in
> >> >> >> a connection tuple. 2) some router in the path is maintaining
> >> >> >> connection state, but when the flow label changes the flow's packet
> >> >> >> are routed through a different router that doesn't have a state for
> >> >> >> the flow it drops the packet. #1 should be easily fix in the router,
> >> >> >> flow labels cannot be used as state. #2 is the known problem that
> >> >> >> stateful firewalls have killed our ability to use multihoming.
> >> >> >
> >> >> > The #2 is exactly the problem we saw.
> >> >> >
> >> >> >> Another consideration is that sk_txhash is also used in routing
> >> >> >> decisions by the local host (flow label is normally derived from
> >> >> >> txhash). If you want to ensure that connections are routed
> >> >> >> consistently for timewait state you might need sk_txhash saved also.
> >> >> >
> >> >> > As far as I understood, we don't use sk_txhash for routing selection. The code
> >> >> > does routing selection with flowlabel user configured, at that time we don't
> >> >> > derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
> >> >> > does routing selection first and then uses ip6_make_flowlabel to build packet
> >> >> > data where we derive flowlabel from skb->hash.
> >> >> >
> >> >> That is assuming one particular use case. Generally, if you want to
> >> >> ensure all packets for a flow take the same path you'll need tx_hash
> >> >> and make it persistent (disable flow bender). For instance, if you
> >> >> were doing UDP encapsulation like in VXLAN the UDP source port
> >> >> selection is unaffected by saved flow label for the lifetime of the
> >> >> flow. So we would still hit #2 in that case and the stateful device
> >> >> doesn't see whole flow. It might be just as easy to move tx_hash in
> >> >> skc_common so that it's available in TW state for this purpose. Then
> >> >> when moving to TW state just copy the tx_hash.
> >> >
> >> > Hi Tom,
> >> >
> >> > My original implementation is to add a tx_hash in tw sock, we then copy sock's
> >> > tx_hash to the tw tx_hash. This does makes things simplier. One concern from
> >> > Eric is this will increase the size of tw sock. If we move tx_hash to
> >> > skc_common, all sock size will increase, is this acceptable?
> >>
> >> I think that can only be measured by how critical it is to
> >> persistently route all packets the same exact way for every
> >> connection. Page one of the IP book clearly states that IP packets can
> >> be dropped, duplicated, or received out of order. Received OOO implies
> >> that packet for the same flow are allowed to take different paths. The
> >> requirement that packets for the same flow must always take the same
> >> path through the network was created by stateful middleboxes-- it's
> >> not inherent in the architecture of IP networking. Unfortunately,
> >> we're seeing this become more and more of a problem as more devices
> >> are multi-homed (like smart phones) and these network requirement
> >> cripple our ability to take advantage of features like that.
> >>
> >> Personally, I wish the middleboxes fix the problem they created, but I
> >> suppose we need to be pragmatic at least in the short term.
> >
> > Hmm, I still hesitate to add a new field in skc_common. Fixing current problem
> > looks propriate in current stage. I'd defer fixing the generic issue till it's
> > necessary.
> >
> Shaohua,
> 
> An alternative would be to not initialize sk_txhash, but instead defer
> hash computation to use flow dissector in the TX path when the hash is
> needed (to get flow label, src port for UDP encap, route for
> multipath, etc.). At the first hash computation in TX path  the result
> in sk_txhash. In TW state there is no socket so flow dissector is
> always used but that should yield the same hash. No extra fields would
> be needed and additional cost is negligible.

Hi Tom,

Did you mean revert 877d1f6291f8(net: Set sk_txhash from a random number)? This
could fix the issue. So in normal case we calculate the sk_txhash using flow
dissector but in negative routing case we use the random hash, is this what you
want?

There seems to have other bug in this side. From my understanding, commit
265f94ff54d6(net: Recompute sk_txhash on negative routing advice) tries to
select a different route. But the multipath selection code
(rt6_multipath_select) doesn't use sk_txhash or skb->hash, it does use
fl6.flowlabel, but that is the flowlabel user sets. So looks like the commit
doesn't change anything.

What's the 'src port for UDP encap'? I can't find the code setting skb->hash
to sk_txhash in UDP side.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-15 22:42               ` Shaohua Li
@ 2017-08-16  0:15                 ` Tom Herbert
  2017-08-17 17:26                   ` Shaohua Li
  2017-08-17 22:55                   ` Martin KaFai Lau
  0 siblings, 2 replies; 32+ messages in thread
From: Tom Herbert @ 2017-08-16  0:15 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Linux Kernel Network Developers, David S. Miller, Martin KaFai Lau

On Tue, Aug 15, 2017 at 3:42 PM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Aug 15, 2017 at 07:08:31AM -0700, Tom Herbert wrote:
>> On Mon, Aug 14, 2017 at 7:52 PM, Shaohua Li <shli@kernel.org> wrote:
>> > On Fri, Aug 11, 2017 at 06:00:20PM -0700, Tom Herbert wrote:
>> >> On Thu, Aug 10, 2017 at 12:13 PM, Shaohua Li <shli@kernel.org> wrote:
>> >> > On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
>> >> >> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
>> >> >> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
>> >> >> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
>> >> >> >> > From: Shaohua Li <shli@fb.com>
>> >> >> >> >
>> >> >> >> > Please see below tcpdump output:
>> >> >> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
>> >> >> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
>> >> >> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> >> >> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
>> >> >> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> >> >> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
>> >> >> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> >> >> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
>> >> >> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> >> >> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
>> >> >> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
>> >> >> >> >
>> >> >> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
>> >> >> >> > (0xd827f) are different. This causes our router doesn't correctly close tcp
>> >> >> >> > connection. The patches try to fix the issue.
>> >> >> >> >
>> >> >> >> Shaohua,
>> >> >> >>
>> >> >> >> Can you give some more detail about what the router doesn't close the
>> >> >> >> TCP connection means? I'm guessing the problem is either: 1) the
>> >> >> >> router is maintaining connection state that includes the flow label in
>> >> >> >> a connection tuple. 2) some router in the path is maintaining
>> >> >> >> connection state, but when the flow label changes the flow's packet
>> >> >> >> are routed through a different router that doesn't have a state for
>> >> >> >> the flow it drops the packet. #1 should be easily fix in the router,
>> >> >> >> flow labels cannot be used as state. #2 is the known problem that
>> >> >> >> stateful firewalls have killed our ability to use multihoming.
>> >> >> >
>> >> >> > The #2 is exactly the problem we saw.
>> >> >> >
>> >> >> >> Another consideration is that sk_txhash is also used in routing
>> >> >> >> decisions by the local host (flow label is normally derived from
>> >> >> >> txhash). If you want to ensure that connections are routed
>> >> >> >> consistently for timewait state you might need sk_txhash saved also.
>> >> >> >
>> >> >> > As far as I understood, we don't use sk_txhash for routing selection. The code
>> >> >> > does routing selection with flowlabel user configured, at that time we don't
>> >> >> > derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
>> >> >> > does routing selection first and then uses ip6_make_flowlabel to build packet
>> >> >> > data where we derive flowlabel from skb->hash.
>> >> >> >
>> >> >> That is assuming one particular use case. Generally, if you want to
>> >> >> ensure all packets for a flow take the same path you'll need tx_hash
>> >> >> and make it persistent (disable flow bender). For instance, if you
>> >> >> were doing UDP encapsulation like in VXLAN the UDP source port
>> >> >> selection is unaffected by saved flow label for the lifetime of the
>> >> >> flow. So we would still hit #2 in that case and the stateful device
>> >> >> doesn't see whole flow. It might be just as easy to move tx_hash in
>> >> >> skc_common so that it's available in TW state for this purpose. Then
>> >> >> when moving to TW state just copy the tx_hash.
>> >> >
>> >> > Hi Tom,
>> >> >
>> >> > My original implementation is to add a tx_hash in tw sock, we then copy sock's
>> >> > tx_hash to the tw tx_hash. This does makes things simplier. One concern from
>> >> > Eric is this will increase the size of tw sock. If we move tx_hash to
>> >> > skc_common, all sock size will increase, is this acceptable?
>> >>
>> >> I think that can only be measured by how critical it is to
>> >> persistently route all packets the same exact way for every
>> >> connection. Page one of the IP book clearly states that IP packets can
>> >> be dropped, duplicated, or received out of order. Received OOO implies
>> >> that packet for the same flow are allowed to take different paths. The
>> >> requirement that packets for the same flow must always take the same
>> >> path through the network was created by stateful middleboxes-- it's
>> >> not inherent in the architecture of IP networking. Unfortunately,
>> >> we're seeing this become more and more of a problem as more devices
>> >> are multi-homed (like smart phones) and these network requirement
>> >> cripple our ability to take advantage of features like that.
>> >>
>> >> Personally, I wish the middleboxes fix the problem they created, but I
>> >> suppose we need to be pragmatic at least in the short term.
>> >
>> > Hmm, I still hesitate to add a new field in skc_common. Fixing current problem
>> > looks propriate in current stage. I'd defer fixing the generic issue till it's
>> > necessary.
>> >
>> Shaohua,
>>
>> An alternative would be to not initialize sk_txhash, but instead defer
>> hash computation to use flow dissector in the TX path when the hash is
>> needed (to get flow label, src port for UDP encap, route for
>> multipath, etc.). At the first hash computation in TX path  the result
>> in sk_txhash. TW state there is no socket so flow dissector is
>> always used but that should yield the same hash. No extra fields would
>> be needed and additional cost is negligible.
>
> Hi Tom,
>
> Did you mean revert 877d1f6291f8(net: Set sk_txhash from a random number)? This
> could fix the issue. So in normal case we calculate the sk_txhash using flow
> dissector but in negative routing case we use the random hash, is this what you
> want?
>
No. What you'd want is something like a sysctl that sets an alternate
mode for sk_txhash processing. sk_txhash is derived from flow
dissector for the first TX packet and then it's never allowed to
change. Maybe this should be called persistent-hash mode.

> There seems to have other bug in this side. From my understanding, commit
> 265f94ff54d6(net: Recompute sk_txhash on negative routing advice) tries to
> select a different route. But the multipath selection code
> (rt6_multipath_select) doesn't use sk_txhash or skb->hash, it does use
> fl6.flowlabel, but that is the flowlabel user sets. So looks like the commit
> doesn't change anything.
>
The routing functions typically don't use sock of skbuff, but use flow
structs instead. It may be reasonable to add a hash to those.

> What's the 'src port for UDP encap'? I can't find the code setting skb->hash
> to sk_txhash in UDP side.
>
udp_flow_src_port is function call by UDP encaps to set source port.
This is call skb_get_hash. sk_set_txhash is function to set txhash
right now to random value. skb_set_hash_from_sk set skb->hash when
skbuff is owned by socket (skb_set_owner_w).

Thanks,
Tom


> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-16  0:15                 ` Tom Herbert
@ 2017-08-17 17:26                   ` Shaohua Li
  2017-08-17 19:07                     ` Tom Herbert
  2017-08-17 22:55                   ` Martin KaFai Lau
  1 sibling, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-08-17 17:26 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller, Martin KaFai Lau

On Tue, Aug 15, 2017 at 05:15:46PM -0700, Tom Herbert wrote:
> On Tue, Aug 15, 2017 at 3:42 PM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Aug 15, 2017 at 07:08:31AM -0700, Tom Herbert wrote:
> >> On Mon, Aug 14, 2017 at 7:52 PM, Shaohua Li <shli@kernel.org> wrote:
> >> > On Fri, Aug 11, 2017 at 06:00:20PM -0700, Tom Herbert wrote:
> >> >> On Thu, Aug 10, 2017 at 12:13 PM, Shaohua Li <shli@kernel.org> wrote:
> >> >> > On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
> >> >> >> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
> >> >> >> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
> >> >> >> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
> >> >> >> >> > From: Shaohua Li <shli@fb.com>
> >> >> >> >> >
> >> >> >> >> > Please see below tcpdump output:
> >> >> >> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
> >> >> >> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
> >> >> >> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> >> >> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
> >> >> >> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> >> >> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
> >> >> >> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> >> >> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
> >> >> >> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> >> >> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
> >> >> >> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
> >> >> >> >> >
> >> >> >> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
> >> >> >> >> > (0xd827f) are different. This causes our router doesn't correctly close tcp
> >> >> >> >> > connection. The patches try to fix the issue.
> >> >> >> >> >
> >> >> >> >> Shaohua,
> >> >> >> >>
> >> >> >> >> Can you give some more detail about what the router doesn't close the
> >> >> >> >> TCP connection means? I'm guessing the problem is either: 1) the
> >> >> >> >> router is maintaining connection state that includes the flow label in
> >> >> >> >> a connection tuple. 2) some router in the path is maintaining
> >> >> >> >> connection state, but when the flow label changes the flow's packet
> >> >> >> >> are routed through a different router that doesn't have a state for
> >> >> >> >> the flow it drops the packet. #1 should be easily fix in the router,
> >> >> >> >> flow labels cannot be used as state. #2 is the known problem that
> >> >> >> >> stateful firewalls have killed our ability to use multihoming.
> >> >> >> >
> >> >> >> > The #2 is exactly the problem we saw.
> >> >> >> >
> >> >> >> >> Another consideration is that sk_txhash is also used in routing
> >> >> >> >> decisions by the local host (flow label is normally derived from
> >> >> >> >> txhash). If you want to ensure that connections are routed
> >> >> >> >> consistently for timewait state you might need sk_txhash saved also.
> >> >> >> >
> >> >> >> > As far as I understood, we don't use sk_txhash for routing selection. The code
> >> >> >> > does routing selection with flowlabel user configured, at that time we don't
> >> >> >> > derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
> >> >> >> > does routing selection first and then uses ip6_make_flowlabel to build packet
> >> >> >> > data where we derive flowlabel from skb->hash.
> >> >> >> >
> >> >> >> That is assuming one particular use case. Generally, if you want to
> >> >> >> ensure all packets for a flow take the same path you'll need tx_hash
> >> >> >> and make it persistent (disable flow bender). For instance, if you
> >> >> >> were doing UDP encapsulation like in VXLAN the UDP source port
> >> >> >> selection is unaffected by saved flow label for the lifetime of the
> >> >> >> flow. So we would still hit #2 in that case and the stateful device
> >> >> >> doesn't see whole flow. It might be just as easy to move tx_hash in
> >> >> >> skc_common so that it's available in TW state for this purpose. Then
> >> >> >> when moving to TW state just copy the tx_hash.
> >> >> >
> >> >> > Hi Tom,
> >> >> >
> >> >> > My original implementation is to add a tx_hash in tw sock, we then copy sock's
> >> >> > tx_hash to the tw tx_hash. This does makes things simplier. One concern from
> >> >> > Eric is this will increase the size of tw sock. If we move tx_hash to
> >> >> > skc_common, all sock size will increase, is this acceptable?
> >> >>
> >> >> I think that can only be measured by how critical it is to
> >> >> persistently route all packets the same exact way for every
> >> >> connection. Page one of the IP book clearly states that IP packets can
> >> >> be dropped, duplicated, or received out of order. Received OOO implies
> >> >> that packet for the same flow are allowed to take different paths. The
> >> >> requirement that packets for the same flow must always take the same
> >> >> path through the network was created by stateful middleboxes-- it's
> >> >> not inherent in the architecture of IP networking. Unfortunately,
> >> >> we're seeing this become more and more of a problem as more devices
> >> >> are multi-homed (like smart phones) and these network requirement
> >> >> cripple our ability to take advantage of features like that.
> >> >>
> >> >> Personally, I wish the middleboxes fix the problem they created, but I
> >> >> suppose we need to be pragmatic at least in the short term.
> >> >
> >> > Hmm, I still hesitate to add a new field in skc_common. Fixing current problem
> >> > looks propriate in current stage. I'd defer fixing the generic issue till it's
> >> > necessary.
> >> >
> >> Shaohua,
> >>
> >> An alternative would be to not initialize sk_txhash, but instead defer
> >> hash computation to use flow dissector in the TX path when the hash is
> >> needed (to get flow label, src port for UDP encap, route for
> >> multipath, etc.). At the first hash computation in TX path  the result
> >> in sk_txhash. TW state there is no socket so flow dissector is
> >> always used but that should yield the same hash. No extra fields would
> >> be needed and additional cost is negligible.
> >
> > Hi Tom,
> >
> > Did you mean revert 877d1f6291f8(net: Set sk_txhash from a random number)? This
> > could fix the issue. So in normal case we calculate the sk_txhash using flow
> > dissector but in negative routing case we use the random hash, is this what you
> > want?
> >
> No. What you'd want is something like a sysctl that sets an alternate
> mode for sk_txhash processing. sk_txhash is derived from flow
> dissector for the first TX packet and then it's never allowed to
> change. Maybe this should be called persistent-hash mode.

persistent-hash will almostly mean disabling auto flowlabel. We'd prefer
disabling auto flowlabel rather than the persistent-hash. But I think we do
want to use the auto flowlabel for load balancing, so the persistent-hash isn't
a go for us.

Let's take the other way. Your main concern is the multipath selection. I think
this can be fixed. We can still use the tw->tw_flowlabel to store the auto
created flowlabel like what my patch does, but we use an extra bit in tw_pad of
inet_timewait_sock to indicate this flowlabel is auto created. In the reset
packet send path, we don't use the auto created flowlabel for route slection,
but just for packet data. This will keep current multipath selection behavior
and fix the reset packet flowlabel issue.
 
> > There seems to have other bug in this side. From my understanding, commit
> > 265f94ff54d6(net: Recompute sk_txhash on negative routing advice) tries to
> > select a different route. But the multipath selection code
> > (rt6_multipath_select) doesn't use sk_txhash or skb->hash, it does use
> > fl6.flowlabel, but that is the flowlabel user sets. So looks like the commit
> > doesn't change anything.
> >
> The routing functions typically don't use sock of skbuff, but use flow
> structs instead. It may be reasonable to add a hash to those.

I'll look at this issue later. It's not related to current flowlabel problem.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-17 17:26                   ` Shaohua Li
@ 2017-08-17 19:07                     ` Tom Herbert
  0 siblings, 0 replies; 32+ messages in thread
From: Tom Herbert @ 2017-08-17 19:07 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Linux Kernel Network Developers, David S. Miller, Martin KaFai Lau

On Thu, Aug 17, 2017 at 10:26 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Aug 15, 2017 at 05:15:46PM -0700, Tom Herbert wrote:
>> On Tue, Aug 15, 2017 at 3:42 PM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Aug 15, 2017 at 07:08:31AM -0700, Tom Herbert wrote:
>> >> On Mon, Aug 14, 2017 at 7:52 PM, Shaohua Li <shli@kernel.org> wrote:
>> >> > On Fri, Aug 11, 2017 at 06:00:20PM -0700, Tom Herbert wrote:
>> >> >> On Thu, Aug 10, 2017 at 12:13 PM, Shaohua Li <shli@kernel.org> wrote:
>> >> >> > On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
>> >> >> >> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
>> >> >> >> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
>> >> >> >> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
>> >> >> >> >> > From: Shaohua Li <shli@fb.com>
>> >> >> >> >> >
>> >> >> >> >> > Please see below tcpdump output:
>> >> >> >> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
>> >> >> >> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
>> >> >> >> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> >> >> >> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
>> >> >> >> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
>> >> >> >> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
>> >> >> >> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> >> >> >> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
>> >> >> >> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
>> >> >> >> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
>> >> >> >> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
>> >> >> >> >> >
>> >> >> >> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
>> >> >> >> >> > (0xd827f) are different. This causes our router doesn't correctly close tcp
>> >> >> >> >> > connection. The patches try to fix the issue.
>> >> >> >> >> >
>> >> >> >> >> Shaohua,
>> >> >> >> >>
>> >> >> >> >> Can you give some more detail about what the router doesn't close the
>> >> >> >> >> TCP connection means? I'm guessing the problem is either: 1) the
>> >> >> >> >> router is maintaining connection state that includes the flow label in
>> >> >> >> >> a connection tuple. 2) some router in the path is maintaining
>> >> >> >> >> connection state, but when the flow label changes the flow's packet
>> >> >> >> >> are routed through a different router that doesn't have a state for
>> >> >> >> >> the flow it drops the packet. #1 should be easily fix in the router,
>> >> >> >> >> flow labels cannot be used as state. #2 is the known problem that
>> >> >> >> >> stateful firewalls have killed our ability to use multihoming.
>> >> >> >> >
>> >> >> >> > The #2 is exactly the problem we saw.
>> >> >> >> >
>> >> >> >> >> Another consideration is that sk_txhash is also used in routing
>> >> >> >> >> decisions by the local host (flow label is normally derived from
>> >> >> >> >> txhash). If you want to ensure that connections are routed
>> >> >> >> >> consistently for timewait state you might need sk_txhash saved also.
>> >> >> >> >
>> >> >> >> > As far as I understood, we don't use sk_txhash for routing selection. The code
>> >> >> >> > does routing selection with flowlabel user configured, at that time we don't
>> >> >> >> > derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
>> >> >> >> > does routing selection first and then uses ip6_make_flowlabel to build packet
>> >> >> >> > data where we derive flowlabel from skb->hash.
>> >> >> >> >
>> >> >> >> That is assuming one particular use case. Generally, if you want to
>> >> >> >> ensure all packets for a flow take the same path you'll need tx_hash
>> >> >> >> and make it persistent (disable flow bender). For instance, if you
>> >> >> >> were doing UDP encapsulation like in VXLAN the UDP source port
>> >> >> >> selection is unaffected by saved flow label for the lifetime of the
>> >> >> >> flow. So we would still hit #2 in that case and the stateful device
>> >> >> >> doesn't see whole flow. It might be just as easy to move tx_hash in
>> >> >> >> skc_common so that it's available in TW state for this purpose. Then
>> >> >> >> when moving to TW state just copy the tx_hash.
>> >> >> >
>> >> >> > Hi Tom,
>> >> >> >
>> >> >> > My original implementation is to add a tx_hash in tw sock, we then copy sock's
>> >> >> > tx_hash to the tw tx_hash. This does makes things simplier. One concern from
>> >> >> > Eric is this will increase the size of tw sock. If we move tx_hash to
>> >> >> > skc_common, all sock size will increase, is this acceptable?
>> >> >>
>> >> >> I think that can only be measured by how critical it is to
>> >> >> persistently route all packets the same exact way for every
>> >> >> connection. Page one of the IP book clearly states that IP packets can
>> >> >> be dropped, duplicated, or received out of order. Received OOO implies
>> >> >> that packet for the same flow are allowed to take different paths. The
>> >> >> requirement that packets for the same flow must always take the same
>> >> >> path through the network was created by stateful middleboxes-- it's
>> >> >> not inherent in the architecture of IP networking. Unfortunately,
>> >> >> we're seeing this become more and more of a problem as more devices
>> >> >> are multi-homed (like smart phones) and these network requirement
>> >> >> cripple our ability to take advantage of features like that.
>> >> >>
>> >> >> Personally, I wish the middleboxes fix the problem they created, but I
>> >> >> suppose we need to be pragmatic at least in the short term.
>> >> >
>> >> > Hmm, I still hesitate to add a new field in skc_common. Fixing current problem
>> >> > looks propriate in current stage. I'd defer fixing the generic issue till it's
>> >> > necessary.
>> >> >
>> >> Shaohua,
>> >>
>> >> An alternative would be to not initialize sk_txhash, but instead defer
>> >> hash computation to use flow dissector in the TX path when the hash is
>> >> needed (to get flow label, src port for UDP encap, route for
>> >> multipath, etc.). At the first hash computation in TX path  the result
>> >> in sk_txhash. TW state there is no socket so flow dissector is
>> >> always used but that should yield the same hash. No extra fields would
>> >> be needed and additional cost is negligible.
>> >
>> > Hi Tom,
>> >
>> > Did you mean revert 877d1f6291f8(net: Set sk_txhash from a random number)? This
>> > could fix the issue. So in normal case we calculate the sk_txhash using flow
>> > dissector but in negative routing case we use the random hash, is this what you
>> > want?
>> >
>> No. What you'd want is something like a sysctl that sets an alternate
>> mode for sk_txhash processing. sk_txhash is derived from flow
>> dissector for the first TX packet and then it's never allowed to
>> change. Maybe this should be called persistent-hash mode.
>
> persistent-hash will almostly mean disabling auto flowlabel. We'd prefer
> disabling auto flowlabel rather than the persistent-hash. But I think we do
> want to use the auto flowlabel for load balancing, so the persistent-hash isn't
> a go for us.
>
Those are two separate features here. Auto flow label means
automatically generating flow labels for packets based on some hash. A
persistent hash means that once a hash is determined for a connection
it never changes, so the flow label derived from the hash never
changes.

> Let's take the other way. Your main concern is the multipath selection. I think
> this can be fixed. We can still use the tw->tw_flowlabel to store the auto
> created flowlabel like what my patch does, but we use an extra bit in tw_pad of
> inet_timewait_sock to indicate this flowlabel is auto created. In the reset
> packet send path, we don't use the auto created flowlabel for route slection,
> but just for packet data. This will keep current multipath selection behavior
> and fix the reset packet flowlabel issue.
>
>> > There seems to have other bug in this side. From my understanding, commit
>> > 265f94ff54d6(net: Recompute sk_txhash on negative routing advice) tries to
>> > select a different route. But the multipath selection code
>> > (rt6_multipath_select) doesn't use sk_txhash or skb->hash, it does use
>> > fl6.flowlabel, but that is the flowlabel user sets. So looks like the commit
>> > doesn't change anything.
>> >
>> The routing functions typically don't use sock of skbuff, but use flow
>> structs instead. It may be reasonable to add a hash to those.
>
> I'll look at this issue later. It's not related to current flowlabel problem.
>
As I mentioned already, this is not a problem specific to flowlabels.
This is a problem about getting all the elements of the system work
together to ensure that all packets of flow are routed they same way
in order to meet a requirement imposed by middleboxes. You can solve
this problem just to make flow labels work, but IMO that's just
kicking the can down the road for someone else to have to debug and
solve when they hit the same underlying issue but in a different
context.

Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-16  0:15                 ` Tom Herbert
  2017-08-17 17:26                   ` Shaohua Li
@ 2017-08-17 22:55                   ` Martin KaFai Lau
  2017-08-18 14:50                     ` Tom Herbert
  1 sibling, 1 reply; 32+ messages in thread
From: Martin KaFai Lau @ 2017-08-17 22:55 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Shaohua Li, Linux Kernel Network Developers, David S. Miller

On Tue, Aug 15, 2017 at 05:15:46PM -0700, Tom Herbert wrote:
> On Tue, Aug 15, 2017 at 3:42 PM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Aug 15, 2017 at 07:08:31AM -0700, Tom Herbert wrote:
> >> On Mon, Aug 14, 2017 at 7:52 PM, Shaohua Li <shli@kernel.org> wrote:
> >> > On Fri, Aug 11, 2017 at 06:00:20PM -0700, Tom Herbert wrote:
> >> >> On Thu, Aug 10, 2017 at 12:13 PM, Shaohua Li <shli@kernel.org> wrote:
> >> >> > On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
> >> >> >> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li <shli@kernel.org> wrote:
> >> >> >> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
> >> >> >> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li <shli@kernel.org> wrote:
> >> >> >> >> > From: Shaohua Li <shli@fb.com>
> >> >> >> >> >
> >> >> >> >> > Please see below tcpdump output:
> >> >> >> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
> >> >> >> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
> >> >> >> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> >> >> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
> >> >> >> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
> >> >> >> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
> >> >> >> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> >> >> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
> >> >> >> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
> >> >> >> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.5555: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
> >> >> >> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456.5555 > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0
> >> >> >> >> >
> >> >> >> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
> >> >> >> >> > (0xd827f) are different. This causes our router doesn't correctly close tcp
> >> >> >> >> > connection. The patches try to fix the issue.
> >> >> >> >> >
> >> >> >> >> Shaohua,
> >> >> >> >>
> >> >> >> >> Can you give some more detail about what the router doesn't close the
> >> >> >> >> TCP connection means? I'm guessing the problem is either: 1) the
> >> >> >> >> router is maintaining connection state that includes the flow label in
> >> >> >> >> a connection tuple. 2) some router in the path is maintaining
> >> >> >> >> connection state, but when the flow label changes the flow's packet
> >> >> >> >> are routed through a different router that doesn't have a state for
> >> >> >> >> the flow it drops the packet. #1 should be easily fix in the router,
> >> >> >> >> flow labels cannot be used as state. #2 is the known problem that
> >> >> >> >> stateful firewalls have killed our ability to use multihoming.
> >> >> >> >
> >> >> >> > The #2 is exactly the problem we saw.
> >> >> >> >
> >> >> >> >> Another consideration is that sk_txhash is also used in routing
> >> >> >> >> decisions by the local host (flow label is normally derived from
> >> >> >> >> txhash). If you want to ensure that connections are routed
> >> >> >> >> consistently for timewait state you might need sk_txhash saved also.
> >> >> >> >
> >> >> >> > As far as I understood, we don't use sk_txhash for routing selection. The code
> >> >> >> > does routing selection with flowlabel user configured, at that time we don't
> >> >> >> > derive fl6.flowlabel from skb->hash (which is from sk_txhash). The code always
> >> >> >> > does routing selection first and then uses ip6_make_flowlabel to build packet
> >> >> >> > data where we derive flowlabel from skb->hash.
> >> >> >> >
> >> >> >> That is assuming one particular use case. Generally, if you want to
> >> >> >> ensure all packets for a flow take the same path you'll need tx_hash
> >> >> >> and make it persistent (disable flow bender). For instance, if you
> >> >> >> were doing UDP encapsulation like in VXLAN the UDP source port
> >> >> >> selection is unaffected by saved flow label for the lifetime of the
> >> >> >> flow. So we would still hit #2 in that case and the stateful device
> >> >> >> doesn't see whole flow. It might be just as easy to move tx_hash in
> >> >> >> skc_common so that it's available in TW state for this purpose. Then
> >> >> >> when moving to TW state just copy the tx_hash.
> >> >> >
> >> >> > Hi Tom,
> >> >> >
> >> >> > My original implementation is to add a tx_hash in tw sock, we then copy sock's
> >> >> > tx_hash to the tw tx_hash. This does makes things simplier. One concern from
> >> >> > Eric is this will increase the size of tw sock. If we move tx_hash to
> >> >> > skc_common, all sock size will increase, is this acceptable?
> >> >>
> >> >> I think that can only be measured by how critical it is to
> >> >> persistently route all packets the same exact way for every
> >> >> connection. Page one of the IP book clearly states that IP packets can
> >> >> be dropped, duplicated, or received out of order. Received OOO implies
> >> >> that packet for the same flow are allowed to take different paths. The
> >> >> requirement that packets for the same flow must always take the same
> >> >> path through the network was created by stateful middleboxes-- it's
> >> >> not inherent in the architecture of IP networking. Unfortunately,
> >> >> we're seeing this become more and more of a problem as more devices
> >> >> are multi-homed (like smart phones) and these network requirement
> >> >> cripple our ability to take advantage of features like that.
> >> >>
> >> >> Personally, I wish the middleboxes fix the problem they created, but I
> >> >> suppose we need to be pragmatic at least in the short term.
> >> >
> >> > Hmm, I still hesitate to add a new field in skc_common. Fixing current problem
> >> > looks propriate in current stage. I'd defer fixing the generic issue till it's
> >> > necessary.
> >> >
> >> Shaohua,
> >>
> >> An alternative would be to not initialize sk_txhash, but instead defer
> >> hash computation to use flow dissector in the TX path when the hash is
> >> needed (to get flow label, src port for UDP encap, route for
> >> multipath, etc.). At the first hash computation in TX path  the result
> >> in sk_txhash. TW state there is no socket so flow dissector is
> >> always used but that should yield the same hash. No extra fields would
> >> be needed and additional cost is negligible.
> >
> > Hi Tom,
> >
> > Did you mean revert 877d1f6291f8(net: Set sk_txhash from a random number)? This
> > could fix the issue. So in normal case we calculate the sk_txhash using flow
> > dissector but in negative routing case we use the random hash, is this what you
> > want?
> >
> No. What you'd want is something like a sysctl that sets an alternate
> mode for sk_txhash processing. sk_txhash is derived from flow
> dissector for the first TX packet and then it's never allowed to
> change. Maybe this should be called persistent-hash mode.
Hi Tom,

We had been using the auto_flowlabels=1 (i.e. essentially enable flowlabel)
mainly because we want to take the benefit of dst_negative_advice() when
tcp_write_timeout() happens.

During our test, our system handles quite well with changing flowlabel.
The only exception we have hit is the TCP_RST sent from an inet_timewait_sock.

If we keep the flowlabel consistent (or persistent sk_txhash), there
is no practical usage for us to turn on flowlabel and the problem also goes
away.  We have it off for now.

>
> > There seems to have other bug in this side. From my understanding, commit
> > 265f94ff54d6(net: Recompute sk_txhash on negative routing advice) tries to
> > select a different route. But the multipath selection code
> > (rt6_multipath_select) doesn't use sk_txhash or skb->hash, it does use
> > fl6.flowlabel, but that is the flowlabel user sets. So looks like the commit
> > doesn't change anything.
> >
> The routing functions typically don't use sock of skbuff, but use flow
> structs instead. It may be reasonable to add a hash to those.
The localhost's mutlipath selection is another existing issue.  AFAICT,
it does not take the sk_txhash (or skb->hash) into account and the following
dst_negative_advice() will also have no effect in the route selction.
It is another issue to be fixed and to be figured out how to pass the
sk_txhash down. (1)

Shaohua is proposing to record the 20 bits of the sk_txhash in the
tw_flowlabel of the 'struct inet_timewait_sock'.  The tw_flowlabel could
potentially be used to do the multipath selection once we figured out
how to tackle (1).

Thanks,
Martin


>
> > What's the 'src port for UDP encap'? I can't find the code setting skb->hash
> > to sk_txhash in UDP side.
> >
> udp_flow_src_port is function call by UDP encaps to set source port.
> This is call skb_get_hash. sk_set_txhash is function to set txhash
> right now to random value. skb_set_hash_from_sk set skb->hash when
> skbuff is owned by socket (skb_set_owner_w).
>
> Thanks,
> Tom
>
>
> > Thanks,
> > Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-17 22:55                   ` Martin KaFai Lau
@ 2017-08-18 14:50                     ` Tom Herbert
  2017-08-18 20:51                       ` Martin KaFai Lau
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2017-08-18 14:50 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Shaohua Li, Linux Kernel Network Developers, David S. Miller

> We had been using the auto_flowlabels=1 (i.e. essentially enable flowlabel)
> mainly because we want to take the benefit of dst_negative_advice() when
> tcp_write_timeout() happens.
>
> During our test, our system handles quite well with changing flowlabel.
> The only exception we have hit is the TCP_RST sent from an inet_timewait_sock.
>
Martin,

That is interesting data. Have you determined why the middlebox has a
problem with flow label change in TW state but not other states?

Tom

> If we keep the flowlabel consistent (or persistent sk_txhash), there
> is no practical usage for us to turn on flowlabel and the problem also goes
> away.  We have it off for now.
>
>>
>> > There seems to have other bug in this side. From my understanding, commit
>> > 265f94ff54d6(net: Recompute sk_txhash on negative routing advice) tries to
>> > select a different route. But the multipath selection code
>> > (rt6_multipath_select) doesn't use sk_txhash or skb->hash, it does use
>> > fl6.flowlabel, but that is the flowlabel user sets. So looks like the commit
>> > doesn't change anything.
>> >
>> The routing functions typically don't use sock of skbuff, but use flow
>> structs instead. It may be reasonable to add a hash to those.
> The localhost's mutlipath selection is another existing issue.  AFAICT,
> it does not take the sk_txhash (or skb->hash) into account and the following
> dst_negative_advice() will also have no effect in the route selction.
> It is another issue to be fixed and to be figured out how to pass the
> sk_txhash down. (1)
>
> Shaohua is proposing to record the 20 bits of the sk_txhash in the
> tw_flowlabel of the 'struct inet_timewait_sock'.  The tw_flowlabel could
> potentially be used to do the multipath selection once we figured out
> how to tackle (1).
>
> Thanks,
> Martin
>
>
>>
>> > What's the 'src port for UDP encap'? I can't find the code setting skb->hash
>> > to sk_txhash in UDP side.
>> >
>> udp_flow_src_port is function call by UDP encaps to set source port.
>> This is call skb_get_hash. sk_set_txhash is function to set txhash
>> right now to random value. skb_set_hash_from_sk set skb->hash when
>> skbuff is owned by socket (skb_set_owner_w).
>>
>> Thanks,
>> Tom
>>
>>
>> > Thanks,
>> > Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-18 14:50                     ` Tom Herbert
@ 2017-08-18 20:51                       ` Martin KaFai Lau
  2017-08-18 22:27                         ` David Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Martin KaFai Lau @ 2017-08-18 20:51 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Shaohua Li, Linux Kernel Network Developers, David S. Miller

On Fri, Aug 18, 2017 at 07:50:03AM -0700, Tom Herbert wrote:
> > We had been using the auto_flowlabels=1 (i.e. essentially enable flowlabel)
> > mainly because we want to take the benefit of dst_negative_advice() when
> > tcp_write_timeout() happens.
> >
> > During our test, our system handles quite well with changing flowlabel.
> > The only exception we have hit is the TCP_RST sent from an inet_timewait_sock.
> >
> Martin,
>
> That is interesting data. Have you determined why the middlebox has a
> problem with flow label change in TW state but not other states?
Tom,

The problem of this middle box is specific to TCP_RST with a different
flowlabel from its previous packets.  Sending TCP_RST from TW state hits
this pain point.

It seems like that middle box specifically drops TCP_RST if it
does not know anything about this flow.  Since the flowlabel of the TCP_RST
(sent in TW state) is always different, it always lands to a different middle
box.  All of these TCP_RST cannot be delivered.

We are resilience to a small number of TCP_RST drop.  However, this guarantee
flowlabel change on TCP_RST and then dropped is too much.

This flowlabel change does not look like intentional either when transitioning
from full sk to tw sk (tw->tw_flowlabel is inheriting the np->flow_label
in tcp_time_wait()).  Currently, the tw_flowlabel is used in tcp_v6_timewait_ack()
but not in tcp_v6_send_reset().  Hence,  shaohua is looking for a solution to solve
them together.

Thanks,
Martin

>
> Tom
>
> > If we keep the flowlabel consistent (or persistent sk_txhash), there
> > is no practical usage for us to turn on flowlabel and the problem also goes
> > away.  We have it off for now.
> >
> >>
> >> > There seems to have other bug in this side. From my understanding, commit
> >> > 265f94ff54d6(net: Recompute sk_txhash on negative routing advice) tries to
> >> > select a different route. But the multipath selection code
> >> > (rt6_multipath_select) doesn't use sk_txhash or skb->hash, it does use
> >> > fl6.flowlabel, but that is the flowlabel user sets. So looks like the commit
> >> > doesn't change anything.
> >> >
> >> The routing functions typically don't use sock of skbuff, but use flow
> >> structs instead. It may be reasonable to add a hash to those.
> > The localhost's mutlipath selection is another existing issue.  AFAICT,
> > it does not take the sk_txhash (or skb->hash) into account and the following
> > dst_negative_advice() will also have no effect in the route selction.
> > It is another issue to be fixed and to be figured out how to pass the
> > sk_txhash down. (1)
> >
> > Shaohua is proposing to record the 20 bits of the sk_txhash in the
> > tw_flowlabel of the 'struct inet_timewait_sock'.  The tw_flowlabel could
> > potentially be used to do the multipath selection once we figured out
> > how to tackle (1).
> >
> > Thanks,
> > Martin
> >
> >
> >>
> >> > What's the 'src port for UDP encap'? I can't find the code setting skb->hash
> >> > to sk_txhash in UDP side.
> >> >
> >> udp_flow_src_port is function call by UDP encaps to set source port.
> >> This is call skb_get_hash. sk_set_txhash is function to set txhash
> >> right now to random value. skb_set_hash_from_sk set skb->hash when
> >> skbuff is owned by socket (skb_set_owner_w).
> >>
> >> Thanks,
> >> Tom
> >>
> >>
> >> > Thanks,
> >> > Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-18 20:51                       ` Martin KaFai Lau
@ 2017-08-18 22:27                         ` David Miller
  2017-11-08 17:44                           ` Tom Herbert
  0 siblings, 1 reply; 32+ messages in thread
From: David Miller @ 2017-08-18 22:27 UTC (permalink / raw)
  To: kafai; +Cc: tom, shli, netdev

From: Martin KaFai Lau <kafai@fb.com>
Date: Fri, 18 Aug 2017 13:51:36 -0700

> It seems like that middle box specifically drops TCP_RST if it
> does not know anything about this flow.  Since the flowlabel of the TCP_RST
> (sent in TW state) is always different, it always lands to a different middle
> box.  All of these TCP_RST cannot be delivered.

This really is illegal behavior.  The flow label is not a flow _KEY_
by any definition whatsoever.

Flow labels are an optimization, not a determinant for flow matching
particularly for proper TCP state processing.

I'd rather you invest all of this energy getting that vendor to fix
their kit.

Thank you.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-08-18 22:27                         ` David Miller
@ 2017-11-08 17:44                           ` Tom Herbert
  2017-11-08 20:01                             ` Tom Herbert
  2017-11-14 18:24                             ` Shaohua Li
  0 siblings, 2 replies; 32+ messages in thread
From: Tom Herbert @ 2017-11-08 17:44 UTC (permalink / raw)
  To: David Miller; +Cc: Martin Lau, Shaohua Li, Linux Kernel Network Developers

On Fri, Aug 18, 2017 at 3:27 PM, David Miller <davem@davemloft.net> wrote:
> From: Martin KaFai Lau <kafai@fb.com>
> Date: Fri, 18 Aug 2017 13:51:36 -0700
>
>> It seems like that middle box specifically drops TCP_RST if it
>> does not know anything about this flow.  Since the flowlabel of the TCP_RST
>> (sent in TW state) is always different, it always lands to a different middle
>> box.  All of these TCP_RST cannot be delivered.
>
> This really is illegal behavior.  The flow label is not a flow _KEY_
> by any definition whatsoever.
>
> Flow labels are an optimization, not a determinant for flow matching
> particularly for proper TCP state processing.
>
> I'd rather you invest all of this energy getting that vendor to fix
> their kit.
>
We're now seeing several router vendors recommending people to not use
flow labels for ECMP hashing. This is precisely because when a flow
label changes, network devices that maintain state (firewalls, NAT,
load balancers) can't deal with packets being rerouted so connections
are dropped. Unfortunately, the need for packets of a flow to always
follow the same path has become an implicit requirement that I think
we need follow at least as the default behavior.

Martin: is there any change you could resurrect these patches? In
order to solve the general problem of making routing consistent, I
believe we want to keep sk_tx_hash consistent for the connection from
which a consistent flow label can be derived. To avoid the overhead of
a hash field in sk_common, maybe we could initially set a connection
hash to a five-tuple hash for a flow instead of a random value? So in
TW state the consistent hash can be computed on the fly.

Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-11-08 17:44                           ` Tom Herbert
@ 2017-11-08 20:01                             ` Tom Herbert
  2017-11-08 21:41                               ` Martin KaFai Lau
  2017-11-14 18:24                             ` Shaohua Li
  1 sibling, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2017-11-08 20:01 UTC (permalink / raw)
  To: David Miller; +Cc: Martin Lau, Shaohua Li, Linux Kernel Network Developers

On Wed, Nov 8, 2017 at 9:44 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Fri, Aug 18, 2017 at 3:27 PM, David Miller <davem@davemloft.net> wrote:
>> From: Martin KaFai Lau <kafai@fb.com>
>> Date: Fri, 18 Aug 2017 13:51:36 -0700
>>
>>> It seems like that middle box specifically drops TCP_RST if it
>>> does not know anything about this flow.  Since the flowlabel of the TCP_RST
>>> (sent in TW state) is always different, it always lands to a different middle
>>> box.  All of these TCP_RST cannot be delivered.
>>
>> This really is illegal behavior.  The flow label is not a flow _KEY_
>> by any definition whatsoever.
>>
>> Flow labels are an optimization, not a determinant for flow matching
>> particularly for proper TCP state processing.
>>
>> I'd rather you invest all of this energy getting that vendor to fix
>> their kit.
>>
> We're now seeing several router vendors recommending people to not use
> flow labels for ECMP hashing. This is precisely because when a flow
> label changes, network devices that maintain state (firewalls, NAT,
> load balancers) can't deal with packets being rerouted so connections
> are dropped. Unfortunately, the need for packets of a flow to always
> follow the same path has become an implicit requirement that I think
> we need follow at least as the default behavior.
>
> Martin: is there any change you could resurrect these patches? In
> order to solve the general problem of making routing consistent, I
> believe we want to keep sk_tx_hash consistent for the connection from
> which a consistent flow label can be derived. To avoid the overhead of
> a hash field in sk_common, maybe we could initially set a connection
> hash to a five-tuple hash for a flow instead of a random value? So in
> TW state the consistent hash can be computed on the fly.
>
Sorry, I failed to give credit to Shaohua for submitting the initial
patch. Please take look!

> Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-11-08 20:01                             ` Tom Herbert
@ 2017-11-08 21:41                               ` Martin KaFai Lau
  0 siblings, 0 replies; 32+ messages in thread
From: Martin KaFai Lau @ 2017-11-08 21:41 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Shaohua Li, Linux Kernel Network Developers

On Wed, Nov 08, 2017 at 12:01:53PM -0800, Tom Herbert wrote:
> On Wed, Nov 8, 2017 at 9:44 AM, Tom Herbert <tom@herbertland.com> wrote:
> >>
> > We're now seeing several router vendors recommending people to not use
> > flow labels for ECMP hashing. This is precisely because when a flow
> > label changes, network devices that maintain state (firewalls, NAT,
> > load balancers) can't deal with packets being rerouted so connections
> > are dropped. Unfortunately, the need for packets of a flow to always
> > follow the same path has become an implicit requirement that I think
> > we need follow at least as the default behavior.
> >
> > Martin: is there any change you could resurrect these patches? In
> > order to solve the general problem of making routing consistent, I
> > believe we want to keep sk_tx_hash consistent for the connection from
> > which a consistent flow label can be derived. To avoid the overhead of
> > a hash field in sk_common, maybe we could initially set a connection
> > hash to a five-tuple hash for a flow instead of a random value? So in
> > TW state the consistent hash can be computed on the fly.
> >
> Sorry, I failed to give credit to Shaohua for submitting the initial
> patch. Please take look!
Hi Tom, thanks for the info.  Shaohua will revisit this
when he returns next week.

Martin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-11-08 17:44                           ` Tom Herbert
  2017-11-08 20:01                             ` Tom Herbert
@ 2017-11-14 18:24                             ` Shaohua Li
  2017-11-14 19:13                               ` Tom Herbert
  1 sibling, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-11-14 18:24 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Martin Lau, Linux Kernel Network Developers

On Wed, Nov 08, 2017 at 09:44:51AM -0800, Tom Herbert wrote:
> On Fri, Aug 18, 2017 at 3:27 PM, David Miller <davem@davemloft.net> wrote:
> > From: Martin KaFai Lau <kafai@fb.com>
> > Date: Fri, 18 Aug 2017 13:51:36 -0700
> >
> >> It seems like that middle box specifically drops TCP_RST if it
> >> does not know anything about this flow.  Since the flowlabel of the TCP_RST
> >> (sent in TW state) is always different, it always lands to a different middle
> >> box.  All of these TCP_RST cannot be delivered.
> >
> > This really is illegal behavior.  The flow label is not a flow _KEY_
> > by any definition whatsoever.
> >
> > Flow labels are an optimization, not a determinant for flow matching
> > particularly for proper TCP state processing.
> >
> > I'd rather you invest all of this energy getting that vendor to fix
> > their kit.
> >
> We're now seeing several router vendors recommending people to not use
> flow labels for ECMP hashing. This is precisely because when a flow
> label changes, network devices that maintain state (firewalls, NAT,
> load balancers) can't deal with packets being rerouted so connections
> are dropped. Unfortunately, the need for packets of a flow to always
> follow the same path has become an implicit requirement that I think
> we need follow at least as the default behavior.
> 
> Martin: is there any change you could resurrect these patches? In
> order to solve the general problem of making routing consistent, I
> believe we want to keep sk_tx_hash consistent for the connection from
> which a consistent flow label can be derived. To avoid the overhead of
> a hash field in sk_common, maybe we could initially set a connection
> hash to a five-tuple hash for a flow instead of a random value? So in
> TW state the consistent hash can be computed on the fly.

Hi Tom,
Do we really need to use the five-tupe hash? There are several places using
current random hash, which looks more lightweight. To fix issue, we only need
to make sure reset packet include the correct flowlabel. Like what my previous
patch did, we can set tw->tw_flowlabel in tcp_time_wait based on txhash and use
it reset packet. In this way we can use the random hash and not add extra field
in sock.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-11-14 18:24                             ` Shaohua Li
@ 2017-11-14 19:13                               ` Tom Herbert
  2017-11-14 21:59                                 ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2017-11-14 19:13 UTC (permalink / raw)
  To: Shaohua Li; +Cc: David Miller, Martin Lau, Linux Kernel Network Developers

On Tue, Nov 14, 2017 at 10:24 AM, Shaohua Li <shli@kernel.org> wrote:
> On Wed, Nov 08, 2017 at 09:44:51AM -0800, Tom Herbert wrote:
>> On Fri, Aug 18, 2017 at 3:27 PM, David Miller <davem@davemloft.net> wrote:
>> > From: Martin KaFai Lau <kafai@fb.com>
>> > Date: Fri, 18 Aug 2017 13:51:36 -0700
>> >
>> >> It seems like that middle box specifically drops TCP_RST if it
>> >> does not know anything about this flow.  Since the flowlabel of the TCP_RST
>> >> (sent in TW state) is always different, it always lands to a different middle
>> >> box.  All of these TCP_RST cannot be delivered.
>> >
>> > This really is illegal behavior.  The flow label is not a flow _KEY_
>> > by any definition whatsoever.
>> >
>> > Flow labels are an optimization, not a determinant for flow matching
>> > particularly for proper TCP state processing.
>> >
>> > I'd rather you invest all of this energy getting that vendor to fix
>> > their kit.
>> >
>> We're now seeing several router vendors recommending people to not use
>> flow labels for ECMP hashing. This is precisely because when a flow
>> label changes, network devices that maintain state (firewalls, NAT,
>> load balancers) can't deal with packets being rerouted so connections
>> are dropped. Unfortunately, the need for packets of a flow to always
>> follow the same path has become an implicit requirement that I think
>> we need follow at least as the default behavior.
>>
>> Martin: is there any change you could resurrect these patches? In
>> order to solve the general problem of making routing consistent, I
>> believe we want to keep sk_tx_hash consistent for the connection from
>> which a consistent flow label can be derived. To avoid the overhead of
>> a hash field in sk_common, maybe we could initially set a connection
>> hash to a five-tuple hash for a flow instead of a random value? So in
>> TW state the consistent hash can be computed on the fly.
>
> Hi Tom,
> Do we really need to use the five-tupe hash? There are several places using
> current random hash, which looks more lightweight. To fix issue, we only need
> to make sure reset packet include the correct flowlabel. Like what my previous
> patch did, we can set tw->tw_flowlabel in tcp_time_wait based on txhash and use
> it reset packet. In this way we can use the random hash and not add extra field
> in sock.
>
Shaohua,

But that patch discards the full txhash in TW. So it's not just a
problem with the flow label. sk_tx_hash can also be used for route
selection in ECMP, port selection we're doing tunneling, etc. The
general solution should maintains tx_hash or be able to reconstruct it
in any state, flow label fix is a point solution.

Thanks,
Tom

> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-11-14 19:13                               ` Tom Herbert
@ 2017-11-14 21:59                                 ` Shaohua Li
  2017-11-15 18:27                                   ` Martin KaFai Lau
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2017-11-14 21:59 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Martin Lau, Linux Kernel Network Developers

On Tue, Nov 14, 2017 at 11:13:10AM -0800, Tom Herbert wrote:
> On Tue, Nov 14, 2017 at 10:24 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Wed, Nov 08, 2017 at 09:44:51AM -0800, Tom Herbert wrote:
> >> On Fri, Aug 18, 2017 at 3:27 PM, David Miller <davem@davemloft.net> wrote:
> >> > From: Martin KaFai Lau <kafai@fb.com>
> >> > Date: Fri, 18 Aug 2017 13:51:36 -0700
> >> >
> >> >> It seems like that middle box specifically drops TCP_RST if it
> >> >> does not know anything about this flow.  Since the flowlabel of the TCP_RST
> >> >> (sent in TW state) is always different, it always lands to a different middle
> >> >> box.  All of these TCP_RST cannot be delivered.
> >> >
> >> > This really is illegal behavior.  The flow label is not a flow _KEY_
> >> > by any definition whatsoever.
> >> >
> >> > Flow labels are an optimization, not a determinant for flow matching
> >> > particularly for proper TCP state processing.
> >> >
> >> > I'd rather you invest all of this energy getting that vendor to fix
> >> > their kit.
> >> >
> >> We're now seeing several router vendors recommending people to not use
> >> flow labels for ECMP hashing. This is precisely because when a flow
> >> label changes, network devices that maintain state (firewalls, NAT,
> >> load balancers) can't deal with packets being rerouted so connections
> >> are dropped. Unfortunately, the need for packets of a flow to always
> >> follow the same path has become an implicit requirement that I think
> >> we need follow at least as the default behavior.
> >>
> >> Martin: is there any change you could resurrect these patches? In
> >> order to solve the general problem of making routing consistent, I
> >> believe we want to keep sk_tx_hash consistent for the connection from
> >> which a consistent flow label can be derived. To avoid the overhead of
> >> a hash field in sk_common, maybe we could initially set a connection
> >> hash to a five-tuple hash for a flow instead of a random value? So in
> >> TW state the consistent hash can be computed on the fly.
> >
> > Hi Tom,
> > Do we really need to use the five-tupe hash? There are several places using
> > current random hash, which looks more lightweight. To fix issue, we only need
> > to make sure reset packet include the correct flowlabel. Like what my previous
> > patch did, we can set tw->tw_flowlabel in tcp_time_wait based on txhash and use
> > it reset packet. In this way we can use the random hash and not add extra field
> > in sock.
> >
> Shaohua,
> 
> But that patch discards the full txhash in TW. So it's not just a
> problem with the flow label. sk_tx_hash can also be used for route
> selection in ECMP, port selection we're doing tunneling, etc. The
> general solution should maintains tx_hash or be able to reconstruct it
> in any state, flow label fix is a point solution.

Hi Tom,

do you want to keep sk_rethink_txhash() then? If we changed the hash to random
number, we can't reconstruct it for sure.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
  2017-11-14 21:59                                 ` Shaohua Li
@ 2017-11-15 18:27                                   ` Martin KaFai Lau
  0 siblings, 0 replies; 32+ messages in thread
From: Martin KaFai Lau @ 2017-11-15 18:27 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Shaohua Li, David Miller, Linux Kernel Network Developers

On Tue, Nov 14, 2017 at 01:59:03PM -0800, Shaohua Li wrote:
> On Tue, Nov 14, 2017 at 11:13:10AM -0800, Tom Herbert wrote:
> > On Tue, Nov 14, 2017 at 10:24 AM, Shaohua Li <shli@kernel.org> wrote:
> > > On Wed, Nov 08, 2017 at 09:44:51AM -0800, Tom Herbert wrote:
> > >> On Fri, Aug 18, 2017 at 3:27 PM, David Miller <davem@davemloft.net> wrote:
> > >> > From: Martin KaFai Lau <kafai@fb.com>
> > >> > Date: Fri, 18 Aug 2017 13:51:36 -0700
> > >> >
> > >> >> It seems like that middle box specifically drops TCP_RST if it
> > >> >> does not know anything about this flow.  Since the flowlabel of the TCP_RST
> > >> >> (sent in TW state) is always different, it always lands to a different middle
> > >> >> box.  All of these TCP_RST cannot be delivered.
> > >> >
> > >> > This really is illegal behavior.  The flow label is not a flow _KEY_
> > >> > by any definition whatsoever.
> > >> >
> > >> > Flow labels are an optimization, not a determinant for flow matching
> > >> > particularly for proper TCP state processing.
> > >> >
> > >> > I'd rather you invest all of this energy getting that vendor to fix
> > >> > their kit.
> > >> >
> > >> We're now seeing several router vendors recommending people to not use
> > >> flow labels for ECMP hashing. This is precisely because when a flow
> > >> label changes, network devices that maintain state (firewalls, NAT,
> > >> load balancers) can't deal with packets being rerouted so connections
> > >> are dropped. Unfortunately, the need for packets of a flow to always
> > >> follow the same path has become an implicit requirement that I think
> > >> we need follow at least as the default behavior.
> > >>
> > >> Martin: is there any change you could resurrect these patches? In
> > >> order to solve the general problem of making routing consistent, I
> > >> believe we want to keep sk_tx_hash consistent for the connection from
> > >> which a consistent flow label can be derived. To avoid the overhead of
> > >> a hash field in sk_common, maybe we could initially set a connection
> > >> hash to a five-tuple hash for a flow instead of a random value? So in
> > >> TW state the consistent hash can be computed on the fly.
> > >
> > > Hi Tom,
> > > Do we really need to use the five-tupe hash? There are several places using
> > > current random hash, which looks more lightweight. To fix issue, we only need
> > > to make sure reset packet include the correct flowlabel. Like what my previous
> > > patch did, we can set tw->tw_flowlabel in tcp_time_wait based on txhash and use
> > > it reset packet. In this way we can use the random hash and not add extra field
> > > in sock.
> > >
> > Shaohua,
> > 
> > But that patch discards the full txhash in TW. So it's not just a
> > problem with the flow label. sk_tx_hash can also be used for route
> > selection in ECMP, port selection we're doing tunneling, etc. The
> > general solution should maintains tx_hash or be able to reconstruct it
> > in any state, flow label fix is a point solution.
> 
> Hi Tom,
> 
> do you want to keep sk_rethink_txhash() then? If we changed the hash to random
> number, we can't reconstruct it for sure.
A followup question on rethink.  Does it mean we need
a new sysctl (persistent_txhash) to avoid sk_rethink_txhash() together
such that it keeps the routing decision consistent (e.g. flowlabel) ?

> 
> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-11-15 18:27 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-31 22:19 [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet Shaohua Li
2017-07-31 22:19 ` [PATCH V4 net 1/2] net: remove unnecessary rotation Shaohua Li
2017-07-31 22:19 ` [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6 Shaohua Li
2017-07-31 22:35   ` Cong Wang
2017-07-31 23:00     ` Shaohua Li
2017-08-01 21:17       ` Cong Wang
2017-08-01 21:42         ` Shaohua Li
2017-08-03  1:33           ` Cong Wang
2017-08-09 14:59 ` [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet Shaohua Li
2017-08-09 17:55   ` David Miller
2017-08-09 16:40 ` Tom Herbert
2017-08-10 16:30   ` Shaohua Li
2017-08-10 18:30     ` Tom Herbert
2017-08-10 19:13       ` Shaohua Li
2017-08-12  1:00         ` Tom Herbert
2017-08-15  2:52           ` Shaohua Li
2017-08-15 14:08             ` Tom Herbert
2017-08-15 22:42               ` Shaohua Li
2017-08-16  0:15                 ` Tom Herbert
2017-08-17 17:26                   ` Shaohua Li
2017-08-17 19:07                     ` Tom Herbert
2017-08-17 22:55                   ` Martin KaFai Lau
2017-08-18 14:50                     ` Tom Herbert
2017-08-18 20:51                       ` Martin KaFai Lau
2017-08-18 22:27                         ` David Miller
2017-11-08 17:44                           ` Tom Herbert
2017-11-08 20:01                             ` Tom Herbert
2017-11-08 21:41                               ` Martin KaFai Lau
2017-11-14 18:24                             ` Shaohua Li
2017-11-14 19:13                               ` Tom Herbert
2017-11-14 21:59                                 ` Shaohua Li
2017-11-15 18:27                                   ` Martin KaFai Lau

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.