All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
@ 2018-04-19 21:23 Martin KaFai Lau
  2018-04-20 20:43 ` Julian Anastasov
  2018-05-02  6:38 ` Julian Anastasov
  0 siblings, 2 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-04-19 21:23 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team

This patch is not a proper fix and mainly serves for discussion purpose.
It is based on net-next which I have been using to debug the issue.

The change that works around the issue is in ensure_mtu_is_adequate().
Other changes are the rippling effect in function arg.

This bug was uncovered by one of our legacy service that
are still using ipvs for load balancing.  In that setup,
the ipvs encap the ipv6-tcp packet in another ipv6 hdr
before tx it out to eth0.

The problem is the kernel stack could pass a skb (which was
originated from a sys_write(tcp_fd)) to the driver with skb->len
bigger than the device MTU.  In one NIC setup (with gso and tso off)
that we are using, it upset the NIC/driver and caused the tx queue
stalled for tens of seconds which is how it got uncovered.
(On the NIC side, the NIC firmware and driver have been fixed
to avoid this tx queue stall after seeing this skb).

On the kernel side, based on the commit log, this bug should have
been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").

Before commit 815d22e55b0e, ipv6_gso_segment() would just error
out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
Due to this error out, it avoid passing it to the driver.  The TCP
stack then timeout and the TCP mtu probing eventually kicked in to
lower the skb->len enough to avoid gso_segment.

After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
-> tcp6_gso_segment() which segment the packet based on a mss
that does not account for the extra IPv6 hdr.

Here is a stack from the WARN_ON() that we added to the driver to
capture the issue:
[ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
...
[ 1129.016536] Call Trace:
[ 1129.021412]  ? skb_release_data+0xfc/0x120
[ 1129.029587]  ? kfree_skbmem+0x64/0x70
[ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
[ 1129.045262]  sch_direct_xmit+0x10f/0x280
[ 1129.053111]  __qdisc_run+0x223/0x5a0
[ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
[ 1129.068268]  dev_queue_xmit+0x10/0x20
[ 1129.075573]  ? dev_queue_xmit+0x10/0x20
[ 1129.083218]  ip6_finish_output2+0x2db/0x490
[ 1129.091573]  ip6_finish_output+0x125/0x190
[ 1129.099754]  ip6_output+0x5f/0x100
[ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
[ 1129.114212]  ip6_local_out+0x35/0x40
[ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
[ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
[ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
[ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
[ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
[ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
[ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
[ 1129.188849]  nf_hook_slow+0x3e/0xc0
[ 1129.195800]  ip6_xmit+0x30b/0x540
[ 1129.202421]  ? ac6_proc_exit+0x20/0x20
[ 1129.209909]  inet6_csk_xmit+0x82/0xd0
[ 1129.217207]  ? lock_timer_base+0x76/0xa0
[ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
[ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
[ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
[ 1129.250106]  tcp_push+0xde/0x100
[ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
[ 1129.264910]  tcp_sendmsg+0x2c/0x50
[ 1129.271703]  inet_sendmsg+0x31/0xb0
[ 1129.278672]  sock_write_iter+0xf8/0x110
[ 1129.286335]  new_sync_write+0xd9/0x120
[ 1129.293823]  vfs_write+0x18d/0x1e0
[ 1129.300614]  SyS_write+0x48/0xa0
[ 1129.307045]  do_syscall_64+0x69/0x1e0
[ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
...
[ 1129.648183] ---[ end trace 635061c9c300799e ]---
[ 1129.657407] skb->len:1554 MTU:1522

The tcp flow is connecting from the address ending ':27:0' to the ':85'.

[host-a] > ip -6 r show table local
local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium

[host-a] > ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
       valid_lft forever preferred_lft forever

[host-a] > cat /proc/net/ip_vs
TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
  -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
  -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
  -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7

[host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
send-something-long-here-to-trigger-the-bug

Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
can also side step the issue.  Like this:

> ip -6 r show table local
local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 11c416f3d6e3..88cc0d53ebce 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
 		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
 }
 
-static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
+static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
 					  int rt_mode,
 					  struct ip_vs_iphdr *ipvsh,
 					  struct sk_buff *skb, int mtu)
 {
+	struct netns_ipvs *ipvs = cp->ipvs;
+
 #ifdef CONFIG_IP_VS_IPV6
-	if (skb_af == AF_INET6) {
+	if (cp->af == AF_INET6) {
 		struct net *net = ipvs->net;
 
 		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
@@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 		}
 	}
 
+	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
+		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
+		unsigned short hdr_len = (th->doff << 2) +
+			skb_network_header_len(skb);
+
+		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
+			skb_decrease_gso_size(skb_shinfo(skb),
+					      skb_shinfo(skb)->gso_size -
+					      (mtu - hdr_len));
+	}
+
 	return true;
 }
 
@@ -305,13 +318,15 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 
 /* Get route to destination or remote server */
 static int
-__ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
+__ip_vs_get_out_rt(struct ip_vs_conn *cp, struct sk_buff *skb,
 		   struct ip_vs_dest *dest,
 		   __be32 daddr, int rt_mode, __be32 *ret_saddr,
 		   struct ip_vs_iphdr *ipvsh)
 {
+	struct netns_ipvs *ipvs = cp->ipvs;
 	struct net *net = ipvs->net;
 	struct ip_vs_dest_dst *dest_dst;
+	int skb_af = cp->af;
 	struct rtable *rt;			/* Route to the other host */
 	int mtu;
 	int local, noref = 1;
@@ -389,7 +404,7 @@ __ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
 		maybe_update_pmtu(skb_af, skb, mtu);
 	}
 
-	if (!ensure_mtu_is_adequate(ipvs, skb_af, rt_mode, ipvsh, skb, mtu))
+	if (!ensure_mtu_is_adequate(cp, rt_mode, ipvsh, skb, mtu))
 		goto err_put;
 
 	skb_dst_drop(skb);
@@ -455,13 +470,15 @@ __ip_vs_route_output_v6(struct net *net, struct in6_addr *daddr,
  * Get route to destination or remote server
  */
 static int
-__ip_vs_get_out_rt_v6(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
+__ip_vs_get_out_rt_v6(struct ip_vs_conn *cp, struct sk_buff *skb,
 		      struct ip_vs_dest *dest,
 		      struct in6_addr *daddr, struct in6_addr *ret_saddr,
 		      struct ip_vs_iphdr *ipvsh, int do_xfrm, int rt_mode)
 {
+	struct netns_ipvs *ipvs = cp->ipvs;
 	struct net *net = ipvs->net;
 	struct ip_vs_dest_dst *dest_dst;
+	int skb_af = cp->af;
 	struct rt6_info *rt;			/* Route to the other host */
 	struct dst_entry *dst;
 	int mtu;
@@ -541,7 +558,7 @@ __ip_vs_get_out_rt_v6(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
 		maybe_update_pmtu(skb_af, skb, mtu);
 	}
 
-	if (!ensure_mtu_is_adequate(ipvs, skb_af, rt_mode, ipvsh, skb, mtu))
+	if (!ensure_mtu_is_adequate(cp, rt_mode, ipvsh, skb, mtu))
 		goto err_put;
 
 	skb_dst_drop(skb);
@@ -679,7 +696,7 @@ ip_vs_bypass_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	if (__ip_vs_get_out_rt(cp->ipvs, cp->af, skb, NULL, iph->daddr,
+	if (__ip_vs_get_out_rt(cp, skb, NULL, iph->daddr,
 			       IP_VS_RT_MODE_NON_LOCAL, NULL, ipvsh) < 0)
 		goto tx_error;
 
@@ -708,7 +725,7 @@ ip_vs_bypass_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	if (__ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, NULL,
+	if (__ip_vs_get_out_rt_v6(cp, skb, NULL,
 				  &iph->daddr, NULL,
 				  ipvsh, 0, IP_VS_RT_MODE_NON_LOCAL) < 0)
 		goto tx_error;
@@ -753,7 +770,7 @@ ip_vs_nat_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	}
 
 	was_input = rt_is_input_route(skb_rtable(skb));
-	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
 				   IP_VS_RT_MODE_RDR, NULL, ipvsh);
@@ -839,7 +856,7 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 		IP_VS_DBG(10, "filled cport=%d\n", ntohs(*p));
 	}
 
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6,
 				      NULL, ipvsh, 0,
 				      IP_VS_RT_MODE_LOCAL |
@@ -1031,7 +1048,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt(ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
 				   IP_VS_RT_MODE_CONNECT |
@@ -1129,7 +1146,7 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6,
 				      &saddr, ipvsh, 1,
 				      IP_VS_RT_MODE_LOCAL |
@@ -1218,7 +1235,7 @@ ip_vs_dr_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
 				   IP_VS_RT_MODE_KNOWN_NH, NULL, ipvsh);
@@ -1252,7 +1269,7 @@ ip_vs_dr_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6,
 				      NULL, ipvsh, 0,
 				      IP_VS_RT_MODE_LOCAL |
@@ -1317,7 +1334,7 @@ ip_vs_icmp_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	rt_mode = (hooknum != NF_INET_FORWARD) ?
 		  IP_VS_RT_MODE_LOCAL | IP_VS_RT_MODE_NON_LOCAL |
 		  IP_VS_RT_MODE_RDR : IP_VS_RT_MODE_NON_LOCAL;
-	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip, rt_mode,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip, rt_mode,
 				   NULL, iph);
 	if (local < 0)
 		goto tx_error;
@@ -1406,7 +1423,7 @@ ip_vs_icmp_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 	rt_mode = (hooknum != NF_INET_FORWARD) ?
 		  IP_VS_RT_MODE_LOCAL | IP_VS_RT_MODE_NON_LOCAL |
 		  IP_VS_RT_MODE_RDR : IP_VS_RT_MODE_NON_LOCAL;
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6, NULL, ipvsh, 0, rt_mode);
 	if (local < 0)
 		goto tx_error;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-04-19 21:23 [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP Martin KaFai Lau
@ 2018-04-20 20:43 ` Julian Anastasov
  2018-04-20 23:13   ` Martin KaFai Lau
  2018-05-02  6:38 ` Julian Anastasov
  1 sibling, 1 reply; 17+ messages in thread
From: Julian Anastasov @ 2018-04-20 20:43 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team


	Hello,

On Thu, 19 Apr 2018, Martin KaFai Lau wrote:

> This patch is not a proper fix and mainly serves for discussion purpose.
> It is based on net-next which I have been using to debug the issue.
> 
> The change that works around the issue is in ensure_mtu_is_adequate().
> Other changes are the rippling effect in function arg.
> 
> This bug was uncovered by one of our legacy service that
> are still using ipvs for load balancing.  In that setup,
> the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> before tx it out to eth0.
> 
> The problem is the kernel stack could pass a skb (which was
> originated from a sys_write(tcp_fd)) to the driver with skb->len
> bigger than the device MTU.  In one NIC setup (with gso and tso off)
> that we are using, it upset the NIC/driver and caused the tx queue
> stalled for tens of seconds which is how it got uncovered.
> (On the NIC side, the NIC firmware and driver have been fixed
> to avoid this tx queue stall after seeing this skb).
> 
> On the kernel side, based on the commit log, this bug should have
> been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").

	Before I go to deeply analyze the GSO code, here
are some preliminary observations:

- IPVS started to use iptunnel_handle_offloads() around 2014,
commit ea1d5d7755a3

- later (2016) the "ip6ip6: Support for GSO/GRO" commits started
to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
I'm not sure if such call can help. At least, I don't see what
different happens in IPVS compared to ip6ip6_tnl_xmit(), for example.

> Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> Due to this error out, it avoid passing it to the driver.  The TCP
> stack then timeout and the TCP mtu probing eventually kicked in to
> lower the skb->len enough to avoid gso_segment.
> 
> After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> -> tcp6_gso_segment() which segment the packet based on a mss
> that does not account for the extra IPv6 hdr.
> 
> Here is a stack from the WARN_ON() that we added to the driver to
> capture the issue:
> [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> ...
> [ 1129.016536] Call Trace:
> [ 1129.021412]  ? skb_release_data+0xfc/0x120
> [ 1129.029587]  ? kfree_skbmem+0x64/0x70
> [ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
> [ 1129.045262]  sch_direct_xmit+0x10f/0x280
> [ 1129.053111]  __qdisc_run+0x223/0x5a0
> [ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
> [ 1129.068268]  dev_queue_xmit+0x10/0x20
> [ 1129.075573]  ? dev_queue_xmit+0x10/0x20
> [ 1129.083218]  ip6_finish_output2+0x2db/0x490
> [ 1129.091573]  ip6_finish_output+0x125/0x190
> [ 1129.099754]  ip6_output+0x5f/0x100
> [ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
> [ 1129.114212]  ip6_local_out+0x35/0x40
> [ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> [ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
> [ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> [ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> [ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> [ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
> [ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
> [ 1129.188849]  nf_hook_slow+0x3e/0xc0
> [ 1129.195800]  ip6_xmit+0x30b/0x540
> [ 1129.202421]  ? ac6_proc_exit+0x20/0x20
> [ 1129.209909]  inet6_csk_xmit+0x82/0xd0
> [ 1129.217207]  ? lock_timer_base+0x76/0xa0
> [ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
> [ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
> [ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
> [ 1129.250106]  tcp_push+0xde/0x100
> [ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
> [ 1129.264910]  tcp_sendmsg+0x2c/0x50
> [ 1129.271703]  inet_sendmsg+0x31/0xb0
> [ 1129.278672]  sock_write_iter+0xf8/0x110
> [ 1129.286335]  new_sync_write+0xd9/0x120
> [ 1129.293823]  vfs_write+0x18d/0x1e0
> [ 1129.300614]  SyS_write+0x48/0xa0
> [ 1129.307045]  do_syscall_64+0x69/0x1e0
> [ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> ...
> [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> [ 1129.657407] skb->len:1554 MTU:1522
> 
> The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> 
> [host-a] > ip -6 r show table local
> local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> 
> [host-a] > ip -6 a
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
>     inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
>        valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
>     inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
>        valid_lft forever preferred_lft forever
> 
> [host-a] > cat /proc/net/ip_vs
> TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
>   -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
>   -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
>   -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7
> 
> [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> send-something-long-here-to-trigger-the-bug
> 
> Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> can also side step the issue.  Like this:

	Yes, reducing the MTU was a solution recommended from
long time ago in the IPVS HOWTO.

> 
> > ip -6 r show table local
> local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
>  1 file changed, 33 insertions(+), 16 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 11c416f3d6e3..88cc0d53ebce 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
>  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
>  }
>  
> -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
>  					  int rt_mode,
>  					  struct ip_vs_iphdr *ipvsh,
>  					  struct sk_buff *skb, int mtu)
>  {
> +	struct netns_ipvs *ipvs = cp->ipvs;
> +
>  #ifdef CONFIG_IP_VS_IPV6
> -	if (skb_af == AF_INET6) {
> +	if (cp->af == AF_INET6) {
>  		struct net *net = ipvs->net;
>  
>  		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  		}
>  	}
>  
> +	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> +		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> +		unsigned short hdr_len = (th->doff << 2) +
> +			skb_network_header_len(skb);
> +
> +		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> +			skb_decrease_gso_size(skb_shinfo(skb),
> +					      skb_shinfo(skb)->gso_size -
> +					      (mtu - hdr_len));

	So, software segmentation happens and we want the
tunnel header to be accounted immediately and not after PMTU
probing period? Is this a problem only for IPVS tunnels?
Do we see such delays with other tunnels? May be this should
be solved for all protocols (not just TCP) and for all tunnels?
Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
local sender, so we should really alter the gso_size for proper
segmentation?

Regards

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-04-20 20:43 ` Julian Anastasov
@ 2018-04-20 23:13   ` Martin KaFai Lau
  0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-04-20 23:13 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team

On Fri, Apr 20, 2018 at 11:43:59PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
> 
> > This patch is not a proper fix and mainly serves for discussion purpose.
> > It is based on net-next which I have been using to debug the issue.
> > 
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > Other changes are the rippling effect in function arg.
> > 
> > This bug was uncovered by one of our legacy service that
> > are still using ipvs for load balancing.  In that setup,
> > the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> > before tx it out to eth0.
> > 
> > The problem is the kernel stack could pass a skb (which was
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU.  In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > stalled for tens of seconds which is how it got uncovered.
> > (On the NIC side, the NIC firmware and driver have been fixed
> > to avoid this tx queue stall after seeing this skb).
> > 
> > On the kernel side, based on the commit log, this bug should have
> > been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").
> 
> 	Before I go to deeply analyze the GSO code, here
> are some preliminary observations:
> 
> - IPVS started to use iptunnel_handle_offloads() around 2014,
> commit ea1d5d7755a3
> 
> - later (2016) the "ip6ip6: Support for GSO/GRO" commits started
> to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
> I'm not sure if such call can help. At least, I don't see what
> different happens in IPVS compared to ip6ip6_tnl_xmit(), for example.
I don't see skb->inner_ipproto used in
ipv6_gso_segment() -> ipv6_gso_segment() -> tcp6_gso_segment()

One thing may worth to highlight is that this problem sort of exist
even before commit 815d22e55b0e, it just happened that the ipv6_gso_segment()
error out (-EPROTONOSUPPORT) which then stops sending it down to the driver.

> 
> > Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> > out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> > Due to this error out, it avoid passing it to the driver.  The TCP
> > stack then timeout and the TCP mtu probing eventually kicked in to
> > lower the skb->len enough to avoid gso_segment.
> > 
> > After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> > -> tcp6_gso_segment() which segment the packet based on a mss
> > that does not account for the extra IPv6 hdr.
> > 
> > Here is a stack from the WARN_ON() that we added to the driver to
> > capture the issue:
> > [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> > ...
> > [ 1129.016536] Call Trace:
> > [ 1129.021412]  ? skb_release_data+0xfc/0x120
> > [ 1129.029587]  ? kfree_skbmem+0x64/0x70
> > [ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
> > [ 1129.045262]  sch_direct_xmit+0x10f/0x280
> > [ 1129.053111]  __qdisc_run+0x223/0x5a0
> > [ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
> > [ 1129.068268]  dev_queue_xmit+0x10/0x20
> > [ 1129.075573]  ? dev_queue_xmit+0x10/0x20
> > [ 1129.083218]  ip6_finish_output2+0x2db/0x490
> > [ 1129.091573]  ip6_finish_output+0x125/0x190
> > [ 1129.099754]  ip6_output+0x5f/0x100
> > [ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
> > [ 1129.114212]  ip6_local_out+0x35/0x40
> > [ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> > [ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> > [ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> > [ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
> > [ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
> > [ 1129.188849]  nf_hook_slow+0x3e/0xc0
> > [ 1129.195800]  ip6_xmit+0x30b/0x540
> > [ 1129.202421]  ? ac6_proc_exit+0x20/0x20
> > [ 1129.209909]  inet6_csk_xmit+0x82/0xd0
> > [ 1129.217207]  ? lock_timer_base+0x76/0xa0
> > [ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
> > [ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
> > [ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
> > [ 1129.250106]  tcp_push+0xde/0x100
> > [ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
> > [ 1129.264910]  tcp_sendmsg+0x2c/0x50
> > [ 1129.271703]  inet_sendmsg+0x31/0xb0
> > [ 1129.278672]  sock_write_iter+0xf8/0x110
> > [ 1129.286335]  new_sync_write+0xd9/0x120
> > [ 1129.293823]  vfs_write+0x18d/0x1e0
> > [ 1129.300614]  SyS_write+0x48/0xa0
> > [ 1129.307045]  do_syscall_64+0x69/0x1e0
> > [ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > ...
> > [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> > [ 1129.657407] skb->len:1554 MTU:1522
> > 
> > The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> > 
> > [host-a] > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> > 
> > [host-a] > ip -6 a
> > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
> >     inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
> >        valid_lft forever preferred_lft forever
> > 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
> >     inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
> >        valid_lft forever preferred_lft forever
> > 
> > [host-a] > cat /proc/net/ip_vs
> > TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
> >   -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
> >   -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
> >   -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7
> > 
> > [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> > send-something-long-here-to-trigger-the-bug
> > 
> > Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> > can also side step the issue.  Like this:
> 
> 	Yes, reducing the MTU was a solution recommended from
> long time ago in the IPVS HOWTO.
> 
> > 
> > > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> > 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
> >  1 file changed, 33 insertions(+), 16 deletions(-)
> > 
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 11c416f3d6e3..88cc0d53ebce 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
> >  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
> >  }
> >  
> > -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> > +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
> >  					  int rt_mode,
> >  					  struct ip_vs_iphdr *ipvsh,
> >  					  struct sk_buff *skb, int mtu)
> >  {
> > +	struct netns_ipvs *ipvs = cp->ipvs;
> > +
> >  #ifdef CONFIG_IP_VS_IPV6
> > -	if (skb_af == AF_INET6) {
> > +	if (cp->af == AF_INET6) {
> >  		struct net *net = ipvs->net;
> >  
> >  		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> > @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> >  		}
> >  	}
> >  
> > +	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> > +		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> > +		unsigned short hdr_len = (th->doff << 2) +
> > +			skb_network_header_len(skb);
> > +
> > +		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> > +			skb_decrease_gso_size(skb_shinfo(skb),
> > +					      skb_shinfo(skb)->gso_size -
> > +					      (mtu - hdr_len));
> 
> 	So, software segmentation happens and we want the
> tunnel header to be accounted immediately and not after PMTU
> probing period?
I think accounting tunnel header correctly early on is the expected
behavior.  Otherwise, if we choose to drop the big skb instead (either
gso_segment error out, driver error out or NIC drops it from tx queue),
the first few packets of a TCP connection will guarantee to
experience timeout.

> Is this a problem only for IPVS tunnels?
> Do we see such delays with other tunnels?
Based on what I understand, a tun dev (setup by iproute2) should
not experience this issue because the MTU of that tun dev has
already accounted for the extra header.

Our current work around using a smaller MTU in the local route has
similar effect (in the sense that introducing a smaller MTU betewen
the phyical eth0 and the TCP stack).

Thanks for looking into it!

Martin

> May be this should
> be solved for all protocols (not just TCP) and for all tunnels?
> Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
> local sender, so we should really alter the gso_size for proper
> segmentation?
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-04-19 21:23 [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP Martin KaFai Lau
  2018-04-20 20:43 ` Julian Anastasov
@ 2018-05-02  6:38 ` Julian Anastasov
  2018-05-02 17:10     ` Martin KaFai Lau
  1 sibling, 1 reply; 17+ messages in thread
From: Julian Anastasov @ 2018-05-02  6:38 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team,
	lvs-devel


	Hello,

On Thu, 19 Apr 2018, Martin KaFai Lau wrote:

> This patch is not a proper fix and mainly serves for discussion purpose.
> It is based on net-next which I have been using to debug the issue.
> 
> The change that works around the issue is in ensure_mtu_is_adequate().
> Other changes are the rippling effect in function arg.
> 
> This bug was uncovered by one of our legacy service that
> are still using ipvs for load balancing.  In that setup,
> the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> before tx it out to eth0.
> 
> The problem is the kernel stack could pass a skb (which was
> originated from a sys_write(tcp_fd)) to the driver with skb->len
> bigger than the device MTU.  In one NIC setup (with gso and tso off)
> that we are using, it upset the NIC/driver and caused the tx queue
> stalled for tens of seconds which is how it got uncovered.
> (On the NIC side, the NIC firmware and driver have been fixed
> to avoid this tx queue stall after seeing this skb).

	In the last week I analyzed the situation and found
that just changes in route.c are able to solve the problems,
at 99% :) I'm posting a separate patch for this.

	Here is what happens, I'm testing with FTP which
starts with connection to port 21 and then with data connection,
from local ftp client to local Virtual IP (forwarded to remote
real server via tunnel).

- TCP creates local route which after commit 839da4d98960
appears to be non-cached, before this commit it is cached.
Route is saved and reused.

- initial traffic for port 21 does not use GSO. But after
every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
to report the reduced MTU. These updates are stored in fnhe_pmtu
but they do not go to any route, even if we try to get fresh
output route. Why? Because the local routes are not cached, so
they can not use the fnhe. This is what my patch for route.c
will fix. With this fix FTP-DATA gets route with reduced PMTU.

- later, there are large GSO packets in the FTP-DATA connection.
Currently, IPVS does not send ICMP error for exceeded PMTU
by GSO packets (this will be fixed, see patch below). Even
if they reach TCP and it refreshes its route, TCP can not get
any actual PMTU values from the routing, so continues to
use the large gso_size without the patch for route.c

	For the changes in IPVS that I show below as RFC:

- I synchronized the PMTU checks in __mtu_check_toobig* with
IPv4/IPv6 forwarding

- I modified your changes, see ipvs_gso_hlen() and how I use it
at start of ensure_mtu_is_adequate(), after skb is made writable,
before the PMTU checks.

	In my tests, all variants work, just with skb_decrease_gso_size
or even if I add SKB_GSO_DODGY+gso_segs=0. But I'm not sure
if this is safe for the different GSO configurations.

	I'm analyzing what can be changed in route.c, so that
dst.ops->update_pmtu changes to take effect for the provided
route. If that is possible, it will allow the PMTU change to
take immediate effect for the local routes.

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 4527921..7a2a0a89 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -104,9 +104,32 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
 	return dest_dst;
 }
 
-static inline bool
-__mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
+/* Check if packet size violates MTU */
+static bool __mtu_check_toobig(const struct sk_buff *skb, u32 mtu)
 {
+	if (skb->len <= mtu)
+		return false;
+
+	if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
+		return false;
+
+	if (IPCB(skb)->frag_max_size) {
+		/* frag_max_size tell us that, this packet have been
+		 * defragmented by netfilter IPv4 conntrack module.
+		 */
+		if (IPCB(skb)->frag_max_size > mtu)
+			return true; /* largest fragment violate MTU */
+	}
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
+		return false;
+	return true;
+}
+
+/* Check if packet size violates MTU */
+static bool __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
+{
+	if (skb->len <= mtu)
+		return false;
 	if (IP6CB(skb)->frag_max_size) {
 		/* frag_max_size tell us that, this packet have been
 		 * defragmented by netfilter IPv6 conntrack module.
@@ -114,10 +137,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
 		if (IP6CB(skb)->frag_max_size > mtu)
 			return true; /* largest fragment violate MTU */
 	}
-	else if (skb->len > mtu && !skb_is_gso(skb)) {
-		return true; /* Packet size violate MTU size */
-	}
-	return false;
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
+		return false;
+	return true;
 }
 
 /* Get route to daddr, update *saddr, optionally bind route to saddr */
@@ -212,11 +234,42 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
 		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
 }
 
+/* GSO: length of network and transport headers, 0=unsupported */
+static unsigned short ipvs_gso_hlen(struct sk_buff *skb,
+				    struct ip_vs_iphdr *ipvsh)
+{
+	unsigned short hlen = ipvsh->len - ipvsh->off;
+
+	if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) {
+		struct tcphdr _tcph, *th;
+
+		th = skb_header_pointer(skb, ipvsh->len, sizeof(_tcph), &_tcph);
+		if (th)
+			return hlen + (th->doff << 2);
+	}
+	return 0;
+}
+ 
 static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 					  int rt_mode,
 					  struct ip_vs_iphdr *ipvsh,
 					  struct sk_buff *skb, int mtu)
 {
+	/* Re-segment traffic from local clients */
+	if (!skb->dev && skb_is_gso(skb)) {
+		unsigned short hlen = ipvs_gso_hlen(skb, ipvsh);
+
+		if (hlen && mtu - hlen < skb_shinfo(skb)->gso_size &&
+		    mtu > hlen) {
+			u16 reduce = skb_shinfo(skb)->gso_size - (mtu - hlen);
+
+			IP_VS_DBG(10, "Reducing gso_size=%u with %u\n",
+				skb_shinfo(skb)->gso_size, reduce);
+			skb_decrease_gso_size(skb_shinfo(skb), reduce);
+			skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+			skb_shinfo(skb)->gso_segs = 0;
+		}
+	}
 #ifdef CONFIG_IP_VS_IPV6
 	if (skb_af == AF_INET6) {
 		struct net *net = ipvs->net;
@@ -240,9 +293,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 		if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs))
 			return true;
 
-		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
-			     skb->len > mtu && !skb_is_gso(skb) &&
-			     !ip_vs_iph_icmp(ipvsh))) {
+		if (unlikely(__mtu_check_toobig(skb, mtu))) {
 			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 				  htonl(mtu));
 			IP_VS_DBG(1, "frag needed for %pI4\n",

Regards

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-05-02  6:38 ` Julian Anastasov
@ 2018-05-02 17:10     ` Martin KaFai Lau
  0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-05-02 17:10 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel

On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
> 
> > This patch is not a proper fix and mainly serves for discussion purpose.
> > It is based on net-next which I have been using to debug the issue.
> > 
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > Other changes are the rippling effect in function arg.
> > 
> > This bug was uncovered by one of our legacy service that
> > are still using ipvs for load balancing.  In that setup,
> > the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> > before tx it out to eth0.
> > 
> > The problem is the kernel stack could pass a skb (which was
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU.  In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > stalled for tens of seconds which is how it got uncovered.
> > (On the NIC side, the NIC firmware and driver have been fixed
> > to avoid this tx queue stall after seeing this skb).
> 
> 	In the last week I analyzed the situation and found
> that just changes in route.c are able to solve the problems,
> at 99% :) I'm posting a separate patch for this.
> 
> 	Here is what happens, I'm testing with FTP which
> starts with connection to port 21 and then with data connection,
> from local ftp client to local Virtual IP (forwarded to remote
> real server via tunnel).
> 
> - TCP creates local route which after commit 839da4d98960
> appears to be non-cached, before this commit it is cached.
> Route is saved and reused.
> 
> - initial traffic for port 21 does not use GSO. But after
> every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
> to report the reduced MTU. These updates are stored in fnhe_pmtu
> but they do not go to any route, even if we try to get fresh
> output route. Why? Because the local routes are not cached, so
> they can not use the fnhe. This is what my patch for route.c
> will fix. With this fix FTP-DATA gets route with reduced PMTU.
For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in
__ip6_rt_update_pmtu() may need to be lifted also.

> 
> - later, there are large GSO packets in the FTP-DATA connection.
> Currently, IPVS does not send ICMP error for exceeded PMTU
> by GSO packets (this will be fixed, see patch below). Even
> if they reach TCP and it refreshes its route, TCP can not get
> any actual PMTU values from the routing, so continues to
> use the large gso_size without the patch for route.c
> 
> 	For the changes in IPVS that I show below as RFC:
> 
> - I synchronized the PMTU checks in __mtu_check_toobig* with
> IPv4/IPv6 forwarding
> 
> - I modified your changes, see ipvs_gso_hlen() and how I use it
> at start of ensure_mtu_is_adequate(), after skb is made writable,
> before the PMTU checks.
> 
> 	In my tests, all variants work, just with skb_decrease_gso_size
> or even if I add SKB_GSO_DODGY+gso_segs=0. But I'm not sure
> if this is safe for the different GSO configurations.
> 
> 	I'm analyzing what can be changed in route.c, so that
> dst.ops->update_pmtu changes to take effect for the provided
> route. If that is possible, it will allow the PMTU change to
> take immediate effect for the local routes.
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 4527921..7a2a0a89 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -104,9 +104,32 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
>  	return dest_dst;
>  }
>  
> -static inline bool
> -__mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> +/* Check if packet size violates MTU */
> +static bool __mtu_check_toobig(const struct sk_buff *skb, u32 mtu)
>  {
> +	if (skb->len <= mtu)
> +		return false;
> +
> +	if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
> +		return false;
> +
> +	if (IPCB(skb)->frag_max_size) {
> +		/* frag_max_size tell us that, this packet have been
> +		 * defragmented by netfilter IPv4 conntrack module.
> +		 */
> +		if (IPCB(skb)->frag_max_size > mtu)
> +			return true; /* largest fragment violate MTU */
> +	}
> +	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
> +		return false;
> +	return true;
> +}
> +
> +/* Check if packet size violates MTU */
> +static bool __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> +{
> +	if (skb->len <= mtu)
> +		return false;
>  	if (IP6CB(skb)->frag_max_size) {
>  		/* frag_max_size tell us that, this packet have been
>  		 * defragmented by netfilter IPv6 conntrack module.
> @@ -114,10 +137,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
>  		if (IP6CB(skb)->frag_max_size > mtu)
>  			return true; /* largest fragment violate MTU */
>  	}
> -	else if (skb->len > mtu && !skb_is_gso(skb)) {
> -		return true; /* Packet size violate MTU size */
> -	}
> -	return false;
> +	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
> +		return false;
> +	return true;
>  }
>  
>  /* Get route to daddr, update *saddr, optionally bind route to saddr */
> @@ -212,11 +234,42 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
>  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
>  }
>  
> +/* GSO: length of network and transport headers, 0=unsupported */
> +static unsigned short ipvs_gso_hlen(struct sk_buff *skb,
> +				    struct ip_vs_iphdr *ipvsh)
> +{
> +	unsigned short hlen = ipvsh->len - ipvsh->off;
> +
> +	if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) {
> +		struct tcphdr _tcph, *th;
> +
> +		th = skb_header_pointer(skb, ipvsh->len, sizeof(_tcph), &_tcph);
> +		if (th)
> +			return hlen + (th->doff << 2);
> +	}
> +	return 0;
> +}
> + 
>  static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  					  int rt_mode,
>  					  struct ip_vs_iphdr *ipvsh,
>  					  struct sk_buff *skb, int mtu)
>  {
> +	/* Re-segment traffic from local clients */
> +	if (!skb->dev && skb_is_gso(skb)) {
> +		unsigned short hlen = ipvs_gso_hlen(skb, ipvsh);
> +
> +		if (hlen && mtu - hlen < skb_shinfo(skb)->gso_size &&
> +		    mtu > hlen) {
> +			u16 reduce = skb_shinfo(skb)->gso_size - (mtu - hlen);
> +
> +			IP_VS_DBG(10, "Reducing gso_size=%u with %u\n",
> +				skb_shinfo(skb)->gso_size, reduce);
> +			skb_decrease_gso_size(skb_shinfo(skb), reduce);
> +			skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> +			skb_shinfo(skb)->gso_segs = 0;
> +		}
> +	}
>  #ifdef CONFIG_IP_VS_IPV6
>  	if (skb_af == AF_INET6) {
>  		struct net *net = ipvs->net;
> @@ -240,9 +293,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  		if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs))
>  			return true;
>  
> -		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
> -			     skb->len > mtu && !skb_is_gso(skb) &&
> -			     !ip_vs_iph_icmp(ipvsh))) {
> +		if (unlikely(__mtu_check_toobig(skb, mtu))) {
>  			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>  				  htonl(mtu));
>  			IP_VS_DBG(1, "frag needed for %pI4\n",
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
@ 2018-05-02 17:10     ` Martin KaFai Lau
  0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-05-02 17:10 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel

On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
> 
> > This patch is not a proper fix and mainly serves for discussion purpose.
> > It is based on net-next which I have been using to debug the issue.
> > 
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > Other changes are the rippling effect in function arg.
> > 
> > This bug was uncovered by one of our legacy service that
> > are still using ipvs for load balancing.  In that setup,
> > the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> > before tx it out to eth0.
> > 
> > The problem is the kernel stack could pass a skb (which was
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU.  In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > stalled for tens of seconds which is how it got uncovered.
> > (On the NIC side, the NIC firmware and driver have been fixed
> > to avoid this tx queue stall after seeing this skb).
> 
> 	In the last week I analyzed the situation and found
> that just changes in route.c are able to solve the problems,
> at 99% :) I'm posting a separate patch for this.
> 
> 	Here is what happens, I'm testing with FTP which
> starts with connection to port 21 and then with data connection,
> from local ftp client to local Virtual IP (forwarded to remote
> real server via tunnel).
> 
> - TCP creates local route which after commit 839da4d98960
> appears to be non-cached, before this commit it is cached.
> Route is saved and reused.
> 
> - initial traffic for port 21 does not use GSO. But after
> every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
> to report the reduced MTU. These updates are stored in fnhe_pmtu
> but they do not go to any route, even if we try to get fresh
> output route. Why? Because the local routes are not cached, so
> they can not use the fnhe. This is what my patch for route.c
> will fix. With this fix FTP-DATA gets route with reduced PMTU.
For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in
__ip6_rt_update_pmtu() may need to be lifted also.

> 
> - later, there are large GSO packets in the FTP-DATA connection.
> Currently, IPVS does not send ICMP error for exceeded PMTU
> by GSO packets (this will be fixed, see patch below). Even
> if they reach TCP and it refreshes its route, TCP can not get
> any actual PMTU values from the routing, so continues to
> use the large gso_size without the patch for route.c
> 
> 	For the changes in IPVS that I show below as RFC:
> 
> - I synchronized the PMTU checks in __mtu_check_toobig* with
> IPv4/IPv6 forwarding
> 
> - I modified your changes, see ipvs_gso_hlen() and how I use it
> at start of ensure_mtu_is_adequate(), after skb is made writable,
> before the PMTU checks.
> 
> 	In my tests, all variants work, just with skb_decrease_gso_size
> or even if I add SKB_GSO_DODGY+gso_segs=0. But I'm not sure
> if this is safe for the different GSO configurations.
> 
> 	I'm analyzing what can be changed in route.c, so that
> dst.ops->update_pmtu changes to take effect for the provided
> route. If that is possible, it will allow the PMTU change to
> take immediate effect for the local routes.
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 4527921..7a2a0a89 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -104,9 +104,32 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
>  	return dest_dst;
>  }
>  
> -static inline bool
> -__mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> +/* Check if packet size violates MTU */
> +static bool __mtu_check_toobig(const struct sk_buff *skb, u32 mtu)
>  {
> +	if (skb->len <= mtu)
> +		return false;
> +
> +	if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
> +		return false;
> +
> +	if (IPCB(skb)->frag_max_size) {
> +		/* frag_max_size tell us that, this packet have been
> +		 * defragmented by netfilter IPv4 conntrack module.
> +		 */
> +		if (IPCB(skb)->frag_max_size > mtu)
> +			return true; /* largest fragment violate MTU */
> +	}
> +	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
> +		return false;
> +	return true;
> +}
> +
> +/* Check if packet size violates MTU */
> +static bool __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> +{
> +	if (skb->len <= mtu)
> +		return false;
>  	if (IP6CB(skb)->frag_max_size) {
>  		/* frag_max_size tell us that, this packet have been
>  		 * defragmented by netfilter IPv6 conntrack module.
> @@ -114,10 +137,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
>  		if (IP6CB(skb)->frag_max_size > mtu)
>  			return true; /* largest fragment violate MTU */
>  	}
> -	else if (skb->len > mtu && !skb_is_gso(skb)) {
> -		return true; /* Packet size violate MTU size */
> -	}
> -	return false;
> +	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
> +		return false;
> +	return true;
>  }
>  
>  /* Get route to daddr, update *saddr, optionally bind route to saddr */
> @@ -212,11 +234,42 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
>  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
>  }
>  
> +/* GSO: length of network and transport headers, 0=unsupported */
> +static unsigned short ipvs_gso_hlen(struct sk_buff *skb,
> +				    struct ip_vs_iphdr *ipvsh)
> +{
> +	unsigned short hlen = ipvsh->len - ipvsh->off;
> +
> +	if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) {
> +		struct tcphdr _tcph, *th;
> +
> +		th = skb_header_pointer(skb, ipvsh->len, sizeof(_tcph), &_tcph);
> +		if (th)
> +			return hlen + (th->doff << 2);
> +	}
> +	return 0;
> +}
> + 
>  static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  					  int rt_mode,
>  					  struct ip_vs_iphdr *ipvsh,
>  					  struct sk_buff *skb, int mtu)
>  {
> +	/* Re-segment traffic from local clients */
> +	if (!skb->dev && skb_is_gso(skb)) {
> +		unsigned short hlen = ipvs_gso_hlen(skb, ipvsh);
> +
> +		if (hlen && mtu - hlen < skb_shinfo(skb)->gso_size &&
> +		    mtu > hlen) {
> +			u16 reduce = skb_shinfo(skb)->gso_size - (mtu - hlen);
> +
> +			IP_VS_DBG(10, "Reducing gso_size=%u with %u\n",
> +				skb_shinfo(skb)->gso_size, reduce);
> +			skb_decrease_gso_size(skb_shinfo(skb), reduce);
> +			skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> +			skb_shinfo(skb)->gso_segs = 0;
> +		}
> +	}
>  #ifdef CONFIG_IP_VS_IPV6
>  	if (skb_af == AF_INET6) {
>  		struct net *net = ipvs->net;
> @@ -240,9 +293,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  		if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs))
>  			return true;
>  
> -		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
> -			     skb->len > mtu && !skb_is_gso(skb) &&
> -			     !ip_vs_iph_icmp(ipvsh))) {
> +		if (unlikely(__mtu_check_toobig(skb, mtu))) {
>  			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>  				  htonl(mtu));
>  			IP_VS_DBG(1, "frag needed for %pI4\n",
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-05-02 17:10     ` Martin KaFai Lau
  (?)
@ 2018-05-02 19:30     ` Julian Anastasov
  2018-05-03  7:01         ` Martin KaFai Lau
  -1 siblings, 1 reply; 17+ messages in thread
From: Julian Anastasov @ 2018-05-02 19:30 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel


	Hello,

On Wed, 2 May 2018, Martin KaFai Lau wrote:

> On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote:
> > 
> > - initial traffic for port 21 does not use GSO. But after
> > every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
> > to report the reduced MTU. These updates are stored in fnhe_pmtu
> > but they do not go to any route, even if we try to get fresh
> > output route. Why? Because the local routes are not cached, so
> > they can not use the fnhe. This is what my patch for route.c
> > will fix. With this fix FTP-DATA gets route with reduced PMTU.
> For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in
> __ip6_rt_update_pmtu() may need to be lifted also.

	Probably. I completely forgot the IPv6 part
but as I don't know the IPv6 code enough, it may take
some time to understand what can be the problem there...
I'm not sure whether everything started with commit 0a6b2a1dc2a2,
so that in some configurations before that commit things
worked and problem was not noticed.

	I think, we should focus on such direction for IPv6:

- do we remember per-VIP PMTU for the local routes

- when exactly we start to use the new PMTU, eg. what happens
in case socket caches the route, whether route is killed via
dst->obsolete. Or may be while the PMTU expiration is handled
per-packet, the PMTU change is noticed only on ICMP...

- as IPVS reports the PMTU via dst.ops->update_pmtu() long
before any large packets are sent, do we propagate the
PMTU. Also, for IPv4 __ip_rt_update_pmtu() has some protection
from such per-packet updates that do not change the PMTU.

- if IPVS starts to send ICMP when gso_size exceeds PMTU,
like in my draft patch, whether the PMTU is propagated
to route and then to socket. As for the gso_size decrease,
playing in IPVS is not very safe, at least, we need help
from GSO experts to know how we should use it.

Regards

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-05-02 19:30     ` Julian Anastasov
@ 2018-05-03  7:01         ` Martin KaFai Lau
  0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-05-03  7:01 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel

On Wed, May 02, 2018 at 10:30:32PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Wed, 2 May 2018, Martin KaFai Lau wrote:
> 
> > On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote:
> > > 
> > > - initial traffic for port 21 does not use GSO. But after
> > > every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
> > > to report the reduced MTU. These updates are stored in fnhe_pmtu
> > > but they do not go to any route, even if we try to get fresh
> > > output route. Why? Because the local routes are not cached, so
> > > they can not use the fnhe. This is what my patch for route.c
> > > will fix. With this fix FTP-DATA gets route with reduced PMTU.
> > For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in
> > __ip6_rt_update_pmtu() may need to be lifted also.
> 
> 	Probably. I completely forgot the IPv6 part
> but as I don't know the IPv6 code enough, it may take
> some time to understand what can be the problem there...
> I'm not sure whether everything started with commit 0a6b2a1dc2a2,
> so that in some configurations before that commit things
> worked and problem was not noticed.
> 
> 	I think, we should focus on such direction for IPv6:
> 
> - do we remember per-VIP PMTU for the local routes
IPv6 used not to create cache route for DST_HOST route which
is a /128 route (that includes local /128 route).

Because of this, it had a bug such that a PMTU for the DST_HOST
route will trigger dst.ops->update_pmtu() which then set
an expire on the permanent /128 route instead of a cache
route.  The permanent route got unexpectedly expired/removed
later.

The fix was to allow creating /128 cache route as long as
it is not RTF_LOCAL in 653437d02f1f and 7035870d1219.  The
first post spelled out the problem better:
https://patchwork.ozlabs.org/patch/456050/

Later, when we only create cache route after seeing PMTU
in 45e4fd26683c, this RTF_LOCAL checking was carried over
to __ip6_rt_update_pmtu().

Out of my head, I don't see issue removing the
RTF_LOCAL check from __ip6_rt_update_pmtu().
DavidA, what do you think?

> 
> - when exactly we start to use the new PMTU, eg. what happens
> in case socket caches the route, whether route is killed via
> dst->obsolete. Or may be while the PMTU expiration is handled
> per-packet, the PMTU change is noticed only on ICMP...
Before sk can reuse its dst cache, the sk will notice
its dst cache is no longer valid by calling dst_check().
dst_check() should return NULL which is one of the side
effect of the earlier update_pmtu().  This dst_check()
is usually only called when the sk needs to do output,
so the new PMTU route (i.e. the RTF_CACHE IPv6 route)
only have effect to the later packets.

> 
> - as IPVS reports the PMTU via dst.ops->update_pmtu() long
> before any large packets are sent, do we propagate the
> PMTU. Also, for IPv4 __ip_rt_update_pmtu() has some protection
> from such per-packet updates that do not change the PMTU.
> 
> - if IPVS starts to send ICMP when gso_size exceeds PMTU,
> like in my draft patch, whether the PMTU is propagated
> to route and then to socket. As for the gso_size decrease,
> playing in IPVS is not very safe, at least, we need help
> from GSO experts to know how we should use it.
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
@ 2018-05-03  7:01         ` Martin KaFai Lau
  0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-05-03  7:01 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel

On Wed, May 02, 2018 at 10:30:32PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Wed, 2 May 2018, Martin KaFai Lau wrote:
> 
> > On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote:
> > > 
> > > - initial traffic for port 21 does not use GSO. But after
> > > every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
> > > to report the reduced MTU. These updates are stored in fnhe_pmtu
> > > but they do not go to any route, even if we try to get fresh
> > > output route. Why? Because the local routes are not cached, so
> > > they can not use the fnhe. This is what my patch for route.c
> > > will fix. With this fix FTP-DATA gets route with reduced PMTU.
> > For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in
> > __ip6_rt_update_pmtu() may need to be lifted also.
> 
> 	Probably. I completely forgot the IPv6 part
> but as I don't know the IPv6 code enough, it may take
> some time to understand what can be the problem there...
> I'm not sure whether everything started with commit 0a6b2a1dc2a2,
> so that in some configurations before that commit things
> worked and problem was not noticed.
> 
> 	I think, we should focus on such direction for IPv6:
> 
> - do we remember per-VIP PMTU for the local routes
IPv6 used not to create cache route for DST_HOST route which
is a /128 route (that includes local /128 route).

Because of this, it had a bug such that a PMTU for the DST_HOST
route will trigger dst.ops->update_pmtu() which then set
an expire on the permanent /128 route instead of a cache
route.  The permanent route got unexpectedly expired/removed
later.

The fix was to allow creating /128 cache route as long as
it is not RTF_LOCAL in 653437d02f1f and 7035870d1219.  The
first post spelled out the problem better:
https://patchwork.ozlabs.org/patch/456050/

Later, when we only create cache route after seeing PMTU
in 45e4fd26683c, this RTF_LOCAL checking was carried over
to __ip6_rt_update_pmtu().

Out of my head, I don't see issue removing the
RTF_LOCAL check from __ip6_rt_update_pmtu().
DavidA, what do you think?

> 
> - when exactly we start to use the new PMTU, eg. what happens
> in case socket caches the route, whether route is killed via
> dst->obsolete. Or may be while the PMTU expiration is handled
> per-packet, the PMTU change is noticed only on ICMP...
Before sk can reuse its dst cache, the sk will notice
its dst cache is no longer valid by calling dst_check().
dst_check() should return NULL which is one of the side
effect of the earlier update_pmtu().  This dst_check()
is usually only called when the sk needs to do output,
so the new PMTU route (i.e. the RTF_CACHE IPv6 route)
only have effect to the later packets.

> 
> - as IPVS reports the PMTU via dst.ops->update_pmtu() long
> before any large packets are sent, do we propagate the
> PMTU. Also, for IPv4 __ip_rt_update_pmtu() has some protection
> from such per-packet updates that do not change the PMTU.
> 
> - if IPVS starts to send ICMP when gso_size exceeds PMTU,
> like in my draft patch, whether the PMTU is propagated
> to route and then to socket. As for the gso_size decrease,
> playing in IPVS is not very safe, at least, we need help
> from GSO experts to know how we should use it.
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-05-03  7:01         ` Martin KaFai Lau
  (?)
@ 2018-05-04  3:50         ` David Ahern
  -1 siblings, 0 replies; 17+ messages in thread
From: David Ahern @ 2018-05-04  3:50 UTC (permalink / raw)
  To: Martin KaFai Lau, Julian Anastasov
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team,
	lvs-devel

On 5/3/18 1:01 AM, Martin KaFai Lau wrote:
> On Wed, May 02, 2018 at 10:30:32PM +0300, Julian Anastasov wrote:
>>
>> 	Hello,
>>
>> On Wed, 2 May 2018, Martin KaFai Lau wrote:
>>
>>> On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote:
>>>>
>>>> - initial traffic for port 21 does not use GSO. But after
>>>> every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
>>>> to report the reduced MTU. These updates are stored in fnhe_pmtu
>>>> but they do not go to any route, even if we try to get fresh
>>>> output route. Why? Because the local routes are not cached, so
>>>> they can not use the fnhe. This is what my patch for route.c
>>>> will fix. With this fix FTP-DATA gets route with reduced PMTU.
>>> For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in
>>> __ip6_rt_update_pmtu() may need to be lifted also.
>>
>> 	Probably. I completely forgot the IPv6 part
>> but as I don't know the IPv6 code enough, it may take
>> some time to understand what can be the problem there...
>> I'm not sure whether everything started with commit 0a6b2a1dc2a2,
>> so that in some configurations before that commit things
>> worked and problem was not noticed.
>>
>> 	I think, we should focus on such direction for IPv6:
>>
>> - do we remember per-VIP PMTU for the local routes
> IPv6 used not to create cache route for DST_HOST route which
> is a /128 route (that includes local /128 route).
> 
> Because of this, it had a bug such that a PMTU for the DST_HOST
> route will trigger dst.ops->update_pmtu() which then set
> an expire on the permanent /128 route instead of a cache
> route.  The permanent route got unexpectedly expired/removed
> later.
> 
> The fix was to allow creating /128 cache route as long as
> it is not RTF_LOCAL in 653437d02f1f and 7035870d1219.  The
> first post spelled out the problem better:
> https://patchwork.ozlabs.org/patch/456050/
> 
> Later, when we only create cache route after seeing PMTU
> in 45e4fd26683c, this RTF_LOCAL checking was carried over
> to __ip6_rt_update_pmtu().
> 
> Out of my head, I don't see issue removing the
> RTF_LOCAL check from __ip6_rt_update_pmtu().
> DavidA, what do you think?

I agree; I think it is fine. A route is route. The IPVS use cases with
local redirects are blurring the line with needs for local routes.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-05-03  7:01         ` Martin KaFai Lau
  (?)
  (?)
@ 2018-05-05 12:58         ` Julian Anastasov
  2018-05-07 16:22             ` Martin KaFai Lau
  2018-06-05 21:31             ` Martin KaFai Lau
  -1 siblings, 2 replies; 17+ messages in thread
From: Julian Anastasov @ 2018-05-05 12:58 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel


	Hello,

On Thu, 3 May 2018, Martin KaFai Lau wrote:

> > - when exactly we start to use the new PMTU, eg. what happens
> > in case socket caches the route, whether route is killed via
> > dst->obsolete. Or may be while the PMTU expiration is handled
> > per-packet, the PMTU change is noticed only on ICMP...
> Before sk can reuse its dst cache, the sk will notice
> its dst cache is no longer valid by calling dst_check().
> dst_check() should return NULL which is one of the side
> effect of the earlier update_pmtu().  This dst_check()
> is usually only called when the sk needs to do output,
> so the new PMTU route (i.e. the RTF_CACHE IPv6 route)
> only have effect to the later packets.

	I checked again the code and it looks like sockets
are forced to use new exceptional route (RTF_CACHE/fnhe) via
dst_check only when the PMTU update should move them away
from old non-exceptional routes. Later, if PMTU is
reduced/updated this is noticed for every packet via dst_mtu,
as in the case with TCP.

	So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
we should have no other issues. Only one minor bit is strange to me,
why rt6_insert_exception warns for RTF_PCPU if rt6_cache_allowed_for_pmtu
allows it when returning true... 

	Also, commit 0d3f6d297bfb allows rt6_do_update_pmtu() for
routes without RTF_CACHE, RTF_PCPU and rt6i_node. Should we
restrict rt6_do_update_pmtu only to RTF_CACHE routes?

 	if (!rt6_cache_allowed_for_pmtu(rt6)) {
-		rt6_do_update_pmtu(rt6, mtu);
-		/* update rt6_ex->stamp for cache */
-		if (rt6->rt6i_flags & RTF_CACHE)
+		if (rt6->rt6i_flags & RTF_CACHE) {
+			rt6_do_update_pmtu(rt6, mtu);
+			/* update rt6_ex->stamp for cache */
 			rt6_update_exception_stamp_rt(rt6);
+		}
 	} else if (daddr) {

Regards

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-05-05 12:58         ` Julian Anastasov
@ 2018-05-07 16:22             ` Martin KaFai Lau
  2018-06-05 21:31             ` Martin KaFai Lau
  1 sibling, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-05-07 16:22 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel

On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 3 May 2018, Martin KaFai Lau wrote:
> 
> > > - when exactly we start to use the new PMTU, eg. what happens
> > > in case socket caches the route, whether route is killed via
> > > dst->obsolete. Or may be while the PMTU expiration is handled
> > > per-packet, the PMTU change is noticed only on ICMP...
> > Before sk can reuse its dst cache, the sk will notice
> > its dst cache is no longer valid by calling dst_check().
> > dst_check() should return NULL which is one of the side
> > effect of the earlier update_pmtu().  This dst_check()
> > is usually only called when the sk needs to do output,
> > so the new PMTU route (i.e. the RTF_CACHE IPv6 route)
> > only have effect to the later packets.
> 
> 	I checked again the code and it looks like sockets
> are forced to use new exceptional route (RTF_CACHE/fnhe) via
> dst_check only when the PMTU update should move them away
> from old non-exceptional routes. Later, if PMTU is
> reduced/updated this is noticed for every packet via dst_mtu,
> as in the case with TCP.
> 
> 	So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
> we should have no other issues. Only one minor bit is strange to me,
> why rt6_insert_exception warns for RTF_PCPU if rt6_cache_allowed_for_pmtu
> allows it when returning true...
hmm...I am not sure I follow this bits.  Where is the warn?

Note that "nrt6" and "from" are passed to rt6_insert_exception()
instead of "rt6".

> 
> 	Also, commit 0d3f6d297bfb allows rt6_do_update_pmtu() for
> routes without RTF_CACHE, RTF_PCPU and rt6i_node. Should we
> restrict rt6_do_update_pmtu only to RTF_CACHE routes?
> 
>  	if (!rt6_cache_allowed_for_pmtu(rt6)) {
> -		rt6_do_update_pmtu(rt6, mtu);
The existing rt6_do_update_pmtu() looks correct.
The mtu of the dst created by icmp6_dst_alloc()
needs to be udpated and this dst does not have
the RTF_CACHE.


> -		/* update rt6_ex->stamp for cache */
> -		if (rt6->rt6i_flags & RTF_CACHE)
> +		if (rt6->rt6i_flags & RTF_CACHE) {
> +			rt6_do_update_pmtu(rt6, mtu);
> +			/* update rt6_ex->stamp for cache */
>  			rt6_update_exception_stamp_rt(rt6);
> +		}
>  	} else if (daddr) {
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
@ 2018-05-07 16:22             ` Martin KaFai Lau
  0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-05-07 16:22 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel

On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 3 May 2018, Martin KaFai Lau wrote:
> 
> > > - when exactly we start to use the new PMTU, eg. what happens
> > > in case socket caches the route, whether route is killed via
> > > dst->obsolete. Or may be while the PMTU expiration is handled
> > > per-packet, the PMTU change is noticed only on ICMP...
> > Before sk can reuse its dst cache, the sk will notice
> > its dst cache is no longer valid by calling dst_check().
> > dst_check() should return NULL which is one of the side
> > effect of the earlier update_pmtu().  This dst_check()
> > is usually only called when the sk needs to do output,
> > so the new PMTU route (i.e. the RTF_CACHE IPv6 route)
> > only have effect to the later packets.
> 
> 	I checked again the code and it looks like sockets
> are forced to use new exceptional route (RTF_CACHE/fnhe) via
> dst_check only when the PMTU update should move them away
> from old non-exceptional routes. Later, if PMTU is
> reduced/updated this is noticed for every packet via dst_mtu,
> as in the case with TCP.
> 
> 	So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
> we should have no other issues. Only one minor bit is strange to me,
> why rt6_insert_exception warns for RTF_PCPU if rt6_cache_allowed_for_pmtu
> allows it when returning true...
hmm...I am not sure I follow this bits.  Where is the warn?

Note that "nrt6" and "from" are passed to rt6_insert_exception()
instead of "rt6".

> 
> 	Also, commit 0d3f6d297bfb allows rt6_do_update_pmtu() for
> routes without RTF_CACHE, RTF_PCPU and rt6i_node. Should we
> restrict rt6_do_update_pmtu only to RTF_CACHE routes?
> 
>  	if (!rt6_cache_allowed_for_pmtu(rt6)) {
> -		rt6_do_update_pmtu(rt6, mtu);
The existing rt6_do_update_pmtu() looks correct.
The mtu of the dst created by icmp6_dst_alloc()
needs to be udpated and this dst does not have
the RTF_CACHE.


> -		/* update rt6_ex->stamp for cache */
> -		if (rt6->rt6i_flags & RTF_CACHE)
> +		if (rt6->rt6i_flags & RTF_CACHE) {
> +			rt6_do_update_pmtu(rt6, mtu);
> +			/* update rt6_ex->stamp for cache */
>  			rt6_update_exception_stamp_rt(rt6);
> +		}
>  	} else if (daddr) {
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-05-07 16:22             ` Martin KaFai Lau
  (?)
@ 2018-05-07 17:00             ` Julian Anastasov
  -1 siblings, 0 replies; 17+ messages in thread
From: Julian Anastasov @ 2018-05-07 17:00 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel


	Hello,

On Mon, 7 May 2018, Martin KaFai Lau wrote:

> On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote:
> > 
> > 	So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
> > we should have no other issues. Only one minor bit is strange to me,
> > why rt6_insert_exception warns for RTF_PCPU if rt6_cache_allowed_for_pmtu
> > allows it when returning true...
> hmm...I am not sure I follow this bits.  Where is the warn?

        if (ort->rt6i_flags & (RTF_CACHE | RTF_PCPU))
                ort = ort->from;

	Sorry, my fault, I missed above re-assignment...

        WARN_ON_ONCE(ort->rt6i_flags & (RTF_CACHE | RTF_PCPU));

> Note that "nrt6" and "from" are passed to rt6_insert_exception()
> instead of "rt6".
> 
> > 
> > 	Also, commit 0d3f6d297bfb allows rt6_do_update_pmtu() for
> > routes without RTF_CACHE, RTF_PCPU and rt6i_node. Should we
> > restrict rt6_do_update_pmtu only to RTF_CACHE routes?
> > 
> >  	if (!rt6_cache_allowed_for_pmtu(rt6)) {
> > -		rt6_do_update_pmtu(rt6, mtu);
> The existing rt6_do_update_pmtu() looks correct.
> The mtu of the dst created by icmp6_dst_alloc()
> needs to be udpated and this dst does not have
> the RTF_CACHE.

	Aha, ok. I thought, only RTF_CACHE routes can
hold PMTU.

Regards

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-05-05 12:58         ` Julian Anastasov
@ 2018-06-05 21:31             ` Martin KaFai Lau
  2018-06-05 21:31             ` Martin KaFai Lau
  1 sibling, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-06-05 21:31 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel

On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote:
> 	So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
> we should have no other issues. 
Hi Julian,

Do you have a chance to work on the IPv6 side?

Thanks,
Martin

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
@ 2018-06-05 21:31             ` Martin KaFai Lau
  0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2018-06-05 21:31 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel

On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote:
> 	So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
> we should have no other issues. 
Hi Julian,

Do you have a chance to work on the IPv6 side?

Thanks,
Martin

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
  2018-06-05 21:31             ` Martin KaFai Lau
  (?)
@ 2018-06-05 22:35             ` Julian Anastasov
  -1 siblings, 0 replies; 17+ messages in thread
From: Julian Anastasov @ 2018-06-05 22:35 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel


	Hello,

On Tue, 5 Jun 2018, Martin KaFai Lau wrote:

> On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote:
> > 	So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
> > we should have no other issues. 
> Hi Julian,
> 
> Do you have a chance to work on the IPv6 side?

	I was chasing other IPVS issues while preparing
to finalize these changes. I plan to do tests this weekend
and to submit my patch but without gso_size modifications.
Can post patch for this check in __ip6_rt_update_pmtu too,
separately after making sure all is fine.

Regards

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-06-05 22:35 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-19 21:23 [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP Martin KaFai Lau
2018-04-20 20:43 ` Julian Anastasov
2018-04-20 23:13   ` Martin KaFai Lau
2018-05-02  6:38 ` Julian Anastasov
2018-05-02 17:10   ` Martin KaFai Lau
2018-05-02 17:10     ` Martin KaFai Lau
2018-05-02 19:30     ` Julian Anastasov
2018-05-03  7:01       ` Martin KaFai Lau
2018-05-03  7:01         ` Martin KaFai Lau
2018-05-04  3:50         ` David Ahern
2018-05-05 12:58         ` Julian Anastasov
2018-05-07 16:22           ` Martin KaFai Lau
2018-05-07 16:22             ` Martin KaFai Lau
2018-05-07 17:00             ` Julian Anastasov
2018-06-05 21:31           ` Martin KaFai Lau
2018-06-05 21:31             ` Martin KaFai Lau
2018-06-05 22:35             ` Julian Anastasov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.