Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP

From: Julian Anastasov <ja@ssi.bg>
To: Martin KaFai Lau <kafai@fb.com>
Cc: netdev@vger.kernel.org, Tom Herbert <tom@herbertland.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Nikita Shirokov <tehnerd@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
Date: Fri, 20 Apr 2018 23:43:59 +0300 (EEST)	[thread overview]
Message-ID: <alpine.LFD.2.20.1804202205290.2906@ja.home.ssi.bg> (raw)
In-Reply-To: <20180419212324.1542504-1-kafai@fb.com>

	Hello,

On Thu, 19 Apr 2018, Martin KaFai Lau wrote:

> This patch is not a proper fix and mainly serves for discussion purpose.
> It is based on net-next which I have been using to debug the issue.
> 
> The change that works around the issue is in ensure_mtu_is_adequate().
> Other changes are the rippling effect in function arg.
> 
> This bug was uncovered by one of our legacy service that
> are still using ipvs for load balancing.  In that setup,
> the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> before tx it out to eth0.
> 
> The problem is the kernel stack could pass a skb (which was
> originated from a sys_write(tcp_fd)) to the driver with skb->len
> bigger than the device MTU.  In one NIC setup (with gso and tso off)
> that we are using, it upset the NIC/driver and caused the tx queue
> stalled for tens of seconds which is how it got uncovered.
> (On the NIC side, the NIC firmware and driver have been fixed
> to avoid this tx queue stall after seeing this skb).
> 
> On the kernel side, based on the commit log, this bug should have
> been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").

	Before I go to deeply analyze the GSO code, here
are some preliminary observations:

- IPVS started to use iptunnel_handle_offloads() around 2014,
commit ea1d5d7755a3

- later (2016) the "ip6ip6: Support for GSO/GRO" commits started
to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
I'm not sure if such call can help. At least, I don't see what
different happens in IPVS compared to ip6ip6_tnl_xmit(), for example.

> Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> Due to this error out, it avoid passing it to the driver.  The TCP
> stack then timeout and the TCP mtu probing eventually kicked in to
> lower the skb->len enough to avoid gso_segment.
> 
> After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> -> tcp6_gso_segment() which segment the packet based on a mss
> that does not account for the extra IPv6 hdr.
> 
> Here is a stack from the WARN_ON() that we added to the driver to
> capture the issue:
> [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> ...
> [ 1129.016536] Call Trace:
> [ 1129.021412]  ? skb_release_data+0xfc/0x120
> [ 1129.029587]  ? kfree_skbmem+0x64/0x70
> [ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
> [ 1129.045262]  sch_direct_xmit+0x10f/0x280
> [ 1129.053111]  __qdisc_run+0x223/0x5a0
> [ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
> [ 1129.068268]  dev_queue_xmit+0x10/0x20
> [ 1129.075573]  ? dev_queue_xmit+0x10/0x20
> [ 1129.083218]  ip6_finish_output2+0x2db/0x490
> [ 1129.091573]  ip6_finish_output+0x125/0x190
> [ 1129.099754]  ip6_output+0x5f/0x100
> [ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
> [ 1129.114212]  ip6_local_out+0x35/0x40
> [ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> [ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
> [ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> [ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> [ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> [ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
> [ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
> [ 1129.188849]  nf_hook_slow+0x3e/0xc0
> [ 1129.195800]  ip6_xmit+0x30b/0x540
> [ 1129.202421]  ? ac6_proc_exit+0x20/0x20
> [ 1129.209909]  inet6_csk_xmit+0x82/0xd0
> [ 1129.217207]  ? lock_timer_base+0x76/0xa0
> [ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
> [ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
> [ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
> [ 1129.250106]  tcp_push+0xde/0x100
> [ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
> [ 1129.264910]  tcp_sendmsg+0x2c/0x50
> [ 1129.271703]  inet_sendmsg+0x31/0xb0
> [ 1129.278672]  sock_write_iter+0xf8/0x110
> [ 1129.286335]  new_sync_write+0xd9/0x120
> [ 1129.293823]  vfs_write+0x18d/0x1e0
> [ 1129.300614]  SyS_write+0x48/0xa0
> [ 1129.307045]  do_syscall_64+0x69/0x1e0
> [ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> ...
> [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> [ 1129.657407] skb->len:1554 MTU:1522
> 
> The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> 
> [host-a] > ip -6 r show table local
> local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> 
> [host-a] > ip -6 a
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
>     inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
>        valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
>     inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
>        valid_lft forever preferred_lft forever
> 
> [host-a] > cat /proc/net/ip_vs
> TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
>   -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
>   -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
>   -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7
> 
> [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> send-something-long-here-to-trigger-the-bug
> 
> Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> can also side step the issue.  Like this:

	Yes, reducing the MTU was a solution recommended from
long time ago in the IPVS HOWTO.

> 
> > ip -6 r show table local
> local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
>  1 file changed, 33 insertions(+), 16 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 11c416f3d6e3..88cc0d53ebce 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
>  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
>  }
>  
> -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
>  					  int rt_mode,
>  					  struct ip_vs_iphdr *ipvsh,
>  					  struct sk_buff *skb, int mtu)
>  {
> +	struct netns_ipvs *ipvs = cp->ipvs;
> +
>  #ifdef CONFIG_IP_VS_IPV6
> -	if (skb_af == AF_INET6) {
> +	if (cp->af == AF_INET6) {
>  		struct net *net = ipvs->net;
>  
>  		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  		}
>  	}
>  
> +	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> +		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> +		unsigned short hdr_len = (th->doff << 2) +
> +			skb_network_header_len(skb);
> +
> +		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> +			skb_decrease_gso_size(skb_shinfo(skb),
> +					      skb_shinfo(skb)->gso_size -
> +					      (mtu - hdr_len));

	So, software segmentation happens and we want the
tunnel header to be accounted immediately and not after PMTU
probing period? Is this a problem only for IPVS tunnels?
Do we see such delays with other tunnels? May be this should
be solved for all protocols (not just TCP) and for all tunnels?
Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
local sender, so we should really alter the gso_size for proper
segmentation?

Regards