From mboxrd@z Thu Jan  1 00:00:00 1970
From: Martin KaFai Lau
Subject: Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
Date: Fri, 20 Apr 2018 16:13:13 -0700
Message-ID: <20180420231311.xissamn6go5fznua@kafai-mbp>
References: <20180419212324.1542504-1-kafai@fb.com>
To: Julian Anastasov
Cc: , Tom Herbert , Eric Dumazet , Nikita Shirokov ,

On Fri, Apr 20, 2018 at 11:43:59PM +0300, Julian Anastasov wrote:
>
> 	Hello,
>
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
>
> > This patch is not a proper fix and mainly serves for discussion purposes.
> > It is based on net-next, which I have been using to debug the issue.
> >
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > The other changes are the ripple effect in the function arguments.
> >
> > This bug was uncovered by one of our legacy services that is
> > still using ipvs for load balancing. In that setup,
> > ipvs encapsulates the ipv6-tcp packet in another ipv6 hdr
> > before tx-ing it out to eth0.
> >
> > The problem is that the kernel stack can pass a skb (which
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU. In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > to stall for tens of seconds, which is how the bug got uncovered.
> > (On the NIC side, the NIC firmware and driver have since been fixed
> > to avoid this tx queue stall after seeing such a skb.)
> >
> > On the kernel side, based on the commit log, this bug should have
> > been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").
>
> 	Before I go deeply into analyzing the GSO code, here
> are some preliminary observations:
>
> - IPVS started to use iptunnel_handle_offloads() around 2014,
> commit ea1d5d7755a3
>
> - later (2016) the "ip6ip6: Support for GSO/GRO" commits started
> to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
> I'm not sure if such a call can help. At least, I don't see what
> happens differently in IPVS compared to ip6ip6_tnl_xmit(), for example.
I don't see skb->inner_ipproto used in
ipv6_gso_segment() -> ipv6_gso_segment() -> tcp6_gso_segment()

One thing worth highlighting is that this problem sort of existed
even before commit 815d22e55b0e; it just happened that
ipv6_gso_segment() errored out (-EPROTONOSUPPORT), which stopped
the skb from being sent down to the driver.

> > Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> > out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> > Due to this error, the skb was never passed to the driver. The TCP
> > stack then timed out, and TCP mtu probing eventually kicked in to
> > lower the skb->len enough to avoid gso_segment.
> >
> > After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> > -> tcp6_gso_segment() segments the packet based on a mss
> > that does not account for the extra IPv6 hdr.
> >
> > Here is a stack from the WARN_ON() that we added to the driver to
> > capture the issue:
> > [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> > ...
> > [ 1129.016536] Call Trace:
> > [ 1129.021412]  ? skb_release_data+0xfc/0x120
> > [ 1129.029587]  ? kfree_skbmem+0x64/0x70
> > [ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
> > [ 1129.045262]  sch_direct_xmit+0x10f/0x280
> > [ 1129.053111]  __qdisc_run+0x223/0x5a0
> > [ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
> > [ 1129.068268]  dev_queue_xmit+0x10/0x20
> > [ 1129.075573]  ? dev_queue_xmit+0x10/0x20
> > [ 1129.083218]  ip6_finish_output2+0x2db/0x490
> > [ 1129.091573]  ip6_finish_output+0x125/0x190
> > [ 1129.099754]  ip6_output+0x5f/0x100
> > [ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
> > [ 1129.114212]  ip6_local_out+0x35/0x40
> > [ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> > [ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> > [ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> > [ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
> > [ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
> > [ 1129.188849]  nf_hook_slow+0x3e/0xc0
> > [ 1129.195800]  ip6_xmit+0x30b/0x540
> > [ 1129.202421]  ? ac6_proc_exit+0x20/0x20
> > [ 1129.209909]  inet6_csk_xmit+0x82/0xd0
> > [ 1129.217207]  ? lock_timer_base+0x76/0xa0
> > [ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
> > [ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
> > [ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
> > [ 1129.250106]  tcp_push+0xde/0x100
> > [ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
> > [ 1129.264910]  tcp_sendmsg+0x2c/0x50
> > [ 1129.271703]  inet_sendmsg+0x31/0xb0
> > [ 1129.278672]  sock_write_iter+0xf8/0x110
> > [ 1129.286335]  new_sync_write+0xd9/0x120
> > [ 1129.293823]  vfs_write+0x18d/0x1e0
> > [ 1129.300614]  SyS_write+0x48/0xa0
> > [ 1129.307045]  do_syscall_64+0x69/0x1e0
> > [ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > ...
> > [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> > [ 1129.657407] skb->len:1554 MTU:1522
> >
> > The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> >
> > [host-a] > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> >
> > [host-a] > ip -6 a
> > 1: lo: mtu 65536 state UNKNOWN qlen 1000
> >     inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
> >        valid_lft forever preferred_lft forever
> > 2: eth0: mtu 1500 state UP qlen 1000
> >     inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
> >        valid_lft forever preferred_lft forever
> >
> > [host-a] > cat /proc/net/ip_vs
> > TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
> >   -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9  6
> >   -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8  6
> >   -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19 7
> >
> > [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> > send-something-long-here-to-trigger-the-bug
> >
> > Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> > can also sidestep the issue. Like this:
>
> 	Yes, reducing the MTU was a solution recommended
> long ago in the IPVS HOWTO.
> >
> > > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> >
> > Signed-off-by: Martin KaFai Lau
> > ---
> >  net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
> >  1 file changed, 33 insertions(+), 16 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 11c416f3d6e3..88cc0d53ebce 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
> >  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
> >  }
> >
> > -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> > +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
> >  					  int rt_mode,
> >  					  struct ip_vs_iphdr *ipvsh,
> >  					  struct sk_buff *skb, int mtu)
> >  {
> > +	struct netns_ipvs *ipvs = cp->ipvs;
> > +
> >  #ifdef CONFIG_IP_VS_IPV6
> > -	if (skb_af == AF_INET6) {
> > +	if (cp->af == AF_INET6) {
> >  		struct net *net = ipvs->net;
> >
> >  		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> > @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> >  		}
> >  	}
> >
> > +	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> > +		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> > +		unsigned short hdr_len = (th->doff << 2) +
> > +			skb_network_header_len(skb);
> > +
> > +		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> > +			skb_decrease_gso_size(skb_shinfo(skb),
> > +					      skb_shinfo(skb)->gso_size -
> > +					      (mtu - hdr_len));
>
> 	So, software segmentation happens and we want the
> tunnel header to be accounted for immediately and not after the PMTU
> probing period?

I think accounting for the tunnel header correctly early on is the
expected behavior.
Otherwise, if we choose to drop the big skb instead (whether
gso_segment errors out, the driver errors out, or the NIC drops it
from the tx queue), the first few packets of a TCP connection are
guaranteed to experience a timeout.

> 	Is this a problem only for IPVS tunnels?
> 	Do we see such delays with other tunnels?

Based on what I understand, a tun dev (set up by iproute2) should not
experience this issue because the MTU of that tun dev has already
accounted for the extra header. Our current workaround of using a
smaller MTU in the local route has a similar effect (in the sense
that it introduces a smaller MTU between the physical eth0 and the
TCP stack).

Thanks for looking into it!
Martin

> 	Maybe this should
> be solved for all protocols (not just TCP) and for all tunnels?
> 	Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
> the local sender, so we should really alter the gso_size for proper
> segmentation?
>
> Regards
>
> --
> Julian Anastasov