From mboxrd@z Thu Jan  1 00:00:00 1970
From: Martin KaFai Lau
Subject: Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
Date: Fri, 20 Apr 2018 16:13:13 -0700
Message-ID: <20180420231311.xissamn6go5fznua@kafai-mbp>
References: <20180419212324.1542504-1-kafai@fb.com>
To: Julian Anastasov
Cc: , Tom Herbert , Eric Dumazet , Nikita Shirokov ,

On Fri, Apr 20, 2018 at 11:43:59PM +0300, Julian Anastasov wrote:
>
> 	Hello,
>
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
>
> > This patch is not a proper fix and mainly serves for discussion purposes.
> > It is based on net-next, which I have been using to debug the issue.
> >
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > The other changes are the ripple effect in the function arguments.
> >
> > This bug was uncovered by one of our legacy services that is
> > still using ipvs for load balancing. In that setup,
> > ipvs encapsulates the ipv6-tcp packet in another ipv6 hdr
> > before tx-ing it out to eth0.
> >
> > The problem is that the kernel stack can pass a skb (which
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU. In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > to stall for tens of seconds, which is how the bug got uncovered.
> > (On the NIC side, the NIC firmware and driver have since been fixed
> > to avoid this tx queue stall after seeing such a skb.)
> >
> > On the kernel side, based on the commit log, this bug should have
> > been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").
>
> 	Before I go deeply into analyzing the GSO code, here
> are some preliminary observations:
>
> - IPVS started to use iptunnel_handle_offloads() around 2014,
> commit ea1d5d7755a3
>
> - later (2016) the "ip6ip6: Support for GSO/GRO" commits started
> to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
> I'm not sure if such a call can help. At least, I don't see what
> happens differently in IPVS compared to ip6ip6_tnl_xmit(), for example.
I don't see skb->inner_ipproto used in
ipv6_gso_segment() -> ipv6_gso_segment() -> tcp6_gso_segment()

One thing worth highlighting is that this problem sort of existed
even before commit 815d22e55b0e; it just happened that
ipv6_gso_segment() errored out (-EPROTONOSUPPORT), which stopped
the skb from being sent down to the driver.

> > Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> > out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> > Due to this error, the skb was never passed to the driver. The TCP
> > stack then timed out, and TCP mtu probing eventually kicked in to
> > lower the skb->len enough to avoid gso_segment.
> >
> > After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> > -> tcp6_gso_segment() segments the packet based on a mss
> > that does not account for the extra IPv6 hdr.
> >
> > Here is a stack from the WARN_ON() that we added to the driver to
> > capture the issue:
> > [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> > ...
> > [ 1129.016536] Call Trace:
> > [ 1129.021412]  ? skb_release_data+0xfc/0x120
> > [ 1129.029587]  ? kfree_skbmem+0x64/0x70
> > [ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
> > [ 1129.045262]  sch_direct_xmit+0x10f/0x280
> > [ 1129.053111]  __qdisc_run+0x223/0x5a0
> > [ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
> > [ 1129.068268]  dev_queue_xmit+0x10/0x20
> > [ 1129.075573]  ? dev_queue_xmit+0x10/0x20
> > [ 1129.083218]  ip6_finish_output2+0x2db/0x490
> > [ 1129.091573]  ip6_finish_output+0x125/0x190
> > [ 1129.099754]  ip6_output+0x5f/0x100
> > [ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
> > [ 1129.114212]  ip6_local_out+0x35/0x40
> > [ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> > [ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> > [ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> > [ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
> > [ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
> > [ 1129.188849]  nf_hook_slow+0x3e/0xc0
> > [ 1129.195800]  ip6_xmit+0x30b/0x540
> > [ 1129.202421]  ? ac6_proc_exit+0x20/0x20
> > [ 1129.209909]  inet6_csk_xmit+0x82/0xd0
> > [ 1129.217207]  ? lock_timer_base+0x76/0xa0
> > [ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
> > [ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
> > [ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
> > [ 1129.250106]  tcp_push+0xde/0x100
> > [ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
> > [ 1129.264910]  tcp_sendmsg+0x2c/0x50
> > [ 1129.271703]  inet_sendmsg+0x31/0xb0
> > [ 1129.278672]  sock_write_iter+0xf8/0x110
> > [ 1129.286335]  new_sync_write+0xd9/0x120
> > [ 1129.293823]  vfs_write+0x18d/0x1e0
> > [ 1129.300614]  SyS_write+0x48/0xa0
> > [ 1129.307045]  do_syscall_64+0x69/0x1e0
> > [ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > ...
> > [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> > [ 1129.657407] skb->len:1554 MTU:1522
> >
> > The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> >
> > [host-a] > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> >
> > [host-a] > ip -6 a
> > 1: lo: mtu 65536 state UNKNOWN qlen 1000
> >     inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
> >        valid_lft forever preferred_lft forever
> > 2: eth0: mtu 1500 state UP qlen 1000
> >     inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
> >        valid_lft forever preferred_lft forever
> >
> > [host-a] > cat /proc/net/ip_vs
> > TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
> >   -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9  6
> >   -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8  6
> >   -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19 7
> >
> > [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> > send-something-long-here-to-trigger-the-bug
> >
> > Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> > can also sidestep the issue. Like this:
>
> 	Yes, reducing the MTU was a solution recommended
> long ago in the IPVS HOWTO.
> >
> > > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> >
> > Signed-off-by: Martin KaFai Lau
> > ---
> >  net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
> >  1 file changed, 33 insertions(+), 16 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 11c416f3d6e3..88cc0d53ebce 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
> >  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
> >  }
> >
> > -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> > +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
> >  					  int rt_mode,
> >  					  struct ip_vs_iphdr *ipvsh,
> >  					  struct sk_buff *skb, int mtu)
> >  {
> > +	struct netns_ipvs *ipvs = cp->ipvs;
> > +
> >  #ifdef CONFIG_IP_VS_IPV6
> > -	if (skb_af == AF_INET6) {
> > +	if (cp->af == AF_INET6) {
> >  		struct net *net = ipvs->net;
> >
> >  		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> > @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> >  		}
> >  	}
> >
> > +	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> > +		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> > +		unsigned short hdr_len = (th->doff << 2) +
> > +			skb_network_header_len(skb);
> > +
> > +		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> > +			skb_decrease_gso_size(skb_shinfo(skb),
> > +					      skb_shinfo(skb)->gso_size -
> > +					      (mtu - hdr_len));
>
> 	So, software segmentation happens and we want the
> tunnel header to be accounted for immediately and not after the PMTU
> probing period?

I think accounting for the tunnel header correctly early on is the
expected behavior.
Otherwise, if we choose to drop the big skb instead (whether
gso_segment errors out, the driver errors out, or the NIC drops it
from the tx queue), the first few packets of a TCP connection are
guaranteed to experience a timeout.

> 	Is this a problem only for IPVS tunnels?
> 	Do we see such delays with other tunnels?

Based on what I understand, a tun dev (set up by iproute2) should not
experience this issue because the MTU of that tun dev has already
accounted for the extra header. Our current workaround of using a
smaller MTU in the local route has a similar effect (in the sense
that it introduces a smaller MTU between the physical eth0 and the
TCP stack).

Thanks for looking into it!
Martin

> 	Maybe this should
> be solved for all protocols (not just TCP) and for all tunnels?
> 	Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
> the local sender, so we should really alter the gso_size for proper
> segmentation?
>
> Regards
>
> --
> Julian Anastasov