From mboxrd@z Thu Jan 1 00:00:00 1970 From: Julian Anastasov Subject: Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP Date: Wed, 2 May 2018 09:38:43 +0300 (EEST) Message-ID: References: <20180419212324.1542504-1-kafai@fb.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Cc: netdev@vger.kernel.org, Tom Herbert , Eric Dumazet , Nikita Shirokov , kernel-team@fb.com, lvs-devel@vger.kernel.org To: Martin KaFai Lau Return-path: Received: from ja.ssi.bg ([178.16.129.10]:57100 "EHLO ja.ssi.bg" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1750895AbeEBGiy (ORCPT ); Wed, 2 May 2018 02:38:54 -0400 In-Reply-To: <20180419212324.1542504-1-kafai@fb.com> Sender: netdev-owner@vger.kernel.org List-ID: Hello, On Thu, 19 Apr 2018, Martin KaFai Lau wrote: > This patch is not a proper fix and mainly serves for discussion purpose. > It is based on net-next which I have been using to debug the issue. > > The change that works around the issue is in ensure_mtu_is_adequate(). > Other changes are the rippling effect in function arg. > > This bug was uncovered by one of our legacy service that > are still using ipvs for load balancing. In that setup, > the ipvs encap the ipv6-tcp packet in another ipv6 hdr > before tx it out to eth0. > > The problem is the kernel stack could pass a skb (which was > originated from a sys_write(tcp_fd)) to the driver with skb->len > bigger than the device MTU. In one NIC setup (with gso and tso off) > that we are using, it upset the NIC/driver and caused the tx queue > stalled for tens of seconds which is how it got uncovered. > (On the NIC side, the NIC firmware and driver have been fixed > to avoid this tx queue stall after seeing this skb). In the last week I analyzed the situation and found that just changes in route.c are able to solve the problems, at 99% :) I'm posting a separate patch for this. Here is what happens, I'm testing with FTP which starts with connection to port 21 and then with data connection, from local ftp client to local Virtual IP (forwarded to remote real server via tunnel). - TCP creates local route which after commit 839da4d98960 appears to be non-cached, before this commit it is cached. Route is saved and reused. - initial traffic for port 21 does not use GSO. But after every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu) to report the reduced MTU. These updates are stored in fnhe_pmtu but they do not go to any route, even if we try to get fresh output route. Why? Because the local routes are not cached, so they can not use the fnhe. This is what my patch for route.c will fix. With this fix FTP-DATA gets route with reduced PMTU. - later, there are large GSO packets in the FTP-DATA connection. Currently, IPVS does not send ICMP error for exceeded PMTU by GSO packets (this will be fixed, see patch below). Even if they reach TCP and it refreshes its route, TCP can not get any actual PMTU values from the routing, so continues to use the large gso_size without the patch for route.c For the changes in IPVS that I show below as RFC: - I synchronized the PMTU checks in __mtu_check_toobig* with IPv4/IPv6 forwarding - I modified your changes, see ipvs_gso_hlen() and how I use it at start of ensure_mtu_is_adequate(), after skb is made writable, before the PMTU checks. In my tests, all variants work, just with skb_decrease_gso_size or even if I add SKB_GSO_DODGY+gso_segs=0. But I'm not sure if this is safe for the different GSO configurations. I'm analyzing what can be changed in route.c, so that dst.ops->update_pmtu changes to take effect for the provided route. If that is possible, it will allow the PMTU change to take immediate effect for the local routes. diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index 4527921..7a2a0a89 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -104,9 +104,32 @@ __ip_vs_dst_check(struct ip_vs_dest *dest) return dest_dst; } -static inline bool -__mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu) +/* Check if packet size violates MTU */ +static bool __mtu_check_toobig(const struct sk_buff *skb, u32 mtu) { + if (skb->len <= mtu) + return false; + + if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0)) + return false; + + if (IPCB(skb)->frag_max_size) { + /* frag_max_size tell us that, this packet have been + * defragmented by netfilter IPv4 conntrack module. + */ + if (IPCB(skb)->frag_max_size > mtu) + return true; /* largest fragment violate MTU */ + } + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) + return false; + return true; +} + +/* Check if packet size violates MTU */ +static bool __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu) +{ + if (skb->len <= mtu) + return false; if (IP6CB(skb)->frag_max_size) { /* frag_max_size tell us that, this packet have been * defragmented by netfilter IPv6 conntrack module. @@ -114,10 +137,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu) if (IP6CB(skb)->frag_max_size > mtu) return true; /* largest fragment violate MTU */ } - else if (skb->len > mtu && !skb_is_gso(skb)) { - return true; /* Packet size violate MTU size */ - } - return false; + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) + return false; + return true; } /* Get route to daddr, update *saddr, optionally bind route to saddr */ @@ -212,11 +234,42 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu) ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu); } +/* GSO: length of network and transport headers, 0=unsupported */ +static unsigned short ipvs_gso_hlen(struct sk_buff *skb, + struct ip_vs_iphdr *ipvsh) +{ + unsigned short hlen = ipvsh->len - ipvsh->off; + + if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) { + struct tcphdr _tcph, *th; + + th = skb_header_pointer(skb, ipvsh->len, sizeof(_tcph), &_tcph); + if (th) + return hlen + (th->doff << 2); + } + return 0; +} + static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af, int rt_mode, struct ip_vs_iphdr *ipvsh, struct sk_buff *skb, int mtu) { + /* Re-segment traffic from local clients */ + if (!skb->dev && skb_is_gso(skb)) { + unsigned short hlen = ipvs_gso_hlen(skb, ipvsh); + + if (hlen && mtu - hlen < skb_shinfo(skb)->gso_size && + mtu > hlen) { + u16 reduce = skb_shinfo(skb)->gso_size - (mtu - hlen); + + IP_VS_DBG(10, "Reducing gso_size=%u with %u\n", + skb_shinfo(skb)->gso_size, reduce); + skb_decrease_gso_size(skb_shinfo(skb), reduce); + skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; + skb_shinfo(skb)->gso_segs = 0; + } + } #ifdef CONFIG_IP_VS_IPV6 if (skb_af == AF_INET6) { struct net *net = ipvs->net; @@ -240,9 +293,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af, if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs)) return true; - if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) && - skb->len > mtu && !skb_is_gso(skb) && - !ip_vs_iph_icmp(ipvsh))) { + if (unlikely(__mtu_check_toobig(skb, mtu))) { icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); IP_VS_DBG(1, "frag needed for %pI4\n", Regards