* [net] ipv4: Fix broken PMTUD when using L4 multipath hash
       [not found] <20231012005721.2742-2-nalramli@fastly.com>
@ 2023-10-12 23:40 ` Nabil S. Alramli
  2023-10-13 16:19   ` David Ahern
  0 siblings, 1 reply; 5+ messages in thread
From: Nabil S. Alramli @ 2023-10-12 23:40 UTC (permalink / raw)
  To: sbhogavilli, davem, dsahern, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: srao, dev

From: Suresh Bhogavilli <sbhogavilli@fastly.com>

On a node with multiple network interfaces, if we enable the layer 4
hash policy with net.ipv4.fib_multipath_hash_policy=1, path MTU
discovery is broken and the TCP connection does not make progress unless
the incoming ICMP Fragmentation Needed (type 3, code 4) message is
received on the egress interface of the nexthop selected for the socket.

This is because build_sk_flow_key() does not provide the sport and dport
from the socket when calling flowi4_init_output(). This appears to be a
copy/paste error from the build_skb_flow_key() -> __build_flow_key() ->
flowi4_init_output() call used for packet forwarding, where an skb is
present, is passed later to fib_multipath_hash(), and both sport and
dport can be scraped out of the skb when the L4 hash policy is in use.

In the socket write case, fib_multipath_hash() does not get an skb, so
it expects the fl4 to have sport and dport populated when L4 hashing is
in use. Not populating them results in creating a nexthop exception
entry against a nexthop that may not be the one used by the socket. The
exception is therefore not matched later when inet_csk_rebuild_route()
is called to update the cached dst entry in the socket, so TCP does not
lower its MSS and the connection does not make progress.

Fix this by providing the source and destination ports to the
flowi4_init_output() call in build_sk_flow_key().
Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: Suresh Bhogavilli <sbhogavilli@fastly.com>
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e2bf4602b559..2517eb12b7ef 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -557,7 +557,8 @@ static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
 			   inet_test_bit(HDRINCL, sk) ?
 				IPPROTO_RAW : sk->sk_protocol,
 			   inet_sk_flowi_flags(sk),
-			   daddr, inet->inet_saddr, 0, 0, sk->sk_uid);
+			   daddr, inet->inet_saddr, inet->inet_dport, inet->inet_sport,
+			   sk->sk_uid);
 	rcu_read_unlock();
 }
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread
* Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
  2023-10-12 23:40 ` [net] ipv4: Fix broken PMTUD when using L4 multipath hash Nabil S. Alramli
@ 2023-10-13 16:19   ` David Ahern
  2023-10-16 18:51     ` Nabil S. Alramli
  0 siblings, 1 reply; 5+ messages in thread
From: David Ahern @ 2023-10-13 16:19 UTC (permalink / raw)
  To: Nabil S. Alramli, sbhogavilli, davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: srao, dev

On 10/12/23 5:40 PM, Nabil S. Alramli wrote:
> From: Suresh Bhogavilli <sbhogavilli@fastly.com>
>
> On a node with multiple network interfaces, if we enable layer 4 hash
> policy with net.ipv4.fib_multipath_hash_policy=1, path MTU discovery is
> broken and TCP connection does not make progress unless the incoming
> ICMP Fragmentation Needed (type 3, code 4) message is received on the
> egress interface of selected nexthop of the socket.

known problem.

> This is because build_sk_flow_key() does not provide the sport and dport
> from the socket when calling flowi4_init_output(). This appears to be a
> copy/paste error of build_skb_flow_key() -> __build_flow_key() ->
> flowi4_init_output() call used for packet forwarding where an skb is
> present, is passed later to fib_multipath_hash() call, and can scrape
> out both sport and dport from the skb if L4 hash policy is in use.

are you sure?

As I recall the problem is that the ICMP can be received on a different
path. When it is processed, the exception is added to the ingress device
of the ICMP and not the device the original packet egressed. I have
scripts that somewhat reliably reproduced the problem; I started working
on a fix and got distracted.

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
  2023-10-13 16:19 ` David Ahern
@ 2023-10-16 18:51   ` Nabil S. Alramli
  2024-02-09 17:11     ` Suresh Bhogavilli
  0 siblings, 1 reply; 5+ messages in thread
From: Nabil S. Alramli @ 2023-10-16 18:51 UTC (permalink / raw)
  To: David Ahern, sbhogavilli, davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: jdamato, srao, dev

Hi David,

Thank you for your quick response.

On 10/13/2023 12:19 PM, David Ahern wrote:
> On 10/12/23 5:40 PM, Nabil S. Alramli wrote:
>> From: Suresh Bhogavilli <sbhogavilli@fastly.com>
>>
>> On a node with multiple network interfaces, if we enable layer 4 hash
>> policy with net.ipv4.fib_multipath_hash_policy=1, path MTU discovery is
>> broken and TCP connection does not make progress unless the incoming
>> ICMP Fragmentation Needed (type 3, code 4) message is received on the
>> egress interface of selected nexthop of the socket.
>
> known problem.
>
>> This is because build_sk_flow_key() does not provide the sport and dport
>> from the socket when calling flowi4_init_output(). This appears to be a
>> copy/paste error of build_skb_flow_key() -> __build_flow_key() ->
>> flowi4_init_output() call used for packet forwarding where an skb is
>> present, is passed later to fib_multipath_hash() call, and can scrape
>> out both sport and dport from the skb if L4 hash policy is in use.
>
> are you sure?
>
> As I recall the problem is that the ICMP can be received on a different
> path. When it is processed, the exception is added to the ingress device
> of the ICMP and not the device the original packet egressed. I have
> scripts that somewhat reliably reproduced the problem; I started working
> on a fix and got distracted.
With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when an
ICMP Packet Too Big (PTB) message is received on an interface different
from the socket's egress interface, we see a cache entry added to the
ICMP ingress interface, but with an MTU matching the route entry rather
than the one reported in the ICMP message.

On the node below, ICMP PTB messages arrive on an interface named
vlan100. With net.ipv4.fib_multipath_hash_policy=0 (layer 3 hashing),
the path from this node to 139.162.188.91 is via another interface
named vlan200.

When the ICMP PTB message arrives on vlan100, an exception entry does
get added to vlan200 and the socket's cached MTU gets updated too. The
TCP connection makes progress (not shown).

sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache expires 363sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache expires 363sec mtu 905 advmss 1460

With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when TCP
traffic egresses over vlan200 (with the ICMP PTB message still arriving
on vlan100), the cache entry still shows an MTU of 1500 on the TCP
egress interface vlan200. No exception entry gets added to vlan100 as
you noted:

sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache mtu 1500 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache mtu 1500 advmss 1460

In this case, the TCP connection does not make progress, ultimately
timing out.
If we retry TCP connections until one uses vlan100 to egress, then the
exception entry does get added, with an MTU matching the one reported in
the ICMP PTB message:

sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
    cache expires 153sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache mtu 1500 advmss 1460

In this case the TCP connection over vlan100 does make progress.

With the proposed patch applied, an exception entry does get created on
the socket's egress interface even when that is different from the ICMP
PTB ingress interface. Below is the output after different TCP
connections have used the two interfaces this node has:

sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
    cache expires 565sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache expires 562sec mtu 905 advmss 1460

Thank you.

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
  2023-10-16 18:51 ` Nabil S. Alramli
@ 2024-02-09 17:11   ` Suresh Bhogavilli
  2024-02-09 22:27     ` David Ahern
  0 siblings, 1 reply; 5+ messages in thread
From: Suresh Bhogavilli @ 2024-02-09 17:11 UTC (permalink / raw)
  To: Nabil S. Alramli, David Ahern, davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: jdamato, srao, dev

Hi David,

On 10/16/23 2:51 PM, Nabil S. Alramli wrote:
> On 10/13/2023 12:19 PM, David Ahern wrote:
>>> On a node with multiple network interfaces, if we enable layer 4 hash
>>> policy with net.ipv4.fib_multipath_hash_policy=1, path MTU discovery is
>>> broken and TCP connection does not make progress unless the incoming
>>> ICMP Fragmentation Needed (type 3, code 4) message is received on the
>>> egress interface of selected nexthop of the socket.
>> known problem.
>>
>>> This is because build_sk_flow_key() does not provide the sport and dport
>>> from the socket when calling flowi4_init_output(). This appears to be a
>>> copy/paste error of build_skb_flow_key() -> __build_flow_key() ->
>>> flowi4_init_output() call used for packet forwarding where an skb is
>>> present, is passed later to fib_multipath_hash() call, and can scrape
>>> out both sport and dport from the skb if L4 hash policy is in use.
>> are you sure?
>>
>> As I recall the problem is that the ICMP can be received on a different
>> path. When it is processed, the exception is added to the ingress device
>> of the ICMP and not the device the original packet egressed. I have
>> scripts that somewhat reliably reproduced the problem; I started working
>> on a fix and got distracted.
> With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when an
> ICMP packet too big (PTB) message is received on an interface different
> from the socket egress interface, we see a cache entry added to the
> ICMP ingress interface but with parameters matching the route entry
> rather than the MTU reported in the ICMP message.
>
> On the below node, ICMP PTB messages arrive on an interface named
> vlan100. With net.ipv4.fib_multipath_hash_policy=0 - layer3 hashing -
> the path from this node to 139.162.188.91 is via another interface
> named vlan200.
>
> When the ICMP PTB message arrives on vlan100, an exception entry does
> get added to vlan200 and the socket's cached mtu gets updated too. TCP
> connection makes progress (not shown).
>
> sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache expires 363sec mtu 905 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache expires 363sec mtu 905 advmss 1460
>
> With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when TCP
> traffic egresses over vlan200 (with ICMP PTB message arriving on vlan100
> still), the cache entry still shows mtu of 1500 on the TCP egress
> interface of vlan200. No exception entry gets added to vlan100 as you noted:
>
> sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache mtu 1500 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache mtu 1500 advmss 1460
>
> In this case, the TCP connection does not make progress, ultimately
> timing out.
>
> If we retry TCP connections until one uses vlan100 to egress, then the
> exception entry does get added with an MTU matching those reported in
> the ICMP PTB message:
>
> sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
>     cache expires 153sec mtu 905 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache mtu 1500 advmss 1460
>
> In this case the TCP connection over vlan100 does make progress.
>
> With the proposed patch applied, an exception entry does get created on
> the socket egress interface even when that is different from the ICMP
> PTB ingress interface. Below is the output after different TCP
> connections have used the two interfaces this node has:
>
> sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
>     cache expires 565sec mtu 905 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache expires 562sec mtu 905 advmss 1460
>
> Thank you.

Does that answer your question? Do you need me to make any changes to
get your Reviewed-by?

Best regards,
Suresh

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
  2024-02-09 17:11 ` Suresh Bhogavilli
@ 2024-02-09 22:27   ` David Ahern
  0 siblings, 0 replies; 5+ messages in thread
From: David Ahern @ 2024-02-09 22:27 UTC (permalink / raw)
  To: Suresh Bhogavilli, Nabil S. Alramli, davem, edumazet, kuba, pabeni, netdev, linux-kernel
  Cc: jdamato, srao, dev

On 2/9/24 10:11 AM, Suresh Bhogavilli wrote:
>
> Does that answer your question? Do you need me to make any changes to
> get your Reviewed-by?

Re-send the patch, enhancing the commit message with what happens for
the different policies. It would be good to get a unit test added to
tools/testing/selftests/net/pmtu.sh (IPv4 and IPv6).

^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-02-09 22:27 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20231012005721.2742-2-nalramli@fastly.com>
2023-10-12 23:40 ` [net] ipv4: Fix broken PMTUD when using L4 multipath hash Nabil S. Alramli
2023-10-13 16:19   ` David Ahern
2023-10-16 18:51     ` Nabil S. Alramli
2024-02-09 17:11       ` Suresh Bhogavilli
2024-02-09 22:27         ` David Ahern