netdev.vger.kernel.org archive mirror
From: "Nabil S. Alramli" <nalramli@fastly.com>
To: David Ahern <dsahern@kernel.org>,
	sbhogavilli@fastly.com, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: jdamato@fastly.com, srao@fastly.com, dev@nalramli.com
Subject: Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
Date: Mon, 16 Oct 2023 14:51:21 -0400
Message-ID: <4be64c29-f495-4fdb-a565-2540745d5412@fastly.com>
In-Reply-To: <e18c52e8-116e-f258-7f2c-030a80e88343@kernel.org>

Hi David,

Thank you for your quick response.

On 10/13/2023 12:19 PM, David Ahern wrote:
> On 10/12/23 5:40 PM, Nabil S. Alramli wrote:
>> From: Suresh Bhogavilli <sbhogavilli@fastly.com>
>>
>> On a node with multiple network interfaces, if we enable layer 4 hash
>> policy with net.ipv4.fib_multipath_hash_policy=1, path MTU discovery is
>> broken and TCP connection does not make progress unless the incoming
>> ICMP Fragmentation Needed (type 3, code 4) message is received on the
>> egress interface of selected nexthop of the socket.
> 
> known problem.
> 
>>
>> This is because build_sk_flow_key() does not provide the sport and dport
>> from the socket when calling flowi4_init_output(). This appears to be a
>> copy/paste error of build_skb_flow_key() -> __build_flow_key() ->
>> flowi4_init_output() call used for packet forwarding where an skb is
>> present, is passed later to fib_multipath_hash() call, and can scrape
>> out both sport and dport from the skb if L4 hash policy is in use.
> 
> are you sure?
> 
> As I recall the problem is that the ICMP can be received on a different
> path. When it is processed, the exception is added to the ingress device
> of the ICMP and not the device the original packet egressed. I have
> scripts that somewhat reliably reproduced the problem; I started working
> on a fix and got distracted.

With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when an
ICMP packet too big (PTB) message is received on an interface different
from the socket egress interface, we see a cache entry added to the
ICMP ingress interface but with parameters matching the route entry
rather than the MTU reported in the ICMP message.
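
To illustrate why the ports matter for nexthop selection, here is a rough
userspace sketch in C. It is not the kernel's code: the flow_key struct, the
function names, and the FNV-1a hash are all stand-ins chosen for illustration.
It only shows that, under a layer 4 style hash, a lookup key with zeroed ports
can map to a different nexthop than the key the socket's traffic actually
hashed to, which is the situation the patch description attributes to
build_sk_flow_key().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key: illustrative fields, not the kernel's struct flowi4. */
struct flow_key {
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
    uint8_t  proto;
};

/* Fold one 32-bit word into an FNV-1a hash, byte by byte (toy stand-in hash). */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h;
}

/* Layer 4 style hash: addresses, ports and protocol all feed the result. */
static uint32_t l4_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;

    h = fnv1a_mix(h, k->saddr);
    h = fnv1a_mix(h, k->daddr);
    h = fnv1a_mix(h, ((uint32_t)k->sport << 16) | k->dport);
    h = fnv1a_mix(h, k->proto);
    return h;
}

/* Map a flow onto one of `nexthops` equal-weight paths. */
static unsigned int select_nexthop(const struct flow_key *k, unsigned int nexthops)
{
    assert(nexthops > 0);
    return l4_hash(k) % nexthops;
}

/*
 * Count flows whose nexthop changes when the lookup key omits the ports,
 * for `count` consecutive source ports starting at `first_sport`.
 */
static unsigned int count_mismatches(struct flow_key base, uint16_t first_sport,
                                     unsigned int count, unsigned int nexthops)
{
    unsigned int mismatches = 0;

    for (unsigned int i = 0; i < count; i++) {
        struct flow_key full = base;
        struct flow_key no_ports = base;

        full.sport = (uint16_t)(first_sport + i);
        no_ports.sport = 0;
        no_ports.dport = 0;
        if (select_nexthop(&full, nexthops) != select_nexthop(&no_ports, nexthops))
            mismatches++;
    }
    return mismatches;
}
```

Calling count_mismatches() with two nexthops over a range of source ports
reports how many flows would have an exception recorded against a path other
than the one they egress on. With a single nexthop the count is always zero,
which is consistent with the problem only appearing on multipath routes.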

On the node below, ICMP PTB messages arrive on an interface named
vlan100. With net.ipv4.fib_multipath_hash_policy=0 (layer 3 hashing),
the path from this cache to 139.162.188.91 is via another interface
named vlan200.

When the ICMP PTB message arrives on vlan100, an exception entry does
get added to vlan200 and the socket's cached MTU gets updated too. The
TCP connection makes progress (not shown).

sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
  cache expires 363sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
  cache expires 363sec mtu 905 advmss 1460

With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when TCP
traffic egresses over vlan200 (with the ICMP PTB message still arriving
on vlan100), the cache entry still shows an MTU of 1500 on the TCP
egress interface, vlan200. No exception entry gets added to vlan100, as
you noted:

sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
  cache mtu 1500 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
  cache mtu 1500 advmss 1460

In this case, the TCP connection does not make progress, ultimately
timing out.

If we retry TCP connections until one uses vlan100 to egress, then the
exception entry does get added with an MTU matching the one reported in
the ICMP PTB message:

sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
  cache expires 153sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
  cache mtu 1500 advmss 1460

In this case the TCP connection over vlan100 does make progress.

With the proposed patch applied, an exception entry does get created on
the socket egress interface even when that is different from the ICMP
PTB ingress interface. Below is the output after different TCP
connections have used the two interfaces this node has:

sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
  cache expires 565sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
  cache expires 562sec mtu 905 advmss 1460

Thank you.

Thread overview: 5+ messages
     [not found] <20231012005721.2742-2-nalramli@fastly.com>
2023-10-12 23:40 ` [net] ipv4: Fix broken PMTUD when using L4 multipath hash Nabil S. Alramli
2023-10-13 16:19   ` David Ahern
2023-10-16 18:51     ` Nabil S. Alramli [this message]
2024-02-09 17:11       ` Suresh Bhogavilli
2024-02-09 22:27         ` David Ahern
