* PMTUD broken inside network namespace with multipath routing
From: mastertheknife @ 2020-08-03 11:14 UTC
  To: netdev

Hi,

I have observed that PMTUD (Path MTU Discovery) is broken when using
multipath routing inside a network namespace. This breaks TCP, because
it keeps trying to send oversized packets.
Observed on kernel 5.4.44; other kernels weren't tested. However, I
went through net/ipv4/route.c and haven't spotted changes in this
area, so I believe this bug is still there.

Host test with multipath routing:
---------------------------------
root@host1:~# ip route add 192.168.247.100/32 dev vmbr2 \
    nexthop via 192.168.252.250 dev vmbr2 nexthop via 192.168.252.252 dev vmbr2
root@host1:~# ip route | grep -A2 192.168.247.100
192.168.247.100
 nexthop via 192.168.252.250 dev vmbr2 weight 1
 nexthop via 192.168.252.252 dev vmbr2 weight 1
root@host1:~# ping -M do -s 1380 192.168.247.100
PING 192.168.247.100 (192.168.247.100) 1380(1408) bytes of data.
From 192.168.252.252 icmp_seq=1 Frag needed and DF set (mtu = 1406)
ping: local error: Message too long, mtu=1406
ping: local error: Message too long, mtu=1406
ping: local error: Message too long, mtu=1406
ping: local error: Message too long, mtu=1406
^C
--- 192.168.247.100 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 80ms
root@host1:~# ip route get 192.168.247.100
192.168.247.100 via 192.168.252.250 dev vmbr2 src 192.168.252.15 uid 0
    cache expires 583sec mtu 1406
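
A side note on the numbers above: ping reports 1380(1408) because the
1380-byte payload is carried behind an 8-byte ICMP echo header and a
20-byte IPv4 header (assuming no IP options), so the packet is 1408 bytes
and exceeds the reported MTU of 1406. A minimal sketch of that arithmetic
follows; the function names are made up for illustration.

```shell
# Hypothetical helpers; assumes IPv4 without options (20 bytes) plus
# the 8-byte ICMP echo header.
icmp_packet_size() {
    # $1 = ping payload size (the value given to -s)
    echo $(( $1 + 8 + 20 ))
}

max_ping_payload() {
    # $1 = path MTU
    echo $(( $1 - 8 - 20 ))
}

icmp_packet_size 1380
max_ping_payload 1406
```

With the values from this test, the first call prints 1408 and the second
prints 1378, so 1378 would be the largest -s value that fits an MTU of 1406.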

LXC container inside that host with multipath routing:
------------------------------------------------------
[root@lxctest ~]# ip route add 192.168.247.100/32 dev eth0 \
    nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0
[root@lxctest ~]# ip route
default via 192.168.252.100 dev eth0 proto static metric 100
192.168.247.100
 nexthop via 192.168.252.250 dev eth0 weight 1
 nexthop via 192.168.252.252 dev eth0 weight 1
192.168.252.0/24 dev eth0 proto kernel scope link src 192.168.252.207 metric 100
[root@lxctest ~]# ping -M do -s 1380 192.168.247.100
PING 192.168.247.100 (192.168.247.100) 1380(1408) bytes of data.
From 192.168.252.252 icmp_seq=1 Frag needed and DF set (mtu = 1406)
From 192.168.252.252 icmp_seq=2 Frag needed and DF set (mtu = 1406)
From 192.168.252.252 icmp_seq=3 Frag needed and DF set (mtu = 1406)
From 192.168.252.252 icmp_seq=4 Frag needed and DF set (mtu = 1406)
[root@lxctest ~]# ip route get 192.168.247.100
192.168.247.100 via 192.168.252.252 dev eth0 src 192.168.252.207 uid 0
    cache

LXC container inside that host with regular routing:
----------------------------------------------------
[root@lxctest ~]# ip route add 192.168.247.100/32 via 192.168.252.252 dev eth0
[root@lxctest ~]# ip route
default via 192.168.252.100 dev eth0 proto static metric 100
192.168.247.100 via 192.168.252.252 dev eth0
192.168.252.0/24 dev eth0 proto kernel scope link src 192.168.252.207 metric 100
[root@lxctest ~]# ping -M do -s 1380 192.168.247.100
PING 192.168.247.100 (192.168.247.100) 1380(1408) bytes of data.
From 192.168.252.252 icmp_seq=1 Frag needed and DF set (mtu = 1406)
ping: local error: Message too long, mtu=1406
ping: local error: Message too long, mtu=1406
ping: local error: Message too long, mtu=1406
^C
--- 192.168.247.100 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 82ms
[root@lxctest ~]# ip route get 192.168.247.100
192.168.247.100 via 192.168.252.252 dev eth0 src 192.168.252.207 uid 0
    cache expires 591sec mtu 1406


What seems to be happening is that when multipath routing is used
inside LXC (or any network namespace), the kernel doesn't create a
routing exception to enforce the lower MTU.
I believe this is a bug inside the kernel.
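
One way to check for the exception from a script is to look for an "mtu"
attribute in the 'ip route get' output, as in the transcripts above. The
sketch below only parses text on stdin; the function name cached_pmtu is
made up, and it assumes the output format shown in this mail.

```shell
# Hypothetical helper: print the cached path MTU found in
# `ip route get` output read from stdin. Exits non-zero when no
# "mtu" attribute is present (i.e. no PMTU exception is shown).
cached_pmtu() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i == "mtu") { print $(i + 1); found = 1 } }
         END { exit found ? 0 : 1 }'
}

# Possible usage (not run here):
#   ip route get 192.168.247.100 | cached_pmtu || echo "no exception"
```

For the working cases above this would print 1406; for the multipath case
inside the container, where the output ends in a bare "cache" line, it
would print nothing and return a non-zero status.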


Kfir Itzhak

* Re: PMTUD broken inside network namespace with multipath routing
From: David Ahern @ 2020-08-03 13:32 UTC
  To: mastertheknife, netdev

On 8/3/20 5:14 AM, mastertheknife wrote:
> What seems to be happening, is that when multipath routing is used
> inside LXC (or any network namespace), the kernel doesn't generate a
> routing exception to force the lower MTU.
> I believe this is a bug inside the kernel.
> 

Known problem. The original message can take path 1 and the ICMP message
can take path 2. The exception is then created on the wrong path.

* Re: PMTUD broken inside network namespace with multipath routing
From: mastertheknife @ 2020-08-03 14:24 UTC
  To: David Ahern; +Cc: netdev

Hi David,

In this case, both paths are in the same layer 2 network; there is no
asymmetric multipath routing.
If the original message takes path 1, the ICMP response will come from path 1.
If the original message takes path 2, the ICMP response will come from path 2.
Also, it works fine outside of LXC.


Thank you,
Kfir

* Re: PMTUD broken inside network namespace with multipath routing
From: David Ahern @ 2020-08-03 15:38 UTC
  To: mastertheknife; +Cc: netdev

On 8/3/20 8:24 AM, mastertheknife wrote:
> Hi David,
> 
> In this case, both paths are in the same layer2 network, there is no
> symmetric multi-path routing.
> If original message takes path 1, ICMP response will come from path 1
> If original message takes path 2, ICMP response will come from path 2
> Also, It works fine outside of LXC.
> 
> 

I'll take a look when I get some time; most likely end of the week.

* Re: PMTUD broken inside network namespace with multipath routing
From: mastertheknife @ 2020-08-03 18:39 UTC
  To: David Ahern; +Cc: netdev

Hi David,

I found something that can shed some light on the issue.
The issue only happens if the ICMP response doesn't come from the first nexthop.
In my case, both nexthops are Linux routers, and they are the ones
generating the ICMP (because of IPSEC next). This is what I meant
earlier: the ICMP path is identical to the original message path.

Test IP #1 - 192.168.249.116 - Hash will choose nexthop #1
Test IP #2 - 192.168.249.117 - Hash will choose nexthop #2

Test with 252.250 as nexthop #1:
--------------------------------
root@lxctest:[~] # ip route add 192.168.249.0/24 dev eth1 \
    nexthop via 192.168.252.250 dev eth1 nexthop via 192.168.252.252 dev eth1
root@lxctest:[~] # ping -M do -s 1450 192.168.249.116
PING 192.168.249.116 (192.168.249.116) 1450(1478) bytes of data.
From 192.168.252.250 icmp_seq=1 Frag needed and DF set (mtu = 1446)
ping: local error: Message too long, mtu=1446
ping: local error: Message too long, mtu=1446
ping: local error: Message too long, mtu=1446
^C
--- 192.168.249.116 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3067ms
root@testlxc:[~] # ping -M do -s 1450 192.168.249.117
PING 192.168.249.117 (192.168.249.117) 1450(1478) bytes of data.
From 192.168.252.252 icmp_seq=1 Frag needed and DF set (mtu = 1446)
From 192.168.252.252 icmp_seq=2 Frag needed and DF set (mtu = 1446)
From 192.168.252.252 icmp_seq=3 Frag needed and DF set (mtu = 1446)
^C
--- 192.168.249.117 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2052ms

Test with 252.252 as nexthop #1:
--------------------------------
root@testlxc:[~] # ip route add 192.168.249.0/24 dev eth1 \
    nexthop via 192.168.252.252 dev eth1 nexthop via 192.168.252.250 dev eth1
root@testlxc:[~] # ping -M do -s 1450 192.168.249.116
PING 192.168.249.116 (192.168.249.116) 1450(1478) bytes of data.
From 192.168.252.252 icmp_seq=1 Frag needed and DF set (mtu = 1446)
ping: local error: Message too long, mtu=1446
ping: local error: Message too long, mtu=1446
^C
--- 192.168.249.116 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2044ms
root@testlxc:[~] # ping -M do -s 1450 192.168.249.117
PING 192.168.249.117 (192.168.249.117) 1450(1478) bytes of data.
From 192.168.252.250 icmp_seq=1 Frag needed and DF set (mtu = 1446)
From 192.168.252.250 icmp_seq=2 Frag needed and DF set (mtu = 1446)
From 192.168.252.250 icmp_seq=3 Frag needed and DF set (mtu = 1446)
^C
--- 192.168.249.117 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2046ms

In summary: it seems that it doesn't matter which router is listed as
the first nexthop. If the ICMP response isn't from the first nexthop,
it'll be rejected.
As for why I couldn't reproduce this outside LXC, I don't know yet, but
I will keep trying to figure this out.

Let me know if you need me to test this.
Thank you,
Kfir Itzhak

* Re: PMTUD broken inside network namespace with multipath routing
From: David Ahern @ 2020-08-10 22:13 UTC
  To: mastertheknife; +Cc: netdev

On 8/3/20 12:39 PM, mastertheknife wrote:
> In summary: It seems that it doesn't matter who is the nexthop. If the
> ICMP response isn't from the nexthop, it'll be rejected.
> About why i couldn't reproduce this outside LXC, i don't know yet but
> i will keep trying to figure this out.

Do you have a shell script that reproduces the problem?

* Re: PMTUD broken inside network namespace with multipath routing
From: mastertheknife @ 2020-08-12 12:37 UTC
  To: David Ahern; +Cc: netdev

Hello David,

I tried, and it seems I can reproduce it:

# Create test NS
root@host:~# ip netns add testns
# Create veth pair, veth0 in host, veth1 in NS
root@host:~# ip link add veth0 type veth peer name veth1
root@host:~# ip link set veth1 netns testns
# Configure veth1 (NS)
root@host:~# ip netns exec testns ip addr add 192.168.252.209/24 dev veth1
root@host:~# ip netns exec testns ip link set dev veth1 up
root@host:~# ip netns exec testns ip route add default via 192.168.252.100
root@host:~# ip netns exec testns ip route add 192.168.249.0/24 \
    nexthop via 192.168.252.250 nexthop via 192.168.252.252
# Configure veth0 (host)
root@host:~# brctl addif vmbr2 veth0
root@host:~# ip link set veth0 up


# Tests
root@host:~# ip netns exec testns ping -M do -s 1450 192.168.249.116
PING 192.168.249.116 (192.168.249.116) 1450(1478) bytes of data.
From 192.168.252.252 icmp_seq=1 Frag needed and DF set (mtu = 1366)
ping: local error: Message too long, mtu=1366
ping: local error: Message too long, mtu=1366
ping: local error: Message too long, mtu=1366
^C
--- 192.168.249.116 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 81ms

root@host:~# ip netns exec testns ping -M do -s 1450 192.168.249.134
PING 192.168.249.134 (192.168.249.134) 1450(1478) bytes of data.
From 192.168.252.252 icmp_seq=1 Frag needed and DF set (mtu = 1366)
From 192.168.252.252 icmp_seq=2 Frag needed and DF set (mtu = 1366)
From 192.168.252.252 icmp_seq=3 Frag needed and DF set (mtu = 1366)
^C
--- 192.168.249.134 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 40ms

root@host:~# ip netns exec testns ip route show cache
192.168.249.116 via 192.168.252.250 dev veth1
    cache expires 584sec mtu 1366
192.168.249.134 via 192.168.252.250 dev veth1
    cache expires 593sec mtu 1366
root@host:~# ip netns exec testns ip route get 192.168.249.116
192.168.249.116 via 192.168.252.250 dev veth1 src 192.168.252.209 uid 0
    cache expires 578sec mtu 1366
root@host:~# ip netns exec testns ip route get 192.168.249.134
192.168.249.134 via 192.168.252.252 dev veth1 src 192.168.252.209 uid 0
    cache


Please notice that above, 'ip route show cache' and 'ip route get'
return different nexthops for 192.168.249.134. I suspect that may be
part of the problem.
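
The mismatch can also be checked from a script by pulling the "via"
address for one destination out of both outputs. This is only a sketch:
the function name nexthop_for is made up, it reads text on stdin, and it
assumes the output format shown above.

```shell
# Hypothetical helper: print the "via" gateway on the first stdin line
# whose first field is the given destination address.
nexthop_for() {
    awk -v dst="$1" '$1 == dst {
        for (i = 2; i < NF; i++)
            if ($i == "via") { print $(i + 1); exit }
    }'
}

# Possible usage (not run here):
#   dst=192.168.249.134
#   cached=$(ip netns exec testns ip route show cache | nexthop_for "$dst")
#   chosen=$(ip netns exec testns ip route get "$dst" | nexthop_for "$dst")
#   [ "$cached" = "$chosen" ] || echo "exception is on $cached, traffic uses $chosen"
```

With the outputs above, the cached entry for 192.168.249.134 points at
192.168.252.250 while the route lookup picks 192.168.252.252.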

Thank you,
Kfir

* Re: PMTUD broken inside network namespace with multipath routing
From: David Ahern @ 2020-08-12 19:21 UTC
  To: mastertheknife; +Cc: netdev

On 8/12/20 6:37 AM, mastertheknife wrote:
> Hello David,
> 
> I tried and it seems i can reproduce it:
> 
> # Create test NS
> root@host:~# ip netns add testns
> # Create veth pair, veth0 in host, veth1 in NS
> root@host:~# ip link add veth0 type veth peer name veth1
> root@host:~# ip link set veth1 netns testns
> # Configure veth1 (NS)
> root@host:~# ip netns exec testns ip addr add 192.168.252.209/24 dev veth1
> root@host:~# ip netns exec testns ip link set dev veth1 up
> root@host:~# ip netns exec testns ip route add default via 192.168.252.100
> root@host:~# ip netns exec testns ip route add 192.168.249.0/24
> nexthop via 192.168.252.250 nexthop via 192.168.252.252
> # Configure veth0 (host)
> root@host:~# brctl addif vmbr2 veth0

vmbr2's config is not defined.

ip li add vmbr2 type bridge
ip li set veth0 master vmbr2
ip link set veth0 up

Anything else, e.g., an address for vmbr2? What holds 192.168.252.250 and
192.168.252.252?

* Re: PMTUD broken inside network namespace with multipath routing
From: mastertheknife @ 2020-08-14 7:08 UTC
  To: David Ahern; +Cc: netdev

Hello David,

It's on a production system. vmbr2 is a bridge with an eth.X VLAN
interface inside for connectivity on that 252.0/24 network. vmbr2
has address 192.168.252.5 in that case.
192.168.252.250 and 192.168.252.252 are CentOS 8 LXCs on another host,
with libreswan inside for any/any IPsec tunnels with VTI interfaces.

Everything is kernel 5.4.44 LTS.

I wish I could fully reproduce all of it in a script, but I am not
sure how to create such hops that return this ICMP.

Thank you,
Kfir


* Re: PMTUD broken inside network namespace with multipath routing
From: mastertheknife @ 2020-09-01 10:40 UTC
  To: David Ahern; +Cc: netdev

Hello David.

I was able to solve it while troubleshooting a fragmentation issue.
The VTI interfaces had an MTU of 1480 by default. I reduced them to
the real path MTU (1366) and now it's all working just fine.
I am not sure how or why it's related, but it seems to have solved the issue.

P.S.: While reading the relevant code in the kernel, I think I spotted
a mistake in net/ipv4/route.c, in the function "update_or_create_fnhe".
It looks like it loops over all the exceptions for the nexthop entry,
but always overwrites the first (and only) entry, so effectively only
1 exception can exist per nexthop entry.
Line 678:
"if (fnhe) {"
Should probably be:
"if (fnhe && fnhe->fnhe_daddr == daddr) {"


Thank you for your efforts,
Kfir Itzhak

* Re: PMTUD broken inside network namespace with multipath routing
From: mastertheknife @ 2020-09-01 10:44 UTC
  To: David Ahern; +Cc: netdev

Hello David,

A quick correction: the issue is not solved; it was a mistake in my
testing. The issue is still there.


Kfir

* Re: PMTUD broken inside network namespace with multipath routing
From: David Ahern @ 2020-09-02 0:42 UTC
  To: mastertheknife; +Cc: netdev

On 9/1/20 4:40 AM, mastertheknife wrote:
> 
> P.S: while reading the relevant code in the kernel, i think i spotted
> some mistake in net/ipv4/route.c, in function "update_or_create_fnhe".
> It looks like it loops over all the exceptions for the nexthop entry,
> but always overwriting the first (and only) entry, so effectively only
> 1 exception can exist per nexthop entry.
> Line 678:
> "if (fnhe) {"
> Should probably be:
> "if (fnhe && fnhe->fnhe_daddr == daddr) {"
> 

Right above that line is:

        for (fnhe = rcu_dereference(hash->chain); fnhe;
             fnhe = rcu_dereference(fnhe->fnhe_next)) {
                if (fnhe->fnhe_daddr == daddr)
                        break;
                depth++;
        }

so fnhe is set based on daddr match.
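
The effect of that loop can be illustrated with a toy model in shell. This
is not kernel code: the chain is simply a list of destination addresses,
the function name find_fnhe is made up, and the empty string stands in for
the NULL pointer that fnhe holds when the chain is exhausted without a
match.

```shell
# Toy model of the chain walk: $1 is the address to look up, the
# remaining arguments are the destination addresses on the chain.
find_fnhe() {
    daddr=$1
    shift
    fnhe=""          # stands in for NULL
    depth=0
    for entry in "$@"; do
        if [ "$entry" = "$daddr" ]; then
            fnhe=$entry
            break
        fi
        depth=$((depth + 1))
    done
    # Mirrors the "if (fnhe) {" test that follows the loop.
    if [ -n "$fnhe" ]; then
        echo "update existing exception for $fnhe"
    else
        echo "create new exception for $daddr (chain depth $depth)"
    fi
}

find_fnhe 192.168.249.134 192.168.249.116 192.168.249.134
find_fnhe 192.168.249.117 192.168.249.116 192.168.249.134
```

The first call finds a matching entry and takes the update branch; the
second walks the whole chain, ends with an empty fnhe, and takes the
create branch, so an unrelated first entry is not overwritten.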
