netdev.vger.kernel.org archive mirror
From: Maximilien Cuony <maximilien.cuony@arcanite.ch>
To: "David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [REGRESSION] Unable to NAT own TCP packets from another VRF with tcp_l3mdev_accept = 1
Date: Wed, 28 Sep 2022 16:02:43 +0200
Message-ID: <98348818-28c5-4cb2-556b-5061f77e112c@arcanite.ch>

Hello,

We're using VRFs on a machine that acts as a router, and we have a specific
issue where the router doesn't handle its own packets correctly during
NAT if the packet comes from a different VRF.

We had the issue with Debian buster (4.19), but it went away
when we updated to Debian bullseye (5.10.92).

However, after upgrading Debian bullseye to the latest kernel
(5.10.140), the issue appeared again.

We did a bisection and this led us to
"b0d67ef5b43aedbb558b9def2da5b4fffeb19966 net: allow unbound socket for 
packets in VRF when tcp_l3mdev_accept set [ Upstream commit 
944fd1aeacb627fa617f85f8e5a34f7ae8ea4d8e ]".

Simplified case setup:

There are two machines in the setup. They both forward packets
(net.ipv4.ip_forward = 1) and there are two interfaces between them.

The main machine has two VRFs. The default VRF uses the second
machine as its default route, on a specific interface.
The second machine has its default route pointing to the main machine, in
the other VRF, using the second pair of interfaces.

On the main machine, the second interface is in a specific VRF. In that 
VRF, packets are NATed to the internet on a third interface.

A diagram of the normal flow is available here:
https://etinacra.ch/kernel.png

Configuration commands:

Main machine:
sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.ip_forward=1
iptables -t raw -A PREROUTING -i eth0 -j CT --zone 5
iptables -t raw -A OUTPUT -o eth0 -j CT --zone 5
iptables -t nat -A POSTROUTING -o eth2 -j SNAT --to 192.168.1.1
cat /etc/network/interfaces

auto firewall
iface firewall
     vrf-table 1200

auto eth0
iface eth0
     address 192.168.5.1/24
     gateway 192.168.5.2

auto eth1
iface eth1
     address 192.168.10.1/24
     vrf firewall
     up ip route add 192.168.5.0/24 via 192.168.10.2 vrf firewall

auto eth2
iface eth2
     address 192.168.1.1/24
     gateway 192.168.1.250
     vrf firewall

==

Second machine:

sysctl -w net.ipv4.ip_forward=1

cat /etc/network/interfaces

auto eth0
iface eth0
     address 192.168.5.2/24

auto eth1
iface eth1
     address 192.168.10.2/24
     gateway 192.168.10.1

==

When the issue is not present, a tcpdump on all interfaces of the main
machine shows everything working as expected (output truncated):

10:28:32.811283 eth0 Out IP 192.168.5.1.55750 > 99.99.99.99.80: Flags [S], seq 2216112145
10:28:32.811666 eth1 In  IP 192.168.5.1.55750 > 99.99.99.99.80: Flags [S], seq 2216112145
10:28:32.811679 eth2 Out IP 192.168.1.1.55750 > 99.99.99.99.80: Flags [S], seq 2216112145
10:28:32.835138 eth2 In  IP 99.99.99.99.80 > 192.168.1.1.55750: Flags [S.], seq 383992840, ack 2216112146
10:28:32.835152 eth1 Out IP 99.99.99.99.80 > 192.168.5.1.55750: Flags [S.], seq 383992840, ack 2216112146
10:28:32.835457 eth0 In  IP 99.99.99.99.80 > 192.168.5.1.55750: Flags [S.], seq 383992840, ack 2216112146
10:28:32.835511 eth0 Out IP 192.168.5.1.55750 > 99.99.99.99.80: Flags [.], ack 1, win 502

However, when the issue is present, the SYN-ACK does arrive on eth2, but
is never "un-NATed" back to eth1:

10:25:07.644433 eth0 Out IP 192.168.5.1.48684 > 99.99.99.99.80: Flags [S], seq 3207393154
10:25:07.644782 eth1 In  IP 192.168.5.1.48684 > 99.99.99.99.80: Flags [S], seq 3207393154
10:25:07.644793 eth2 Out IP 192.168.1.1.48684 > 99.99.99.99.80: Flags [S], seq 3207393154
10:25:07.668551 eth2 In  IP 54.36.61.42.80 > 192.168.1.1.48684: Flags [S.], seq 823335485, ack 3207393155

The issue only affects TCP connections. UDP and ICMP work fine.

Setting net.ipv4.tcp_l3mdev_accept back to 0 also fixes the issue, but
we need this flag since we use some sockets that do not understand VRFs.

We did have a look at the diff and the code of inet_bound_dev_eq, but we
did not really understand the underlying problem. It does seem, however,
that bound_dev_if is now checked not to be zero before the bound_dev_if ==
dif || bound_dev_if == sdif comparison, which was not the case
before (especially since the result now depends on l3mdev_accept).
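
To make the difference described above concrete, here is a small userspace
sketch in C. It is not kernel source: the names match_before and
match_after and the interface indexes 4 and 7 are made up for illustration,
and the unbound-socket branch is an assumption based on our reading of the
helper, so it may not match every kernel version exactly.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the older check: a plain comparison of the
 * socket's bound device against the ingress device (dif) and its
 * L3 master device (sdif, 0 when the ingress device is not in a VRF).
 */
static inline bool match_before(int bound_dev_if, int dif, int sdif)
{
    return bound_dev_if == dif || bound_dev_if == sdif;
}

/*
 * Hypothetical model of the newer check: an unbound socket
 * (bound_dev_if == 0) is handled first, and its result depends on
 * l3mdev_accept. This branch is an assumption, not quoted from the diff.
 */
static inline bool match_after(bool l3mdev_accept, int bound_dev_if,
                               int dif, int sdif)
{
    if (!bound_dev_if)
        return !sdif || l3mdev_accept;
    return bound_dev_if == dif || bound_dev_if == sdif;
}

/* Illustrative cases; 4 stands for eth2 and 7 for the VRF device. */
static inline void check_examples(void)
{
    /* Unbound socket, packet arrives inside a VRF. */
    assert(!match_before(0, 4, 7));
    assert(match_after(true, 0, 4, 7));   /* accepted with l3mdev_accept = 1 */
    assert(!match_after(false, 0, 4, 7)); /* rejected with l3mdev_accept = 0 */

    /* Socket bound to the VRF device: both variants agree. */
    assert(match_before(7, 4, 7));
    assert(match_after(true, 7, 4, 7));
}
```

If this model is right, the only case where the two variants disagree is an
unbound socket receiving a packet from inside a VRF while l3mdev_accept is
set, which is the situation in our setup.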

Maybe our setup is wrong and we should not be able to route packets like 
that?

Thanks a lot and have a nice day!

Maximilien Cuony




