* BUG: IPv4 conntrack reassembles forwarded packets
@ 2021-01-05 12:12 ` Christian Perle
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Perle @ 2021-01-05 12:12 UTC (permalink / raw)
  To: netfilter, netdev; +Cc: steffen.klassert

Hello,

While testing several tunnel scenarios, I noticed problematic
behaviour in IPv4 conntrack.


BUG: IPv4 conntrack reassembles forwarded packets
=================================================

Conntrack needs to reassemble fragments in order to have complete
packets for rule matching. However, the IPv4 stack should not change
forwarded packets unless explicitly told to do so.
Unwanted reassembly can even lead to packet loss.

Consider the following setup:

            +--------+       +---------+       +--------+
            |Router A|-------|Wanrouter|-------|Router B|
            |        |.IPIP..|         |..IPIP.|        |
            +--------+       +---------+       +--------+
           /                  mtu 1400                   \
          /                                               \
+--------+                                                 +--------+
|Client A|                                                 |Client B|
|        |                                                 |        |
+--------+                                                 +--------+

Router A and Router B use IPIP tunnel interfaces to tunnel traffic
between Client A and Client B over WAN. Wanrouter has MTU 1400 set
on its interfaces.

Detailed setup for Router A
---------------------------
Interfaces:
eth0: 10.2.2.1/24
eth1: 192.168.10.1/24
ipip0: No IP address, local 10.2.2.1 remote 10.4.4.1
Routes:
192.168.20.0/24 dev ipip0    (192.168.20.0/24 is subnet of Client B)
10.4.4.1 via 10.2.2.254      (Router B via Wanrouter)
No iptables rules at all.

Detailed setup for Router B
---------------------------
Interfaces:
eth0: 10.4.4.1/24
eth1: 192.168.20.1/24
ipip0: No IP address, local 10.4.4.1 remote 10.2.2.1
Routes:
192.168.10.0/24 dev ipip0    (192.168.10.0/24 is subnet of Client A)
10.2.2.1 via 10.4.4.254      (Router A via Wanrouter)
No iptables rules at all.

Path MTU discovery
------------------
Running tracepath from Client A to Client B shows PMTU discovery is working
as expected:

clienta:~# tracepath 192.168.20.2
 1?: [LOCALHOST]                      pmtu 1500
 1:  192.168.10.1                                          0.867ms
 1:  192.168.10.1                                          0.302ms
 2:  192.168.10.1                                          0.312ms pmtu 1480
 2:  no reply
 3:  192.168.10.1                                          0.510ms pmtu 1380
 3:  192.168.20.2                                          2.320ms reached
     Resume: pmtu 1380 hops 3 back 3

Router A has learned PMTU (1400) to Router B from Wanrouter.
Client A has learned PMTU (1400 - IPIP overhead = 1380) to Client B
from Router A.
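
The two PMTU values in the tracepath output follow from subtracting the
tunnel overhead from the outer path MTU. A minimal sketch of that
arithmetic, assuming the IPIP overhead is a single 20-byte outer IPv4
header without options (inner_pmtu is a made-up helper name, not part of
any tool):

```shell
#!/bin/bash
# Sketch: PMTU visible to traffic inside an IPIP tunnel.
# Assumption: the only overhead is one 20-byte outer IPv4 header.
inner_pmtu() {
	local outer_pmtu=$1
	local ipip_overhead=20
	echo $((outer_pmtu - ipip_overhead))
}

inner_pmtu 1500   # before any restriction is learned: prints 1480
inner_pmtu 1400   # after Router A learned PMTU 1400:  prints 1380
```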

Send large UDP packet
---------------------
Now we send a 1400-byte UDP packet from Client A to Client B:

clienta:~# head -c1400 /dev/zero | tr "\000" "a" | nc -u 192.168.20.2 5000

The IPv4 stack on Client A already knows the PMTU to Client B, so the
UDP packet is sent as two fragments (1380 + 20). Router A forwards the
fragments between eth1 and ipip0. The fragments fit into the tunnel and
reach their destination.
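
For readers who want to see where the fragment boundary falls, the sizes
can be worked out by hand. The following is a rough sketch, not what the
kernel literally executes. It assumes a 20-byte IPv4 header without
options, an 8-byte UDP header, and that every non-final fragment carries
a multiple of 8 bytes of data. frag_sizes is a hypothetical helper name.
Under these assumptions, and with headers counted, the two fragments are
1380 and 68 bytes on the wire; the "1380 + 20" above is shorthand for the
same split.

```shell
#!/bin/bash
# Sketch: on-wire sizes of the IPv4 fragments of one UDP datagram.
# Assumptions: 20-byte IPv4 header (no options), 8-byte UDP header.
frag_sizes() {
	local payload=$1 mtu=$2
	local ip_hdr=20 udp_hdr=8
	# Bytes that have to be carried behind the IP header(s).
	local remaining=$((payload + udp_hdr))
	# Non-final fragments carry a multiple of 8 bytes.
	local per_frag=$(( (mtu - ip_hdr) / 8 * 8 ))
	local out=""

	while [ "$remaining" -gt "$per_frag" ]; do
		out="$out$((per_frag + ip_hdr)) "
		remaining=$((remaining - per_frag))
	done
	echo "$out$((remaining + ip_hdr))"
}

frag_sizes 1400 1380   # prints: 1380 68
```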

Adding conntrack iptables rule ==> packet loss
----------------------------------------------
Now on Router A the following iptables rule is added:

routera:~# iptables -t mangle -A PREROUTING -m state \
  --state ESTABLISHED -j ACCEPT

When sending the large UDP packet again, Router A now reassembles the
fragments before routing the packet over ipip0. The resulting IPIP
packet is too big (1400) for the tunnel PMTU (1380) to Router B, so it
is dropped on Router A before sending.
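
The size check behind this drop can be illustrated with a small sketch.
It assumes 20 bytes of IPIP overhead and uses the figures from this
report (a fragment of 1380 bytes, a reassembled packet of 1400 bytes, an
outer PMTU of 1400). fits_after_encap is a hypothetical helper, not a
kernel or iproute2 function. If the IP and UDP headers of the reassembled
datagram are counted as well, it is larger still, so the outcome is the
same.

```shell
#!/bin/bash
# Sketch: does a packet still fit the outer PMTU after IPIP encapsulation?
# Assumption: encapsulation adds one 20-byte outer IPv4 header.
fits_after_encap() {
	local inner_len=$1 outer_pmtu=$2
	local ipip_overhead=20
	if [ $((inner_len + ipip_overhead)) -le "$outer_pmtu" ]; then
		echo "fits"
	else
		echo "too big"
	fi
}

fits_after_encap 1380 1400   # original fragment:   prints "fits"
fits_after_encap 1400 1400   # reassembled packet:  prints "too big"
```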

Client A cannot do anything to fix this, because it already respects the
PMTU (1380) to Client B and sends fragments fitting into it.

The problem also happens when using IPsec tunnels with XFRM interfaces
(this is the actual use case; the setup above just uses IPIP for
simplicity).

IPv6 does it right
------------------
When testing a similar setup with IPv6 and ip6tnl interfaces, the
conntrack ip6tables rule does not affect the forwarded UDP fragments.
Though reassembly takes place for conntrack, the reassembled packet is
not forwarded.

So the solution would be to make IPv4 behave like IPv6: use reassembly
for conntrack purposes *only* and forward the original fragments rather
than the reassembly result.


Regards,
  Christian Perle
-- 
Christian Perle
Senior Berater / Senior Consultant
Netzwerk- und Client-Sicherheit / Network & Client Security
Öffentliche Auftraggeber / Public Authorities
secunet Security Networks AG

Tel.: +49 201 54 54-3533, Fax: +49 201 54 54-1323
E-Mail: christian.perle@secunet.com
Ammonstraße 74, 01067 Dresden, Deutschland
www.secunet.com

secunet Security Networks AG
Sitz: Kurfürstenstraße 58, 45138 Essen, Deutschland
Amtsgericht Essen HRB 13615
Vorstand: Axel Deininger (Vors.), Torsten Henn, Dr. Kai Martius, Thomas Pleines
Aufsichtsratsvorsitzender: Ralf Wintergerst


* [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole
  2021-01-05 12:12 ` Christian Perle
  (?)
@ 2021-01-05 23:15 ` Florian Westphal
  2021-01-05 23:15   ` [PATCH net 1/3] selftests: netfilter: add selftest for ipip pmtu discovery with enabled connection tracking Florian Westphal
                     ` (3 more replies)
  -1 siblings, 4 replies; 9+ messages in thread
From: Florian Westphal @ 2021-01-05 23:15 UTC (permalink / raw)
  To: netdev; +Cc: christian.perle, steffen.klassert, netfilter-devel

Christian Perle reported a PMTU blackhole due to unexpected interaction
between the ip defragmentation that comes with connection tracking and
ip tunnels.

Unfortunately setting 'nopmtudisc' on the tunnel breaks the test
scenario even without netfilter.

Christian's setup looks like this:
     +--------+       +---------+       +--------+
     |Router A|-------|Wanrouter|-------|Router B|
     |        |.IPIP..|         |..IPIP.|        |
     +--------+       +---------+       +--------+
          /             mtu 1400           \
         /                                  \
 +--------+                                  +--------+
 |Client A|                                  |Client B|
 +--------+                                  +--------+

MTU is 1500 everywhere, except on the links between Router A and
Wanrouter and between Wanrouter and Router B, where it is 1400.

Router A and Router B use IPIP tunnel interfaces to tunnel traffic
between Client A and Client B over WAN.

Client A sends a 1400 byte UDP datagram to Client B.
This packet gets encapsulated in the IPIP tunnel.

This works; the packet is received on Client B.

When conntrack (or anything else that forces ip defragmentation) is
enabled on Router A, the packet gets dropped on Router A after
encapsulation because it exceeds the link MTU.

Setting the 'nopmtudisc' flag on the IPIP tunnel makes things worse:
no packets pass even in the no-netfilter scenario.

Patch one is a reproducer script for selftest infra.

Patch two is a fix for 'nopmtudisc' behaviour so ip_tunnel will send
an icmp error to Client A.  This allows 'nopmtudisc' tunnel to forward
the UDP datagrams.

Patch three enables ip refragmentation for all reassembled packets, just
like ipv6.



* [PATCH net 1/3] selftests: netfilter: add selftest for ipip pmtu discovery with enabled connection tracking
  2021-01-05 23:15 ` [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole Florian Westphal
@ 2021-01-05 23:15   ` Florian Westphal
  2021-01-05 23:15   ` [PATCH net 2/3] net: fix pmtu check in nopmtudisc mode Florian Westphal
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2021-01-05 23:15 UTC (permalink / raw)
  To: netdev
  Cc: christian.perle, steffen.klassert, netfilter-devel,
	Florian Westphal, Shuah Khan, Pablo Neira Ayuso

Convert Christian's bug description into a reproducer.

Cc: Shuah Khan <shuah@kernel.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Reported-by: Christian Perle <christian.perle@secunet.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 tools/testing/selftests/netfilter/Makefile    |   3 +-
 .../selftests/netfilter/ipip-conntrack-mtu.sh | 206 ++++++++++++++++++
 2 files changed, 208 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/netfilter/ipip-conntrack-mtu.sh

diff --git a/tools/testing/selftests/netfilter/Makefile b/tools/testing/selftests/netfilter/Makefile
index a374e10ef506..3006a8e5b41a 100644
--- a/tools/testing/selftests/netfilter/Makefile
+++ b/tools/testing/selftests/netfilter/Makefile
@@ -4,7 +4,8 @@
 TEST_PROGS := nft_trans_stress.sh nft_nat.sh bridge_brouter.sh \
 	conntrack_icmp_related.sh nft_flowtable.sh ipvs.sh \
 	nft_concat_range.sh nft_conntrack_helper.sh \
-	nft_queue.sh nft_meta.sh
+	nft_queue.sh nft_meta.sh \
+	ipip-conntrack-mtu.sh
 
 LDLIBS = -lmnl
 TEST_GEN_FILES =  nf-queue
diff --git a/tools/testing/selftests/netfilter/ipip-conntrack-mtu.sh b/tools/testing/selftests/netfilter/ipip-conntrack-mtu.sh
new file mode 100755
index 000000000000..4a6f5c3b3215
--- /dev/null
+++ b/tools/testing/selftests/netfilter/ipip-conntrack-mtu.sh
@@ -0,0 +1,206 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+# Conntrack needs to reassemble fragments in order to have complete
+# packets for rule matching.  Reassembly can lead to packet loss.
+
+# Consider the following setup:
+#            +--------+       +---------+       +--------+
+#            |Router A|-------|Wanrouter|-------|Router B|
+#            |        |.IPIP..|         |..IPIP.|        |
+#            +--------+       +---------+       +--------+
+#           /                  mtu 1400                   \
+#          /                                               \
+#+--------+                                                 +--------+
+#|Client A|                                                 |Client B|
+#|        |                                                 |        |
+#+--------+                                                 +--------+
+
+# Router A and Router B use IPIP tunnel interfaces to tunnel traffic
+# between Client A and Client B over WAN. Wanrouter has MTU 1400 set
+# on its interfaces.
+
+rnd=$(mktemp -u XXXXXXXX)
+rx=$(mktemp)
+
+r_a="ns-ra-$rnd"
+r_b="ns-rb-$rnd"
+r_w="ns-rw-$rnd"
+c_a="ns-ca-$rnd"
+c_b="ns-cb-$rnd"
+
+checktool (){
+	if ! $1 > /dev/null 2>&1; then
+		echo "SKIP: Could not $2"
+		exit $ksft_skip
+	fi
+}
+
+checktool "iptables --version" "run test without iptables"
+checktool "ip -Version" "run test without ip tool"
+checktool "which nc" "run test without nc (netcat)"
+checktool "ip netns add ${r_a}" "create net namespace"
+
+for n in ${r_b} ${r_w} ${c_a} ${c_b};do
+	ip netns add ${n}
+done
+
+cleanup() {
+	for n in ${r_a} ${r_b} ${r_w} ${c_a} ${c_b};do
+		ip netns del ${n}
+	done
+	rm -f ${rx}
+}
+
+trap cleanup EXIT
+
+test_path() {
+	msg="$1"
+
+	ip netns exec ${c_b} nc -n -w 3 -q 3 -u -l -p 5000 > ${rx} < /dev/null &
+
+	sleep 1
+	for i in 1 2 3; do
+		head -c1400 /dev/zero | tr "\000" "a" | ip netns exec ${c_a} nc -n -w 1 -u 192.168.20.2 5000
+	done
+
+	wait
+
+	bytes=$(wc -c < ${rx})
+
+	if [ $bytes -eq 1400 ];then
+		echo "OK: PMTU $msg connection tracking"
+	else
+		echo "FAIL: PMTU $msg connection tracking: got $bytes, expected 1400"
+		exit 1
+	fi
+}
+
+# Detailed setup for Router A
+# ---------------------------
+# Interfaces:
+# eth0: 10.2.2.1/24
+# eth1: 192.168.10.1/24
+# ipip0: No IP address, local 10.2.2.1 remote 10.4.4.1
+# Routes:
+# 192.168.20.0/24 dev ipip0    (192.168.20.0/24 is subnet of Client B)
+# 10.4.4.1 via 10.2.2.254      (Router B via Wanrouter)
+# No iptables rules at all.
+
+ip link add veth0 netns ${r_a} type veth peer name veth0 netns ${r_w}
+ip link add veth1 netns ${r_a} type veth peer name veth0 netns ${c_a}
+
+l_addr="10.2.2.1"
+r_addr="10.4.4.1"
+ip netns exec ${r_a} ip link add ipip0 type ipip local ${l_addr} remote ${r_addr} mode ipip || exit $ksft_skip
+
+for dev in lo veth0 veth1 ipip0; do
+    ip -net ${r_a} link set $dev up
+done
+
+ip -net ${r_a} addr add 10.2.2.1/24 dev veth0
+ip -net ${r_a} addr add 192.168.10.1/24 dev veth1
+
+ip -net ${r_a} route add 192.168.20.0/24 dev ipip0
+ip -net ${r_a} route add 10.4.4.0/24 via 10.2.2.254
+
+ip netns exec ${r_a} sysctl -q net.ipv4.conf.all.forwarding=1 > /dev/null
+
+# Detailed setup for Router B
+# ---------------------------
+# Interfaces:
+# eth0: 10.4.4.1/24
+# eth1: 192.168.20.1/24
+# ipip0: No IP address, local 10.4.4.1 remote 10.2.2.1
+# Routes:
+# 192.168.10.0/24 dev ipip0    (192.168.10.0/24 is subnet of Client A)
+# 10.2.2.1 via 10.4.4.254      (Router A via Wanrouter)
+# No iptables rules at all.
+
+ip link add veth0 netns ${r_b} type veth peer name veth1 netns ${r_w}
+ip link add veth1 netns ${r_b} type veth peer name veth0 netns ${c_b}
+
+l_addr="10.4.4.1"
+r_addr="10.2.2.1"
+
+ip netns exec ${r_b} ip link add ipip0 type ipip local ${l_addr} remote ${r_addr} mode ipip || exit $ksft_skip
+
+for dev in lo veth0 veth1 ipip0; do
+	ip -net ${r_b} link set $dev up
+done
+
+ip -net ${r_b} addr add 10.4.4.1/24 dev veth0
+ip -net ${r_b} addr add 192.168.20.1/24 dev veth1
+
+ip -net ${r_b} route add 192.168.10.0/24 dev ipip0
+ip -net ${r_b} route add 10.2.2.0/24 via 10.4.4.254
+ip netns exec ${r_b} sysctl -q net.ipv4.conf.all.forwarding=1 > /dev/null
+
+# Client A
+ip -net ${c_a} addr add 192.168.10.2/24 dev veth0
+ip -net ${c_a} link set dev lo up
+ip -net ${c_a} link set dev veth0 up
+ip -net ${c_a} route add default via 192.168.10.1
+
+# Client B
+ip -net ${c_b} addr add 192.168.20.2/24 dev veth0
+ip -net ${c_b} link set dev veth0 up
+ip -net ${c_b} link set dev lo up
+ip -net ${c_b} route add default via 192.168.20.1
+
+# Wan
+ip -net ${r_w} addr add 10.2.2.254/24 dev veth0
+ip -net ${r_w} addr add 10.4.4.254/24 dev veth1
+
+ip -net ${r_w} link set dev lo up
+ip -net ${r_w} link set dev veth0 up mtu 1400
+ip -net ${r_w} link set dev veth1 up mtu 1400
+
+ip -net ${r_a} link set dev veth0 mtu 1400
+ip -net ${r_b} link set dev veth0 mtu 1400
+
+ip netns exec ${r_w} sysctl -q net.ipv4.conf.all.forwarding=1 > /dev/null
+
+# Path MTU discovery
+# ------------------
+# Running tracepath from Client A to Client B shows PMTU discovery is working
+# as expected:
+#
+# clienta:~# tracepath 192.168.20.2
+# 1?: [LOCALHOST]                      pmtu 1500
+# 1:  192.168.10.1                                          0.867ms
+# 1:  192.168.10.1                                          0.302ms
+# 2:  192.168.10.1                                          0.312ms pmtu 1480
+# 2:  no reply
+# 3:  192.168.10.1                                          0.510ms pmtu 1380
+# 3:  192.168.20.2                                          2.320ms reached
+# Resume: pmtu 1380 hops 3 back 3
+
+# ip netns exec ${c_a} traceroute --mtu 192.168.20.2
+
+# Router A has learned PMTU (1400) to Router B from Wanrouter.
+# Client A has learned PMTU (1400 - IPIP overhead = 1380) to Client B
+# from Router A.
+
+#Send large UDP packet
+#---------------------
+#Now we send a 1400 bytes UDP packet from Client A to Client B:
+
+# clienta:~# head -c1400 /dev/zero | tr "\000" "a" | nc -u 192.168.20.2 5000
+test_path "without"
+
+# The IPv4 stack on Client A already knows the PMTU to Client B, so the
+# UDP packet is sent as two fragments (1380 + 20). Router A forwards the
+# fragments between eth1 and ipip0. The fragments fit into the tunnel and
+# reach their destination.
+
+#When sending the large UDP packet again, Router A now reassembles the
+#fragments before routing the packet over ipip0. The resulting IPIP
+#packet is too big (1400) for the tunnel PMTU (1380) to Router B, it is
+#dropped on Router A before sending.
+
+ip netns exec ${r_a} iptables -A FORWARD -m conntrack --ctstate NEW
+test_path "with"
-- 
2.26.2



* [PATCH net 2/3] net: fix pmtu check in nopmtudisc mode
  2021-01-05 23:15 ` [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole Florian Westphal
  2021-01-05 23:15   ` [PATCH net 1/3] selftests: netfilter: add selftest for ipip pmtu discovery with enabled connection tracking Florian Westphal
@ 2021-01-05 23:15   ` Florian Westphal
  2021-01-05 23:15   ` [PATCH net 3/3] net: ip: always refragment ip defragmented packets Florian Westphal
  2021-01-07 22:14   ` [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole Pablo Neira Ayuso
  3 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2021-01-05 23:15 UTC (permalink / raw)
  To: netdev
  Cc: christian.perle, steffen.klassert, netfilter-devel,
	Florian Westphal, Stefano Brivio

For some reason ip_tunnel insists on setting the DF bit anyway when the
inner header has the DF bit set, EVEN if the tunnel was configured with
'nopmtudisc'.

This means that the script added in the previous commit
cannot be made to work by adding the 'nopmtudisc' flag to the
ip tunnel configuration. Doing so breaks connectivity even for the
without-conntrack/netfilter scenario.

When nopmtudisc is set, the tunnel will skip the mtu check, so no
ICMP error is sent to the client. Then, because the inner header has DF
set, the outer header gets added with the DF bit set as well.

The IP stack then sends an error to itself because the packet exceeds
the device MTU.

Fixes: 23a3647bc4f93 ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/ip_tunnel.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
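
As an illustration of the DF handling described in the commit message,
here is a toy model in shell. It only mirrors the two lines that the diff
moves in front of the PMTU check; the function and argument names are
invented for this sketch and are not kernel identifiers. Flags are 0 or 1.
It assumes an IPv4 inner packet and that a 'nopmtudisc' tunnel has no DF
bit in its configured frag_off.

```shell
#!/bin/bash
# Toy model of how the outer DF bit is derived for an IPv4 inner packet.
#   tunnel_df: DF bit from the tunnel parameters (0 with 'nopmtudisc')
#   inner_df:  DF bit of the inner IPv4 header
#   ignore_df: 1 if the tunnel is configured to ignore the inner DF bit
outer_df() {
	local tunnel_df=$1 inner_df=$2 ignore_df=$3
	local df=$tunnel_df
	if [ "$ignore_df" -eq 0 ]; then
		df=$((df | inner_df))
	fi
	echo "$df"
}

outer_df 0 1 0   # nopmtudisc tunnel, inner DF set: prints 1
outer_df 0 1 1   # same, but ignore_df set:         prints 0
outer_df 0 0 0   # inner DF clear:                  prints 0
```

In the first case the outer header ends up with DF set even though the
tunnel itself did not ask for it, which is why the patch passes the
combined value to the PMTU check instead of the tunnel parameter alone.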

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index ee65c9225178..64594aa755f0 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -759,8 +759,11 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		goto tx_error;
 	}
 
-	if (tnl_update_pmtu(dev, skb, rt, tnl_params->frag_off, inner_iph,
-			    0, 0, false)) {
+	df = tnl_params->frag_off;
+	if (skb->protocol == htons(ETH_P_IP) && !tunnel->ignore_df)
+		df |= (inner_iph->frag_off & htons(IP_DF));
+
+	if (tnl_update_pmtu(dev, skb, rt, df, inner_iph, 0, 0, false)) {
 		ip_rt_put(rt);
 		goto tx_error;
 	}
@@ -788,10 +791,6 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 			ttl = ip4_dst_hoplimit(&rt->dst);
 	}
 
-	df = tnl_params->frag_off;
-	if (skb->protocol == htons(ETH_P_IP) && !tunnel->ignore_df)
-		df |= (inner_iph->frag_off&htons(IP_DF));
-
 	max_headroom = LL_RESERVED_SPACE(rt->dst.dev) + sizeof(struct iphdr)
 			+ rt->dst.header_len + ip_encap_hlen(&tunnel->encap);
 	if (max_headroom > dev->needed_headroom)
-- 
2.26.2



* [PATCH net 3/3] net: ip: always refragment ip defragmented packets
  2021-01-05 23:15 ` [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole Florian Westphal
  2021-01-05 23:15   ` [PATCH net 1/3] selftests: netfilter: add selftest for ipip pmtu discovery with enabled connection tracking Florian Westphal
  2021-01-05 23:15   ` [PATCH net 2/3] net: fix pmtu check in nopmtudisc mode Florian Westphal
@ 2021-01-05 23:15   ` Florian Westphal
  2021-01-07  7:52     ` Christian Perle
  2021-01-07 22:14   ` [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole Pablo Neira Ayuso
  3 siblings, 1 reply; 9+ messages in thread
From: Florian Westphal @ 2021-01-05 23:15 UTC (permalink / raw)
  To: netdev
  Cc: christian.perle, steffen.klassert, netfilter-devel, Florian Westphal

Conntrack reassembly records the largest fragment size seen in IPCB.
However, when this gets forwarded/transmitted, fragmentation will only
be forced if one of the fragmented packets had the DF bit set.

In that case, a flag in IPCB will force fragmentation even if the
MTU is large enough.

This should work fine, but it breaks with ip tunnels.
Consider a client that sends a UDP datagram of size X to another host.

The client fragments the datagram, so two packets, of size Y and Z, are
sent. The DF bit is not set on any of these packets.

Middlebox netfilter reassembles those packets back into a single size-X
packet before the routing decision.

The packet-size-vs-mtu checks in ip_forward are irrelevant because the DF
bit isn't set.  At output time, ip refragmentation is skipped as well
because X is still smaller than the mtu of the output device.

If the transmit device is an ip tunnel, the packet size increases to
X+overhead.

Also, the tunnel might be configured to force the DF bit on the outer
header.

In this case, the packet will be dropped (exceeds MTU) and an ICMP error
is generated back to the sender.

But the sender already respects the announced MTU; all the packets that
it sent did fit the announced mtu.

Force refragmentation as per original sizes unconditionally so ip tunnel
will encapsulate the fragments instead.

The only other solution I see is to place ip refragmentation in
the ip_tunnel code to handle this case.

Fixes: d6b915e29f4ad ("ip_fragment: don't forward defragmented DF packet")
Reported-by: Christian Perle <christian.perle@secunet.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/ip_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
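
The effect of the one-line change can be shown with a toy model of the
condition before and after the patch. This is a sketch in shell, not
kernel code; decide_old and decide_new are invented names. The example
values are assumptions: a reassembled packet of 1400 bytes (the figure
from the report), an output device MTU of 1480, no DF bit on the original
fragments (so the IPSKB_FRAG_PMTU flag is not set), and a largest
original fragment of 1380 bytes recorded as frag_max_size.

```shell
#!/bin/bash
# Toy model of the "fragment or not" test at IPv4 output time.

# Before the patch: fragment if too long OR the FRAG_PMTU flag is set.
decide_old() {
	local len=$1 mtu=$2 frag_pmtu_flag=$3
	if [ "$len" -gt "$mtu" ] || [ "$frag_pmtu_flag" -ne 0 ]; then
		echo "fragment"
	else
		echo "send whole"
	fi
}

# After the patch: fragment if too long OR the packet was reassembled
# (a non-zero frag_max_size was recorded during defragmentation).
decide_new() {
	local len=$1 mtu=$2 frag_max_size=$3
	if [ "$len" -gt "$mtu" ] || [ "$frag_max_size" -ne 0 ]; then
		echo "fragment"
	else
		echo "send whole"
	fi
}

decide_old 1400 1480 0      # prints "send whole"
decide_new 1400 1480 1380   # prints "fragment"
```

With the old condition the reassembled packet reaches the tunnel in one
piece; with the new one it is split again before encapsulation.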

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 89fff5f59eea..2ed0b01f72f0 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -302,7 +302,7 @@ static int __ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *
 	if (skb_is_gso(skb))
 		return ip_finish_output_gso(net, sk, skb, mtu);
 
-	if (skb->len > mtu || (IPCB(skb)->flags & IPSKB_FRAG_PMTU))
+	if (skb->len > mtu || IPCB(skb)->frag_max_size)
 		return ip_fragment(net, sk, skb, mtu, ip_finish_output2);
 
 	return ip_finish_output2(net, sk, skb);
-- 
2.26.2



* Re: [PATCH net 3/3] net: ip: always refragment ip defragmented packets
  2021-01-05 23:15   ` [PATCH net 3/3] net: ip: always refragment ip defragmented packets Florian Westphal
@ 2021-01-07  7:52     ` Christian Perle
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Perle @ 2021-01-07  7:52 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, steffen.klassert, netfilter-devel

Hello Florian,

On Wed, Jan 06, 2021 at 00:15:23 +0100, Florian Westphal wrote:

> Force refragmentation as per original sizes unconditionally so ip tunnel
> will encapsulate the fragments instead.
[...]
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 89fff5f59eea..2ed0b01f72f0 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -302,7 +302,7 @@ static int __ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *
>  	if (skb_is_gso(skb))
>  		return ip_finish_output_gso(net, sk, skb, mtu);
>  
> -	if (skb->len > mtu || (IPCB(skb)->flags & IPSKB_FRAG_PMTU))
> +	if (skb->len > mtu || IPCB(skb)->frag_max_size)
>  		return ip_fragment(net, sk, skb, mtu, ip_finish_output2);
>  
>  	return ip_finish_output2(net, sk, skb);
> -- 
> 2.26.2

Did some tests yesterday and I can confirm that this patch fixes the
problem for both IPIP tunnel and XFRM tunnel interfaces.

Thanks for the fix!
  Christian Perle

* Re: [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole
  2021-01-05 23:15 ` [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole Florian Westphal
                     ` (2 preceding siblings ...)
  2021-01-05 23:15   ` [PATCH net 3/3] net: ip: always refragment ip defragmented packets Florian Westphal
@ 2021-01-07 22:14   ` Pablo Neira Ayuso
  2021-01-07 22:45     ` Jakub Kicinski
  3 siblings, 1 reply; 9+ messages in thread
From: Pablo Neira Ayuso @ 2021-01-07 22:14 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netdev, christian.perle, steffen.klassert, netfilter-devel

On Wed, Jan 06, 2021 at 12:15:20AM +0100, Florian Westphal wrote:
> Christian Perle reported a PMTU blackhole due to unexpected interaction
> between the ip defragmentation that comes with connection tracking and
> ip tunnels.
> 
> Unfortunately setting 'nopmtudisc' on the tunnel breaks the test
> scenario even without netfilter.
> 
> Christian's setup looks like this:
>      +--------+       +---------+       +--------+
>      |Router A|-------|Wanrouter|-------|Router B|
>      |        |.IPIP..|         |..IPIP.|        |
>      +--------+       +---------+       +--------+
>           /             mtu 1400           \
>          /                                  \
>  +--------+                                  +--------+
>  |Client A|                                  |Client B|
>  +--------+                                  +--------+
> 
> MTU is 1500 everywhere, except on Router A to Wanrouter and
> Wanrouter to Router B.
> 
> Router A and Router B use IPIP tunnel interfaces to tunnel traffic
> between Client A and Client B over WAN.
> 
> Client A sends a 1400 byte UDP datagram to Client B.
> This packet gets encapsulated in the IPIP tunnel.
> 
> This works, packet is received on client B.
> 
> When conntrack (or anything else that forces ip defragmentation) is
> enabled on Router A, the packet gets dropped on Router A after
> encapsulation because they exceed the link MTU.
> 
> Setting the 'nopmtudisc' flag on the IPIP tunnel makes things worse,
> no packets pass even in the no-netfilter scenario.
> 
> Patch one is a reproducer script for selftest infra.
> 
> Patch two is a fix for 'nopmtudisc' behaviour so ip_tunnel will send
> an icmp error to Client A.  This allows 'nopmtudisc' tunnel to forward
> the UDP datagrams.
> 
> Patch three enables ip refragmentation for all reassembled packets, just
> like ipv6.

Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>

Thanks.


* Re: [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole
  2021-01-07 22:14   ` [PATCH net 0/3] net: fix netfilter defrag/ip tunnel pmtu blackhole Pablo Neira Ayuso
@ 2021-01-07 22:45     ` Jakub Kicinski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2021-01-07 22:45 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Florian Westphal
  Cc: netdev, christian.perle, steffen.klassert, netfilter-devel

On Thu, 7 Jan 2021 23:14:03 +0100 Pablo Neira Ayuso wrote:
> On Wed, Jan 06, 2021 at 12:15:20AM +0100, Florian Westphal wrote:
> > Christian Perle reported a PMTU blackhole due to unexpected interaction
> > between the ip defragmentation that comes with connection tracking and
> > ip tunnels.
> > 
> > Unfortunately setting 'nopmtudisc' on the tunnel breaks the test
> > scenario even without netfilter.
> > 
> > Christian's setup looks like this:
> >      +--------+       +---------+       +--------+
> >      |Router A|-------|Wanrouter|-------|Router B|
> >      |        |.IPIP..|         |..IPIP.|        |
> >      +--------+       +---------+       +--------+
> >           /             mtu 1400           \
> >          /                                  \
> >  +--------+                                  +--------+
> >  |Client A|                                  |Client B|
> >  +--------+                                  +--------+
> > 
> > MTU is 1500 everywhere, except on Router A to Wanrouter and
> > Wanrouter to Router B.
> > 
> > Router A and Router B use IPIP tunnel interfaces to tunnel traffic
> > between Client A and Client B over WAN.
> > 
> > Client A sends a 1400 byte UDP datagram to Client B.
> > This packet gets encapsulated in the IPIP tunnel.
> > 
> > This works, packet is received on client B.
> > 
> > When conntrack (or anything else that forces ip defragmentation) is
> > enabled on Router A, the packet gets dropped on Router A after
> > encapsulation because they exceed the link MTU.
> > 
> > Setting the 'nopmtudisc' flag on the IPIP tunnel makes things worse,
> > no packets pass even in the no-netfilter scenario.
> > 
> > Patch one is a reproducer script for selftest infra.
> > 
> > Patch two is a fix for 'nopmtudisc' behaviour so ip_tunnel will send
> > an icmp error to Client A.  This allows 'nopmtudisc' tunnel to forward
> > the UDP datagrams.
> > 
> > Patch three enables ip refragmentation for all reassembled packets, just
> > like ipv6.  
> 
> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>

Applied, thanks!
