Re: [PATCH] xfrm:fragmented ipv4 tunnel packets in inner interface

From: Steffen Klassert <steffen.klassert@secunet.com>
To: Lorenzo Colitti <lorenzo@google.com>
Cc: mtk81216 <lina.wang@mediatek.com>,
	"David S . Miller" <davem@davemloft.net>,
	"Alexey Kuznetsov" <kuznet@ms2.inr.ac.ru>,
	"Hideaki YOSHIFUJI" <yoshfuji@linux-ipv6.org>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Herbert Xu" <herbert@gondor.apana.org.au>,
	"Matthias Brugger" <matthias.bgg@gmail.com>,
	"Linux NetDev" <netdev@vger.kernel.org>,
	lkml <linux-kernel@vger.kernel.org>,
	linux-arm-kernel@lists.infradead.org,
	linux-mediatek@lists.infradead.org,
	"Maciej Żenczykowski" <maze@google.com>,
	"Greg Kroah-Hartman" <gregkh@google.com>
Subject: Re: [PATCH] xfrm:fragmented ipv4 tunnel packets in inner interface
Date: Mon, 9 Nov 2020 10:58:13 +0100	[thread overview]
Message-ID: <20201109095813.GV26422@gauss3.secunet.de> (raw)
In-Reply-To: <CAKD1Yr1VsueZWUtCL4iMWLhnADypUOtDK=dgqM2Y2HDvXc77iw@mail.gmail.com>

On Thu, Nov 05, 2020 at 01:52:01PM +0900, Lorenzo Colitti wrote:
> On Tue, Sep 15, 2020 at 4:30 PM Steffen Klassert
> <steffen.klassert@secunet.com> wrote:
> > > In esp's tunnel mode,if inner interface is ipv4,outer is ipv4,one big
> > > packet which travels through tunnel will be fragmented with outer
> > > interface's mtu,peer server will remove tunnelled esp header and assemble
> > > them in big packet.After forwarding such packet to next endpoint,it will
> > > be dropped because of exceeding mtu or be returned ICMP(packet-too-big).
> >
> > What is the exact case where packets are dropped? Given that the packet
> > was fragmented (and reassembled), I'd assume the DF bit was not set. So
> > every router along the path is allowed to fragment again if needed.
> 
> In general, isn't it just suboptimal to rely on fragmentation if the
> sender already knows the packet is too big? That's why we have things
> like path MTU discovery (RFC 1191).

When we setup packets that are sent from a local socket, we take
MTU/PMTU info we have into account. So we don't create fragments in
that case.

When forwarding packets it is different. The router that can not
TX the packet because it exceeds the MTU of the sending interface
is responsible to either fragment (if DF is not set), or send a
PMTU notification (if DF is set). So if we are able to transmit
the packet, we do it.

> Fragmentation is generally
> expensive, increases the chance of packet loss, and has historically
> caused lots of security vulnerabilities. Also, in real world networks,
> fragments sometimes just don't work, either because intermediate
> routers don't fragment, or because firewalls drop the fragments due to
> security reasons.
> 
> While it's possible in theory to ask these operators to configure
> their routers to fragment packets, that may not result in the network
> being fixed, due to hardware constraints, security policy or other
> reasons.

We can not really do anything here. If a flow has no DF bit set
on the packets, we can not rely on PMTU information. If we have PMTU
info on the route, then we have it because some other flow (that has
DF bit set on the packets) triggered PMTU discovery. That means that
the PMTU information is reset when this flow (with DF set) stops
sending packets. So the other flow (with DF not set) will send
big packets again.

> Those operators may also be in a position to place
> requirements on devices that have to use their network. If the Linux
> stack does not work as is on these networks, then those devices will
> have to meet those requirements by making out-of-tree changes. It
> would be good to avoid that if there's a better solution (e.g., make
> this configurable via sysctl).

We should not try to workaround broken configurations, there are just
too many possibilities to configure a broken network.