Re: [PATCH] xfrm:fragmented ipv4 tunnel packets in inner interface

From: "Maciej Żenczykowski" <maze@google.com>
To: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Lorenzo Colitti <lorenzo@google.com>,
	mtk81216 <lina.wang@mediatek.com>,
	"David S . Miller" <davem@davemloft.net>,
	Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>,
	Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	Jakub Kicinski <kuba@kernel.org>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	Matthias Brugger <matthias.bgg@gmail.com>,
	Linux NetDev <netdev@vger.kernel.org>,
	lkml <linux-kernel@vger.kernel.org>,
	linux-arm-kernel@lists.infradead.org,
	linux-mediatek@lists.infradead.org,
	Greg Kroah-Hartman <gregkh@google.com>
Subject: Re: [PATCH] xfrm:fragmented ipv4 tunnel packets in inner interface
Date: Mon, 9 Nov 2020 11:38:16 -0800	[thread overview]
Message-ID: <CANP3RGfuOGoB1msF1evzsgKf5qZZbNDCHDzvgPBHRGyepDuu+g@mail.gmail.com> (raw)
In-Reply-To: <20201109095813.GV26422@gauss3.secunet.de>

On Mon, Nov 9, 2020 at 1:58 AM Steffen Klassert
<steffen.klassert@secunet.com> wrote:
>
> On Thu, Nov 05, 2020 at 01:52:01PM +0900, Lorenzo Colitti wrote:
> > On Tue, Sep 15, 2020 at 4:30 PM Steffen Klassert
> > <steffen.klassert@secunet.com> wrote:
> > > > In esp's tunnel mode,if inner interface is ipv4,outer is ipv4,one big
> > > > packet which travels through tunnel will be fragmented with outer
> > > > interface's mtu,peer server will remove tunnelled esp header and assemble
> > > > them in big packet.After forwarding such packet to next endpoint,it will
> > > > be dropped because of exceeding mtu or be returned ICMP(packet-too-big).
> > >
> > > What is the exact case where packets are dropped? Given that the packet
> > > was fragmented (and reassembled), I'd assume the DF bit was not set. So
> > > every router along the path is allowed to fragment again if needed.
> >
> > In general, isn't it just suboptimal to rely on fragmentation if the
> > sender already knows the packet is too big? That's why we have things
> > like path MTU discovery (RFC 1191).
>
> When we setup packets that are sent from a local socket, we take
> MTU/PMTU info we have into account. So we don't create fragments in
> that case.
>
> When forwarding packets it is different. The router that can not
> TX the packet because it exceeds the MTU of the sending interface
> is responsible to either fragment (if DF is not set), or send a
> PMTU notification (if DF is set). So if we are able to transmit
> the packet, we do it.
>
> > Fragmentation is generally
> > expensive, increases the chance of packet loss, and has historically
> > caused lots of security vulnerabilities. Also, in real world networks,
> > fragments sometimes just don't work, either because intermediate
> > routers don't fragment, or because firewalls drop the fragments due to
> > security reasons.
> >
> > While it's possible in theory to ask these operators to configure
> > their routers to fragment packets, that may not result in the network
> > being fixed, due to hardware constraints, security policy or other
> > reasons.
>
> We can not really do anything here. If a flow has no DF bit set
> on the packets, we can not rely on PMTU information. If we have PMTU
> info on the route, then we have it because some other flow (that has
> DF bit set on the packets) triggered PMTU discovery. That means that
> the PMTU information is reset when this flow (with DF set) stops
> sending packets. So the other flow (with DF not set) will send
> big packets again.

PMTU is by default ignored by forwarding - because it's spoofable.

That said I wonder if my recent changes to honour route mtu (for ipv4)
haven't fixed this particular issue in the presence of correctly
configured device/route mtus...

I don't understand if the problem here is locally generated packets,
or forwarded packets.

It does seem like there is (or was) a bug somewhere... but it might
already be fixed (see above) or might be caused by a misconfiguration
of device mtu or routing rules.

I don't really understand the example.

>
> > Those operators may also be in a position to place
> > requirements on devices that have to use their network. If the Linux
> > stack does not work as is on these networks, then those devices will
> > have to meet those requirements by making out-of-tree changes. It
> > would be good to avoid that if there's a better solution (e.g., make
> > this configurable via sysctl).
>
> We should not try to workaround broken configurations, there are just
> too many possibilities to configure a broken network.