question: frag_max_size not checked in ip_finish_output

From: Wenxin Wang <wenxin.wang94@gmail.com>
To: kernelnewbies@kernelnewbies.org
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>,
	"David S. Miller" <davem@davemloft.net>
Subject: question: frag_max_size not checked in ip_finish_output
Date: Wed, 21 Nov 2018 10:01:38 +0800	[thread overview]
Message-ID: <CAA3R8U-+i9mpWTyPNz5wgXCFf2Z8wQ=gT-NxfjRGKjkjFD62NA@mail.gmail.com> (raw)
Message-ID: <20181121020138.rM7Q9FpKfn7YcL_JDnBbPX4D2eYn2xVCUTEa78NdT0Y@z> (raw)

Dear developers,

It seems that with defragmentation enabled,
`ip_finish_output` doesn't honor `IPCB(skb)->frag_max_size`,
while `ip6_finish_output` checks `IP6CB(skb)->frag_max_size`.
(Sorry for the reposting, I found that I need to subscribe
to the mailing list, and I also add results of my experiment)

The relevant code is here
https://elixir.bootlin.com/linux/latest/source/net/ipv4/ip_output.c#L310
https://elixir.bootlin.com/linux/latest/source/net/ipv6/ip6_output.c#L151

As far as I know, `frag_max_size` prevents the forwarding routine
from sending packets longer than the maximum fragment received.
I'm wondering why `ip_finish_output` doesn't do the same as
`ip6_finish_output`, especially when `ip_fragment` and `ip_do_fragment`,
called (indirectly) by `ip_finish_output` itself, both cap the output mtu
by this `frag_max_size`.

I did some experiement with two connected machines, one as a router
with ipv4/ipv6 NAT and 1500 mtu, the other as a client behind NAT
and 1280 mtu. Since NAT was enabled on the router it will do
defragmentation. Using `traceroute`, the client sent udp packets with length
1500, which were fragmented by itself. I captured packets
on the ingress and egress port of the NAT router, and here's the output:

--- IPv4 with NAT (and defrag)
ingress:
09:26:39.030436 IP 192.168.1.2.44654 > 223.5.5.5.33434: UDP, bad
length 1472 > 1248
09:26:39.030451 IP 192.168.1.2 > 223.5.5.5: ip-proto-17

egress:
09:26:39.030543 IP 202.38.101.2.58599 > 223.5.5.5.33437: UDP, length 1472

--- IPv6 with NAT (and defrag)
ingress:
09:18:26.947246 IP6 fdff::2 > 2001:250:3::1: frag (0|1232) 57242 >
33434: UDP, bad length 1452 > 1224
09:18:26.947262 IP6 fdff::2 > 2001:250:3::1: frag (1232|228)

egress:
09:18:26.947362 IP6 2001:da8:c4c6:2::2 > 2001:250:3::1: frag (0|1232)
49616 > 33437: UDP, bad length 1452 > 1224
09:18:26.947365 IP6 2001:da8:c4c6:2::2 > 2001:250:3::1: frag (1232|228)
-

It can be seen that with defragmentation, IPv6 keeps the fragments
below `frag_max_size`, while IPv4 doesn't. I understand that IPv6 routers
are not allowed to meddle with fragmentation, while IPv4 routes can at least
further fragment packets; but judging from the behavior of `ip_fragment`,
I think the IPv4 code is also trying to honor `frag_max_size`, but didn't
check it when deciding to do fragmentation or not.

Many thanks in advance!
If I'm sending to the wrong person, or wrong mailing list, please let me
know. It's my first time trying to ask questions to Linux developers, and
sorry for the disturbance.

Thank you for making Linux great ;)
Sincerely,
Wenxin Wang

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies