netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* linux-3.6+, gre+ipsec+forwarding = IP fragmentation broken
@ 2013-03-13 15:14 Timo Teras
  2013-03-15  9:25 ` Timo Teras
  0 siblings, 1 reply; 6+ messages in thread
From: Timo Teras @ 2013-03-13 15:14 UTC (permalink / raw)
  To: netdev

In the typical DMVPN setup with IPv4-ESP-GRE-IPv4 stack, it seems that
IPv4 fragmentation got broke around 3.6 for forwarded packets.

It would seem that fragmentation works for locally generated packets.
Also PMTU (DF set) seems to work for both forwarded and locally
generated packets. But forwarded packets to gre device that gets IPsec
encrypted do not get fragmented properly.

3.4.x kernels work, 3.6 and 3.8 series tested and fail similarly.

I was going through the changelog and it seems that MTU is now handled
in nexthop exceptions and one needs to produce the full flow info to
update it. I'm wonding if this does not hold true in my code path as
ip_gre rewraps the forwarded packet and creates new IP header - when it
next goes to the xfrm code (which sends the ICMP error) the inner iphdr
is no longer accessible. Would this cause the breakage that I'm seeing?
Or the forward flow's mtu still updated somehow?

- Timo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: linux-3.6+, gre+ipsec+forwarding = IP fragmentation broken
  2013-03-13 15:14 linux-3.6+, gre+ipsec+forwarding = IP fragmentation broken Timo Teras
@ 2013-03-15  9:25 ` Timo Teras
  2013-03-15 11:38   ` Timo Teras
  0 siblings, 1 reply; 6+ messages in thread
From: Timo Teras @ 2013-03-15  9:25 UTC (permalink / raw)
  To: netdev

On Wed, 13 Mar 2013 17:14:53 +0200
Timo Teras <timo.teras@iki.fi> wrote:

> In the typical DMVPN setup with IPv4-ESP-GRE-IPv4 stack, it seems that
> IPv4 fragmentation got broke around 3.6 for forwarded packets.
> 
> It would seem that fragmentation works for locally generated packets.
> Also PMTU (DF set) seems to work for both forwarded and locally
> generated packets. But forwarded packets to gre device that gets IPsec
> encrypted do not get fragmented properly.
> 
> 3.4.x kernels work, 3.6 and 3.8 series tested and fail similarly.

Actually 3.4.x vanilla does not work. It works only with 38d523e "ipv4:
Remove output route check in ipv4_mtu" applied which I've been
cherry-picking to my builds.

> I was going through the changelog and it seems that MTU is now handled
> in nexthop exceptions and one needs to produce the full flow info to
> update it. I'm wonding if this does not hold true in my code path as
> ip_gre rewraps the forwarded packet and creates new IP header - when
> it next goes to the xfrm code (which sends the ICMP error) the inner
> iphdr is no longer accessible. Would this cause the breakage that I'm
> seeing? Or the forward flow's mtu still updated somehow?

I have now a theory on what goes wrong.

My gre tunnel is configured with 'ttl 64' so the tunnel IP header
always gets DF bit set to do proper path-mtu. The kind of locally
generated ICMP messages I get, imply that re-fragmentation happens only
on the tunnel's IPv4 header level - but it'll be too late then: the
large packet is queued, IPsec'ed and it is the IPsec'ed packet that
gets is tried to be fragmented (but it has DF set so it fails and
packet is dropped).

I believe ip_gre should explicitly fragment the inner IPv4 and IPv6
packets if the tunnel's ttl is not inherited (resulting in DF bit set
in the tunnel's IPv4 header).

So basically ip_gre worked wrong all along - things just happened to
work due to GRO/GSO not implemented in ip_gre, and the way (the now
deleted) routing cache exposed pmtu.

Does this make sense?

- Timo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: linux-3.6+, gre+ipsec+forwarding = IP fragmentation broken
  2013-03-15  9:25 ` Timo Teras
@ 2013-03-15 11:38   ` Timo Teras
  2013-03-15 13:03     ` Timo Teras
  0 siblings, 1 reply; 6+ messages in thread
From: Timo Teras @ 2013-03-15 11:38 UTC (permalink / raw)
  To: netdev

On Fri, 15 Mar 2013 11:25:16 +0200
Timo Teras <timo.teras@iki.fi> wrote:

> On Wed, 13 Mar 2013 17:14:53 +0200
> Timo Teras <timo.teras@iki.fi> wrote:
> 
> > In the typical DMVPN setup with IPv4-ESP-GRE-IPv4 stack, it seems
> > that IPv4 fragmentation got broke around 3.6 for forwarded packets.
> > 
> > It would seem that fragmentation works for locally generated
> > packets. Also PMTU (DF set) seems to work for both forwarded and
> > locally generated packets. But forwarded packets to gre device that
> > gets IPsec encrypted do not get fragmented properly.
> > 
> > 3.4.x kernels work, 3.6 and 3.8 series tested and fail similarly.
> 
> Actually 3.4.x vanilla does not work. It works only with 38d523e
> "ipv4: Remove output route check in ipv4_mtu" applied which I've been
> cherry-picking to my builds.
> 
> > I was going through the changelog and it seems that MTU is now
> > handled in nexthop exceptions and one needs to produce the full
> > flow info to update it. I'm wonding if this does not hold true in
> > my code path as ip_gre rewraps the forwarded packet and creates new
> > IP header - when it next goes to the xfrm code (which sends the
> > ICMP error) the inner iphdr is no longer accessible. Would this
> > cause the breakage that I'm seeing? Or the forward flow's mtu still
> > updated somehow?
> 
> I have now a theory on what goes wrong.
> 
> My gre tunnel is configured with 'ttl 64' so the tunnel IP header
> always gets DF bit set to do proper path-mtu. The kind of locally
> generated ICMP messages I get, imply that re-fragmentation happens
> only on the tunnel's IPv4 header level - but it'll be too late then:
> the large packet is queued, IPsec'ed and it is the IPsec'ed packet
> that gets is tried to be fragmented (but it has DF set so it fails and
> packet is dropped).
> 
> I believe ip_gre should explicitly fragment the inner IPv4 and IPv6
> packets if the tunnel's ttl is not inherited (resulting in DF bit set
> in the tunnel's IPv4 header).
> 
> So basically ip_gre worked wrong all along - things just happened to
> work due to GRO/GSO not implemented in ip_gre, and the way (the now
> deleted) routing cache exposed pmtu.
> 
> Does this make sense?

Not really. Seems the fragmentation should happen already on the
earlier dst level. Though, this implies that GSO cannot be used in
ip_gre if ttl != inherit.

I added some ip_gre debugging and the following seems to happen:

- the mtu is calculated correctly on xmit path:
  dst_mtu(&rt->dst) = 1458 (the tunnel's XFRMed IPv4 path)

- skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL, skb, mtu);
  is called with mtu=1430, which seems correct

- dst_mtu(skb_dst(skb)) seems to still return after above call the
  value 1472 which is wrong. so update_pmtu is not working.

- skb->dev->ifindex implies skb->dev points to gre device when
  update_pmtu is being called (and not the ethX from which the packet
  was received), so ip_rt_update_pmtu() which eventually calls
  build_skb_flow_key() is likely using wrong ifindex for the flow


- Timo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: linux-3.6+, gre+ipsec+forwarding = IP fragmentation broken
  2013-03-15 11:38   ` Timo Teras
@ 2013-03-15 13:03     ` Timo Teras
       [not found]       ` <20130320101318.4196d93a@vostro>
  0 siblings, 1 reply; 6+ messages in thread
From: Timo Teras @ 2013-03-15 13:03 UTC (permalink / raw)
  To: netdev

On Fri, 15 Mar 2013 13:38:20 +0200
Timo Teras <timo.teras@iki.fi> wrote:

> [snip]
> Though, this implies that GSO cannot be used in ip_gre if ttl != inherit.
> 
> I added some ip_gre debugging and the following seems to happen:
> 
> - the mtu is calculated correctly on xmit path:
>   dst_mtu(&rt->dst) = 1458 (the tunnel's XFRMed IPv4 path)
> 
> - skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL, skb, mtu);
>   is called with mtu=1430, which seems correct
> 
> - dst_mtu(skb_dst(skb)) seems to still return after above call the
>   value 1472 which is wrong. so update_pmtu is not working.
> 
> - skb->dev->ifindex implies skb->dev points to gre device when
>   update_pmtu is being called (and not the ethX from which the packet
>   was received), so ip_rt_update_pmtu() which eventually calls
>   build_skb_flow_key() is likely using wrong ifindex for the flow

One more observation is that the outer dst is an input route. And it
seems that input routes do not consult nexthop exception tables for
setting the rt_pmtu. Thus rt_pmtu is never set for forwarding entries.

- Timo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [regression] [analyzed] fragmentation broken for tunnel devices
       [not found]       ` <20130320101318.4196d93a@vostro>
@ 2013-03-20 17:46         ` David Miller
  2013-05-01  6:46           ` Timo Teras
  0 siblings, 1 reply; 6+ messages in thread
From: David Miller @ 2013-03-20 17:46 UTC (permalink / raw)
  To: timo.teras; +Cc: netdev

From: Timo Teras <timo.teras@iki.fi>
Date: Wed, 20 Mar 2013 10:13:18 +0200

Thanks for investigating this issue.

> 3) Reimplement fragmentation in tunnel devices. This means some
> duplication of code. But now that there's GRO support in tunnels, this
> would seem the most performant option.

I think this is the best option, especially in the long term.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [regression] [analyzed] fragmentation broken for tunnel devices
  2013-03-20 17:46         ` [regression] [analyzed] fragmentation broken for tunnel devices David Miller
@ 2013-05-01  6:46           ` Timo Teras
  0 siblings, 0 replies; 6+ messages in thread
From: Timo Teras @ 2013-05-01  6:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Wed, 20 Mar 2013 13:46:40 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Timo Teras <timo.teras@iki.fi>
> Date: Wed, 20 Mar 2013 10:13:18 +0200
> 
> Thanks for investigating this issue.
> 
> > 3) Reimplement fragmentation in tunnel devices. This means some
> > duplication of code. But now that there's GRO support in tunnels,
> > this would seem the most performant option.
> 
> I think this is the best option, especially in the long term.

I've been thinking more. It seems there's going to be tricky cases with
this approach too. Looks like ICMP and other non-TCP/UDP packets are
not really sent as GSO/GRO packets. So for those we end up fragmenting
multiple times and end up sending double the amount of packets than
needed.

This would also mean that if we are ever to support new protocols, e.g.
MPLS, we'd get a whole new can of forms when needing fragmentation.

I'm wondering now if it'd be easier to just do the additional
nexthop consultation for forward routes where destination is a tunnel
device.

- Timo

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-05-01  6:45 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-13 15:14 linux-3.6+, gre+ipsec+forwarding = IP fragmentation broken Timo Teras
2013-03-15  9:25 ` Timo Teras
2013-03-15 11:38   ` Timo Teras
2013-03-15 13:03     ` Timo Teras
     [not found]       ` <20130320101318.4196d93a@vostro>
2013-03-20 17:46         ` [regression] [analyzed] fragmentation broken for tunnel devices David Miller
2013-05-01  6:46           ` Timo Teras

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).