netdev.vger.kernel.org archive mirror
* ip_forward_use_pmtu and forwarding to xfrm'ed gre
@ 2015-07-08 13:30 Timo Teras
  2015-07-08 15:52 ` Hannes Frederic Sowa
From: Timo Teras @ 2015-07-08 13:30 UTC
  To: Hannes Frederic Sowa, netdev

Hi,

The ip_forward_use_pmtu commit log says:
    Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
    won't be set and further fragmentation logic will use the path mtu
    to determine the fragmentation size. They also recheck packet size
    with help of path mtu discovery and report appropriate errors.

But this does not seem to be true in all paths. For example, I'm
forwarding from ethX -> greX (with gre having ttl 64; and thus
setting DF on tunnel always) and then gre output is finally IPsec
encrypted. But fragmentation does not work. Setting ip_forward_use_pmtu
makes it work again. tcpdump says the packet is fragmented based on the
greX device mtu, not the path mtu in this case.
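
A minimal sketch of the behaviour described above, not the kernel code:
it models which MTU a forwarded packet is fragmented to, depending on
the sysctl. The function name and the example values (1476 for the greX
device mtu, 1400 for the learned path mtu) are assumptions made up for
illustration.

```python
def forwarding_mtu(dev_mtu: int, path_mtu: int, use_pmtu: bool) -> int:
    """Simplified model of the MTU used when fragmenting a forwarded packet.

    dev_mtu  -- MTU configured on the output device (here: greX)
    path_mtu -- path MTU stored for the flow (assumed <= dev_mtu)
    use_pmtu -- value of the ip_forward_use_pmtu sysctl
    """
    if use_pmtu:
        # Path MTU information is trusted while forwarding.
        return path_mtu
    # Default: path MTU is ignored for forwarded packets.
    return dev_mtu


# Hypothetical numbers: gre device mtu 1476, learned path mtu 1400.
print(forwarding_mtu(1476, 1400, use_pmtu=False))
print(forwarding_mtu(1476, 1400, use_pmtu=True))
```

With the sysctl off the model returns the device mtu, which matches
what tcpdump shows in the failing case.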

This is probably due to the way xfrm and gre work together. On the
first packet, the gre tunnel driver updates the pmtu for the inner flow,
which is expected to be honored always. And if the 'ttl' value is set
for the gre tunnel, no re-fragmentation is allowed, as the inner flow
should know better. This does have the side effect that if the very
first packet is large, it'll be dropped to 'learn' the pmtu.

It's probably not possible to detect this kind of target easily, as
xfrm may or may not be applied, even on a per-inner-target-IP basis (as
the tunnel destination IP can be dynamic for nbma tunnels).

So I wonder if the ip_gre driver can work around this somehow, e.g. by
refragmenting if necessary. Or should we just update the sysctl's
help text to say that this is another scenario where it needs to be
turned on?

Thanks,
Timo
 


* Re: ip_forward_use_pmtu and forwarding to xfrm'ed gre
  2015-07-08 13:30 ip_forward_use_pmtu and forwarding to xfrm'ed gre Timo Teras
@ 2015-07-08 15:52 ` Hannes Frederic Sowa
  2015-07-08 16:17   ` Timo Teras
From: Hannes Frederic Sowa @ 2015-07-08 15:52 UTC
  To: Timo Teras, netdev

Hello,

On Wed, 2015-07-08 at 16:30 +0300, Timo Teras wrote:
> Hi,
> 
> It seems ip_forward_use_pmtu commit log says:
>     Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
>     won't be set and further fragmentation logic will use the path mtu
>     to determine the fragmentation size. They also recheck packet size
>     with help of path mtu discovery and report appropriate errors.
> 
> But this does not seem to be true in all paths. For example, I'm
> forwarding from ethX -> greX (with gre having ttl 64; and thus
> setting DF on tunnel always) and then gre output is finally IPsec
> encrypted. But fragmentation does not work. Setting ip_forward_use_pmtu
> makes it work again. tcpdump says the packet is fragmented based on the
> greX device mtu, not the path mtu in this case.
> 
> This is probably due to the way xfrm and gre work together. On the
> first packet, the gre tunnel driver updates the pmtu for the inner flow,
> which is expected to be honored always. And if the 'ttl' value is set
> for the gre tunnel, no re-fragmentation is allowed, as the inner flow
> should know better. This does have the side effect that if the very
> first packet is large, it'll be dropped to 'learn' the pmtu.
>
> It's probably not possible to detect this kind of target easily, as
> xfrm may or may not be applied, even on a per-inner-target-IP basis (as
> the tunnel destination IP can be dynamic for nbma tunnels).

I am currently not sure whether we have actually resolved the xfrm path
by the time we enter ip_forward; I thought we had. In that case we
should be able to take skb_dst->dst->path->header_len and subtract it
from the mtu before using it to fragment the packets. I hope it is that
easy... :)

I would actually avoid telling anyone to enable using the path mtu
information in forwarding ever again.


> So I wonder if the ip_gre driver can work around this somehow, e.g. by
> refragmenting if necessary. Or should we just update the sysctl's
> help text to say that this is another scenario where it needs to be
> turned on?

If the above idea does not work, we could simply add an option to the gre
driver to set skb->ignore_df, but I don't like that much.

Bye,
Hannes


* Re: ip_forward_use_pmtu and forwarding to xfrm'ed gre
  2015-07-08 15:52 ` Hannes Frederic Sowa
@ 2015-07-08 16:17   ` Timo Teras
  2015-07-08 17:39     ` Hannes Frederic Sowa
From: Timo Teras @ 2015-07-08 16:17 UTC
  To: Hannes Frederic Sowa; +Cc: netdev

On Wed, 08 Jul 2015 17:52:32 +0200
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:

> On Wed, 2015-07-08 at 16:30 +0300, Timo Teras wrote:
> > This is probably due to the way xfrm and gre work together. On the
> > first packet, the gre tunnel driver updates the pmtu for the inner
> > flow, which is expected to be honored always. And if the 'ttl' value
> > is set for the gre tunnel, no re-fragmentation is allowed, as the
> > inner flow should know better. This does have the side effect that
> > if the very first packet is large, it'll be dropped to 'learn' the
> > pmtu.
> >
> > It's probably not possible to detect this kind of target easily, as
> > xfrm may or may not be applied, even on a per-inner-target-IP basis
> > (as the tunnel destination IP can be dynamic for nbma tunnels).
>
> I am currently not sure whether we have actually resolved the xfrm
> path by the time we enter ip_forward; I thought we had. In that case
> we should be able to take skb_dst->dst->path->header_len and subtract
> it from the mtu before using it to fragment the packets. I hope it is
> that easy... :)

It is not. The inner skb just knows that it's going from ethX -> greX.
And that's what contains the path MTU, and that's what ip_forward will
use.

Only in gre_xmit is it resolved where the tunnel packet goes, and the
xfrm is resolved only then. Thus update_pmtu works fully internally here.

> I would actually avoid telling anyone to enable using the path mtu
> information in forwarding ever again.

The problem here is that the pmtu framework is used internally to relay
the trusted stacking pmtu in addition to the pmtu learned from the wire.

> > So I wonder if the ip_gre driver can work around this somehow, e.g.
> > by refragmenting if necessary. Or should we just update the sysctl's
> > help text to say that this is another scenario where it needs to be
> > turned on?
>
> If the above idea does not work, we could simply add an option to the
> gre driver to set skb->ignore_df, but I don't like that much.

This is not acceptable. The gre driver has two operating modes, DF and
non-DF (selected by the 'ttl inherit' or 'ttl <number>' option at
tunnel creation). The DF mode intentionally sets DF on all tunnel
packets so the pmtu is learned and relayed up the stack. In non-DF mode
the tunnel packet's DF bit is derived from the encapsulated packet.

Basically this info could be used. If the target is gre1 in DF mode, we
should be trusting the pmtu. Otherwise the existing internal mechanism
breaks.
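
As a rough illustration of the two modes described above (a simplified
model, not the ip_gre code), the outer DF bit could be pictured like
this. The function and parameter names are made up for the example.

```python
def outer_df(tunnel_df_mode: bool, inner_df: bool) -> bool:
    """Return whether the tunnel (outer) packet carries the DF bit.

    tunnel_df_mode -- True if the tunnel operates in DF mode
    inner_df       -- DF bit of the encapsulated (inner) packet
    """
    if tunnel_df_mode:
        # DF mode: DF is set on all tunnel packets so that the pmtu is
        # learned and relayed up the stack.
        return True
    # Non-DF mode: DF is derived from the encapsulated packet.
    return inner_df


# In DF mode even a non-DF inner packet yields a DF outer packet.
print(outer_df(tunnel_df_mode=True, inner_df=False))
print(outer_df(tunnel_df_mode=False, inner_df=False))
```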

Thoughts?


* Re: ip_forward_use_pmtu and forwarding to xfrm'ed gre
  2015-07-08 16:17   ` Timo Teras
@ 2015-07-08 17:39     ` Hannes Frederic Sowa
  2015-07-08 18:51       ` Timo Teras
From: Hannes Frederic Sowa @ 2015-07-08 17:39 UTC
  To: Timo Teras; +Cc: netdev

Hello,

On Wed, 2015-07-08 at 19:17 +0300, Timo Teras wrote:
> On Wed, 08 Jul 2015 17:52:32 +0200
> Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> 
> > On Wed, 2015-07-08 at 16:30 +0300, Timo Teras wrote:
> > > This is probably due to the way xfrm and gre work together. On the
> > > first packet, the gre tunnel driver updates the pmtu for the inner
> > > flow, which is expected to be honored always. And if the 'ttl'
> > > value is set for the gre tunnel, no re-fragmentation is allowed, as
> > > the inner flow should know better. This does have the side effect
> > > that if the very first packet is large, it'll be dropped to 'learn'
> > > the pmtu.
> > >
> > > It's probably not possible to detect this kind of target easily,
> > > as xfrm may or may not be applied, even on a per-inner-target-IP
> > > basis (as the tunnel destination IP can be dynamic for nbma
> > > tunnels).
> >
> > I am currently not sure whether we have actually resolved the xfrm
> > path by the time we enter ip_forward; I thought we had. In that case
> > we should be able to take skb_dst->dst->path->header_len and
> > subtract it from the mtu before using it to fragment the packets. I
> > hope it is that easy... :)
> 
> It is not. The inner skb just knows that it's going from ethX -> greX.
> And that's what contains the path MTU, and that's what ip_forward will
> use.
> 
> Only in gre_xmit is it resolved where the tunnel packet goes, and the
> xfrm is resolved only then. Thus update_pmtu works fully internally here.

Oh, yes, sorry, gre is not xfrm and doesn't propagate the information
towards the first routing lookup.

> > I would actually avoid telling anyone to enable using the path mtu
> > information in forwarding ever again.
> 
> The problem here is that the pmtu framework is used internally to relay
> the trusted stacking pmtu in addition to the pmtu learned from the wire.

Yes, and it is not easy to propagate this trusted state across all the
different mtu storage locations we have (metrics, fnhe, etc...). I don't
know if it is worth the effort.

> > > So I wonder if the ip_gre driver can work around this somehow,
> > > e.g. by refragmenting if necessary. Or should we just update the
> > > sysctl's help text to say that this is another scenario where it
> > > needs to be turned on?
> >
> > If the above idea does not work, we could simply add an option to the
> > gre driver to set skb->ignore_df, but I don't like that much.
>
> This is not acceptable. The gre driver has two operating modes, DF and
> non-DF (selected by the 'ttl inherit' or 'ttl <number>' option at
> tunnel creation). The DF mode intentionally sets DF on all tunnel
> packets so the pmtu is learned and relayed up the stack. In non-DF mode
> the tunnel packet's DF bit is derived from the encapsulated packet.
>
> Basically this info could be used. If the target is gre1 in DF mode,
> we should be trusting the pmtu. Otherwise the existing internal
> mechanism breaks.
> 
> Thoughts?

At least we know which interface the packet would leave. Should we
override this behavior on a per-interface basis?

(Although I am in favor of admins just correcting the mtu by hand and
documenting this as you proposed earlier. I really don't know if it is
worth the effort to propagate that information.)

Thanks,
Hannes


* Re: ip_forward_use_pmtu and forwarding to xfrm'ed gre
  2015-07-08 17:39     ` Hannes Frederic Sowa
@ 2015-07-08 18:51       ` Timo Teras
  2015-07-09 14:48         ` Timo Teras
From: Timo Teras @ 2015-07-08 18:51 UTC
  To: Hannes Frederic Sowa; +Cc: netdev

Hi,

On Wed, 08 Jul 2015 19:39:58 +0200
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:

> On Wed, 2015-07-08 at 19:17 +0300, Timo Teras wrote:
> > On Wed, 08 Jul 2015 17:52:32 +0200
> > Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> > 
> > > On Wed, 2015-07-08 at 16:30 +0300, Timo Teras wrote:
> > > If the above idea does not work, we could simply add an option to
> > > the gre driver to set skb->ignore_df, but I don't like that much.
> >
> > This is not acceptable. The gre driver has two operating modes, DF
> > and non-DF (selected by the 'ttl inherit' or 'ttl <number>' option
> > at tunnel creation). The DF mode intentionally sets DF on all tunnel
> > packets so the pmtu is learned and relayed up the stack. In non-DF
> > mode the tunnel packet's DF bit is derived from the encapsulated
> > packet.
> >
> > Basically this info could be used. If the target is gre1 in DF
> > mode, we should be trusting the pmtu. Otherwise the existing
> > internal mechanism breaks.
> > 
> > Thoughts?
> 
> At least we know which interface the packet would leave. Should we
> override this behavior on a per-interface basis?

That would be one option. We could also make the exception just for GRE
interfaces in DF mode. Or use some sort of per-interface flag that is
set internally by the driver.

> (Although I am in favor of admins just correcting the mtu by hand and
> documenting this as you proposed earlier. I really don't know if it is
> worth the effort to propagate that information.)

The problem with GRE + DF is that the internal packet is no-DF and
potentially fragmented, but the tunneled packets (fragments) do have DF
set. So no interim router can defrag+frag if needed. This means pmtu
must be honored. And on nbma gre tunnels (target tunnel address depends
on encapsulated packet's address) the path mtu needs to be propagated
back to the sender.
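
To make the arithmetic concrete, here is a small worked sketch (a model
only, not kernel code). It fragments an inner IPv4 packet to a given
mtu, adds the tunnel overhead, and shows the resulting outer packet
sizes. The 24-byte overhead is the outer IPv4 header (20) plus the base
GRE header (4); IPsec overhead is left out, and the 1400-byte path mtu
is an assumed example value.

```python
def fragment_lengths(total_len: int, mtu: int, ihl: int = 20) -> list[int]:
    """Total lengths of the IPv4 fragments of one packet for a given mtu."""
    if total_len <= mtu:
        return [total_len]
    # Every fragment except the last carries a payload that is a
    # multiple of 8 bytes.
    max_payload = (mtu - ihl) // 8 * 8
    if max_payload <= 0:
        raise ValueError("mtu too small to carry any payload")
    payload = total_len - ihl
    lengths = []
    while payload > 0:
        chunk = min(payload, max_payload)
        lengths.append(ihl + chunk)
        payload -= chunk
    return lengths


def outer_lengths(inner: list[int], overhead: int) -> list[int]:
    """Sizes of the tunnel packets after encapsulating each fragment."""
    return [n + overhead for n in inner]


OVERHEAD = 24    # outer IPv4 (20) + base GRE (4); IPsec not included
PATH_MTU = 1400  # assumed path mtu of the tunnel packets

# Fragmenting by the gre device mtu (1500 - 24 = 1476):
by_dev = outer_lengths(fragment_lengths(1500, 1476), OVERHEAD)
# Fragmenting by the inner path mtu (1400 - 24 = 1376):
by_pmtu = outer_lengths(fragment_lengths(1500, 1376), OVERHEAD)

print(by_dev, all(n <= PATH_MTU for n in by_dev))
print(by_pmtu, all(n <= PATH_MTU for n in by_pmtu))
```

In this model, fragmenting by the device mtu produces a 1500-byte
tunnel packet that exceeds the assumed path mtu while carrying DF,
whereas fragmenting by the inner path mtu keeps every tunnel packet
within it.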

In this configuration ip_forward_use_pmtu needs to be enabled (system
wide, per-interface config, or implicitly by interface flags). Or
alternatively the trusted pmtu needs to be propagated via an alternate
mechanism. But that might be quite tricky to implement.

I believe other tunnels have a similar mechanism. E.g. ipip tunnels seem
to share the same DF vs. non-DF mode based on the 'ttl' setting.


* Re: ip_forward_use_pmtu and forwarding to xfrm'ed gre
  2015-07-08 18:51       ` Timo Teras
@ 2015-07-09 14:48         ` Timo Teras
From: Timo Teras @ 2015-07-09 14:48 UTC
  To: Hannes Frederic Sowa; +Cc: netdev

Hello,

On Wed, 8 Jul 2015 21:51:56 +0300
Timo Teras <timo.teras@iki.fi> wrote:

> On Wed, 08 Jul 2015 19:39:58 +0200
> Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> > 
> > At least we know which interface the packet would leave. Should we
> > override this behavior on a per-interface basis?
> 
> That would be one option. We could also make the exception just for
> GRE interfaces in DF mode. Or use some sort of per-interface flag
> that is set internally by the driver.
>
> > (Although I am in favor of admins just correcting the mtu by hand
> > and documenting this as you proposed earlier. I really don't know
> > if it is worth the effort to propagate that information.)
> 
> The problem with GRE + DF is that the internal packet is no-DF and
> potentially fragmented, but the tunneled packets (fragments) do have
> DF set. So no interim router can defrag+frag if needed. This means
> pmtu must be honored. And on nbma gre tunnels (target tunnel address
> depends on encapsulated packet's address) the path mtu needs to be
> propagated back to the sender.
> 
> In this configuration ip_forward_use_pmtu needs to be enabled (system
> wide, per-interface config, or implicitly by interface flags). Or
> alternatively the trusted pmtu needs to be propagated via an alternate
> mechanism. But that might be quite tricky to implement.
>
> I believe other tunnels have a similar mechanism. E.g. ipip tunnels
> seem to share the same DF vs. non-DF mode based on the 'ttl' setting.

Thinking more on this, the issue happens even without XFRM. XFRM just
makes it happen pretty much immediately.

Basically, tunnel devices with the 'pmtudisc' flag set (the default for
many, and especially if the 'ttl' parameter is used) always have the
issue. As the tunnel header always has DF set, the routers handling
tunnel packets cannot really re-fragment.

I believe we should have a flag describing this functionality, and
enable ip_forward_use_pmtu for those interfaces. IFF_XMIT_DST_RELEASE is
a prerequisite to update the inner flow pmtu, so it might be used as a
hint. Though, it's set even if 'nopmtudisc' was configured. So I
suppose this would deserve its own flag. Grepping for drivers calling
netif_keep_dst() gives a list of drivers which need to be checked if
they rely on this behaviour.
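
A sketch of what such a per-interface flag could mean for the
forwarding decision. This only models the proposal and is not existing
kernel behaviour: the `trusts_pmtu` flag, the `Device` type and the
function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Minimal stand-in for an output interface (hypothetical model)."""
    name: str
    mtu: int
    # Hypothetical flag a tunnel driver would set when it relays a
    # trusted pmtu to the inner flow (e.g. a tunnel in DF mode).
    trusts_pmtu: bool = False


def forwarding_mtu_for(dev: Device, path_mtu: int, sysctl_use_pmtu: bool) -> int:
    """MTU used to fragment a forwarded packet under the proposed rule."""
    if sysctl_use_pmtu or dev.trusts_pmtu:
        # System-wide sysctl or per-interface flag: honor the path mtu.
        return path_mtu
    return dev.mtu


gre1 = Device("gre1", mtu=1476, trusts_pmtu=True)  # tunnel in DF mode
eth1 = Device("eth1", mtu=1500)                    # plain interface

# Assumed path mtu of 1400 for both flows, sysctl left at its default.
print(forwarding_mtu_for(gre1, 1400, sysctl_use_pmtu=False))
print(forwarding_mtu_for(eth1, 1400, sysctl_use_pmtu=False))
```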

Thanks,
Timo

