[PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
@ 2016-08-09  7:04 wenxu
  2016-08-11  0:35 ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: wenxu @ 2016-08-09  7:04 UTC (permalink / raw)
  To: davem, kuznet, jmorris, kaber, yoshfuji, netdev, shmulik.ladkani,
	wenx05124561

From: wenxu <wenxu@ucloud.cn>

commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen
exceeds MTU, allow segmentation for local udp tunneled skbs")

Given:
     - tap0 and ovs-gre
     - ovs-gre stacked on eth0, eth0 having the small mtu

After encapsulation these skbs have skb_gso_network_seglen that exceed
eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than
eth0 mtu. These packets maybe dropped.

It has the same problem if tap0 bridge with ipgre or gretap device. So
the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs.

Signed-off-by: wenxu <wenxu@ucloud.cn>
---
 net/ipv4/ip_tunnel_core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 9d847c3..7f36ea2 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -73,8 +73,9 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
 	skb_dst_set(skb, &rt->dst);
 	memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
 
-	if (skb_iif && proto == IPPROTO_UDP) {
-		/* Arrived from an ingress interface and got udp encapuslated.
+	if (skb_iif && (proto == IPPROTO_UDP || proto == IPPROTO_GRE)) {
+		/* Arrived from an ingress interface and got udp or gre
+		 * encapsulated.
 		 * The encapsulated network segment length may exceed dst mtu.
 		 * Allow IP Fragmentation of segments.
 		 */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
  2016-08-09  7:04 [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs wenxu
@ 2016-08-11  0:35 ` David Miller
  2016-08-11 19:41   ` Shmulik Ladkani
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2016-08-11  0:35 UTC (permalink / raw)
  To: wenxu
  Cc: kuznet, jmorris, kaber, yoshfuji, netdev, shmulik.ladkani,
	wenx05124561, hannes

From: wenxu@ucloud.cn
Date: Tue,  9 Aug 2016 15:04:21 +0800

> From: wenxu <wenxu@ucloud.cn>
> 
> commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen
> exceeds MTU, allow segmentation for local udp tunneled skbs")
> 
> Given:
>      - tap0 and ovs-gre
>      - ovs-gre stacked on eth0, eth0 having the small mtu
> 
> After encapsulation these skbs have skb_gso_network_seglen that exceed
> eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than
> eth0 mtu. These packets maybe dropped.
> 
> It has the same problem if tap0 bridge with ipgre or gretap device. So
> the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs.
> 
> Signed-off-by: wenxu <wenxu@ucloud.cn>

I am rather certain that this test is intentionally restricted to
UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe.

Hannes and Shmulik?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
  2016-08-11  0:35 ` David Miller
@ 2016-08-11 19:41   ` Shmulik Ladkani
  2016-08-12  4:29     ` wenxu
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Shmulik Ladkani @ 2016-08-11 19:41 UTC (permalink / raw)
  To: David Miller
  Cc: wenxu, kuznet, jmorris, kaber, yoshfuji, netdev, wenx05124561, hannes

On Wed, 10 Aug 2016 17:35:11 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
> > commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen
> > exceeds MTU, allow segmentation for local udp tunneled skbs")
> > 
> > Given:
> >      - tap0 and ovs-gre
> >      - ovs-gre stacked on eth0, eth0 having the small mtu
> > 
> > After encapsulation these skbs have skb_gso_network_seglen that exceed
> > eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than
> > eth0 mtu. These packets maybe dropped.
> > 
> > It has the same problem if tap0 bridge with ipgre or gretap device. So
> > the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs.
> > 
> > Signed-off-by: wenxu <wenxu@ucloud.cn>  
> 
> I am rather certain that this test is intentionally restricted to
> UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe.
> 
> Hannes and Shmulik?

It was restricted to UDP tun encaps per Hannes' suggestion, in order to
scope the change, as Hannes hinted gre and ipip are pmtu-safe, see [1].

Glancing at ip_tunnel_xmit() --> tnl_update_pmtu(), they do seem to
handle PMTU, but - if I understand correctly - only in the !skb_is_gso
case.
(since 68c3316311 "v4 GRE: Add TCP segmentation offload for GRE")

This is probably due the same hidden assumption that 'gso_size' will be
suitable for the egress mtu even after encapsulation, which does not
hold true in some usecases as described in [2].

Therefore it might be that b8247f095edd needs to be augmented for
non-udp tunnels as well.

Hannes?

[1] http://www.spinics.net/lists/netdev/msg386447.html
[2] http://www.spinics.net/lists/netdev/msg385776.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
  2016-08-11 19:41   ` Shmulik Ladkani
@ 2016-08-12  4:29     ` wenxu
       [not found]     ` <80d116d7-e61c-1bbd-64bf-e3b1f809419b@ucloud.cn>
  2016-08-12 11:11     ` Hannes Frederic Sowa
  2 siblings, 0 replies; 11+ messages in thread
From: wenxu @ 2016-08-12  4:29 UTC (permalink / raw)
  To: wenx05124561, netdev

> On Wed, 10 Aug 2016 17:35:11 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>>> commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen
>>> exceeds MTU, allow segmentation for local udp tunneled skbs")
>>>
>>> Given:
>>>      - tap0 and ovs-gre
>>>      - ovs-gre stacked on eth0, eth0 having the small mtu
>>>
>>> After encapsulation these skbs have skb_gso_network_seglen that exceed
>>> eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than
>>> eth0 mtu. These packets maybe dropped.
>>>
>>> It has the same problem if tap0 bridge with ipgre or gretap device. So
>>> the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs.
>>>
>>> Signed-off-by: wenxu <wenxu@ucloud.cn>  
>> I am rather certain that this test is intentionally restricted to
>> UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe.
>>
>> Hannes and Shmulik?
> It was restricted to UDP tun encaps per Hannes' suggestion, in order to
> scope the change, as Hannes hinted gre and ipip are pmtu-safe, see [1].
>
> Glancing at ip_tunnel_xmit() --> tnl_update_pmtu(), they do seem to
> handle PMTU, but - if I understand correctly - only in the !skb_is_gso
> case.
> (since 68c3316311 "v4 GRE: Add TCP segmentation offload for GRE")
>
> This is probably due the same hidden assumption that 'gso_size' will be
> suitable for the egress mtu even after encapsulation, which does not
> hold true in some usecases as described in [2].
>
> Therefore it might be that b8247f095edd needs to be augmented for
> non-udp tunnels as well.
>
> Hannes?
>
> [1] http://www.spinics.net/lists/netdev/msg386447.html
> [2] http://www.spinics.net/lists/netdev/msg385776.html
I think non-udp tunnels should also be set
And in b8247f095edd, condition skb_iif also should be removed
given:
ovs-internal-port bridge with ovs-gre

There is the same problem that packets from local.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
       [not found]     ` <80d116d7-e61c-1bbd-64bf-e3b1f809419b@ucloud.cn>
@ 2016-08-12  5:18       ` Shmulik Ladkani
  0 siblings, 0 replies; 11+ messages in thread
From: Shmulik Ladkani @ 2016-08-12  5:18 UTC (permalink / raw)
  To: wenxu
  Cc: David Miller, kuznet, jmorris, kaber, yoshfuji, netdev,
	wenx05124561, hannes

Hi,

On Fri, 12 Aug 2016 11:51:07 +0800 wenxu <wenxu@ucloud.cn> wrote:
> 
> And in b8247f095edd, the condition skb_iif also should be removed.
> given:
>      ovs-internal-dev bridge with ovs-gre
> 
> There are the same problem which the skb from local.

There's no need to remove the skb_iif criteria:

For the local bridge port, we have control over its mtu. This allows
setting an mtu value that takes into account the gre encapsulation
performed by other member ports.
By doing so, locally generated traffic (on br0) will have a proper
gso_size.

OTOH if packet arrives from an ingress member port (such as tap0) the
packet's gso_size could have been set by another system - which is not
always under our control.

( see discussion in http://www.spinics.net/lists/netdev/msg385085.html )

Regards,
Shmulik

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
  2016-08-11 19:41   ` Shmulik Ladkani
  2016-08-12  4:29     ` wenxu
       [not found]     ` <80d116d7-e61c-1bbd-64bf-e3b1f809419b@ucloud.cn>
@ 2016-08-12 11:11     ` Hannes Frederic Sowa
  2016-08-15 11:16       ` Shmulik Ladkani
  2 siblings, 1 reply; 11+ messages in thread
From: Hannes Frederic Sowa @ 2016-08-12 11:11 UTC (permalink / raw)
  To: Shmulik Ladkani, David Miller
  Cc: wenxu, kuznet, jmorris, kaber, yoshfuji, netdev, wenx05124561

On 11.08.2016 21:41, Shmulik Ladkani wrote:
> On Wed, 10 Aug 2016 17:35:11 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>>> commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen
>>> exceeds MTU, allow segmentation for local udp tunneled skbs")
>>>
>>> Given:
>>>      - tap0 and ovs-gre
>>>      - ovs-gre stacked on eth0, eth0 having the small mtu
>>>
>>> After encapsulation these skbs have skb_gso_network_seglen that exceed
>>> eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than
>>> eth0 mtu. These packets maybe dropped.
>>>
>>> It has the same problem if tap0 bridge with ipgre or gretap device. So
>>> the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs.
>>>
>>> Signed-off-by: wenxu <wenxu@ucloud.cn>  
>>
>> I am rather certain that this test is intentionally restricted to
>> UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe.
>>
>> Hannes and Shmulik?
> 
> It was restricted to UDP tun encaps per Hannes' suggestion, in order to
> scope the change, as Hannes hinted gre and ipip are pmtu-safe, see [1].
> 
> Glancing at ip_tunnel_xmit() --> tnl_update_pmtu(), they do seem to
> handle PMTU, but - if I understand correctly - only in the !skb_is_gso
> case.
> (since 68c3316311 "v4 GRE: Add TCP segmentation offload for GRE")
> 
> This is probably due the same hidden assumption that 'gso_size' will be
> suitable for the egress mtu even after encapsulation, which does not
> hold true in some usecases as described in [2].
> 
> Therefore it might be that b8247f095edd needs to be augmented for
> non-udp tunnels as well.
> 
> Hannes?

I really would not like to see this expanded to gre and other protocols.
All switches drop packets where the packets are exceeding the MTU,
bridges and also openvswitch should behave the same.

Unfortunately we already had this loophole in the kernel that vxlan udp
output path could fragment the packet again, even in case of switches.
But this stopped working for GSO packets, which violates another rule in
the kernel, GSO should always be transparent and user space should never
have to care if a packet is GSO or not.

Because we couldn't a) roll back the change that we fragment packets in
UDP output paths and b) should not violate GSO transparency rule, I
strongly believed it would be better too only change the kernel in a way
that it transparently works with GSO, too. If we argue that a VTEP is
its own UDP endpoint which is set up after the bridge, I still can sleep
well. :)

My understanding was that GRE failed consistently, GSO as well as
non-GSO packets are dropped, which would be the correct behavior for me.
I don't want to change this. A good argument against this would be if we
violate the GSO transparency rule again. But when I looked into the code
I couldn't see that.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
  2016-08-12 11:11     ` Hannes Frederic Sowa
@ 2016-08-15 11:16       ` Shmulik Ladkani
  2016-08-16  7:12         ` wenxu
  2016-08-19  7:26         ` Shmulik Ladkani
  0 siblings, 2 replies; 11+ messages in thread
From: Shmulik Ladkani @ 2016-08-15 11:16 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, wenxu, kuznet, jmorris, kaber, yoshfuji, netdev,
	wenx05124561

Hi,

On Fri, 12 Aug 2016 13:11:50 +0200, hannes@stressinduktion.org wrote:
> I really would not like to see this expanded to gre and other protocols.
> All switches drop packets where the packets are exceeding the MTU,
> bridges and also openvswitch should behave the same.
> 
> Unfortunately we already had this loophole in the kernel that vxlan udp
> output path could fragment the packet again, even in case of switches.
> But this stopped working for GSO packets, which violates another rule in
> the kernel, GSO should always be transparent and user space should never
> have to care if a packet is GSO or not.
> 
> Because we couldn't a) roll back the change that we fragment packets in
> UDP output paths and b) should not violate GSO transparency rule, I
> strongly believed it would be better too only change the kernel in a way
> that it transparently works with GSO, too. If we argue that a VTEP is
> its own UDP endpoint which is set up after the bridge, I still can sleep
> well. :)
> 
> My understanding was that GRE failed consistently, GSO as well as
> non-GSO packets are dropped, which would be the correct behavior for me.
> I don't want to change this. A good argument against this would be if we
> violate the GSO transparency rule again. But when I looked into the code
> I couldn't see that.

I completely agree with your arguments.

I think we may run into the same GSO vs Non-GSO anomaly if one uses
a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the
encapsulating iphdr 'df' is derived from 'tun_flags&TUNNEL_DONT_FRAGMENT'
(e.g. in case DF is not set).

I suspect OvS's vport-gre does exactly that, so I assume this is the
reason why the change was suggested.

Maybe we can change our criteria in the following manner:
 
-	if (skb_iif && proto == IPPROTO_UDP) {
+	if (skb_iif && !(df & htons(IP_DF))) {
		IPCB(skb)->flags |= IPSKB_FRAG_SEGS;

This way, any tunnel explicitly providing DF is NOT allowed to
further fragment the resulting segments (leading to tx segments being
dropped).
Others tunnels, that do not care (e.g. vxlan and geneve, and probably
ovs vport-gre, or other ovs encap vports, in df_default=false mode),
will behave same for gso and non-gso.

WDYT? Am I missing something here?

Thanks,
Shmulik

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
  2016-08-15 11:16       ` Shmulik Ladkani
@ 2016-08-16  7:12         ` wenxu
  2016-08-19  7:26         ` Shmulik Ladkani
  1 sibling, 0 replies; 11+ messages in thread
From: wenxu @ 2016-08-16  7:12 UTC (permalink / raw)
  To: Shmulik Ladkani, Hannes Frederic Sowa
  Cc: David Miller, kuznet, jmorris, kaber, yoshfuji, netdev, wenx05124561

Hi,

> Hi,
>
> On Fri, 12 Aug 2016 13:11:50 +0200, hannes@stressinduktion.org wrote:
>> I really would not like to see this expanded to gre and other protocols.
>> All switches drop packets where the packets are exceeding the MTU,
>> bridges and also openvswitch should behave the same.
>>
>> Unfortunately we already had this loophole in the kernel that vxlan udp
>> output path could fragment the packet again, even in case of switches.
>> But this stopped working for GSO packets, which violates another rule in
>> the kernel, GSO should always be transparent and user space should never
>> have to care if a packet is GSO or not.
>>
>> Because we couldn't a) roll back the change that we fragment packets in
>> UDP output paths and b) should not violate GSO transparency rule, I
>> strongly believed it would be better too only change the kernel in a way
>> that it transparently works with GSO, too. If we argue that a VTEP is
>> its own UDP endpoint which is set up after the bridge, I still can sleep
>> well. :)
>>
>> My understanding was that GRE failed consistently, GSO as well as
>> non-GSO packets are dropped, which would be the correct behavior for me.
>> I don't want to change this. A good argument against this would be if we
>> violate the GSO transparency rule again. But when I looked into the code
>> I couldn't see that.
> I completely agree with your arguments.
>
> I think we may run into the same GSO vs Non-GSO anomaly if one uses
> a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the
> encapsulating iphdr 'df' is derived from 'tun_flags&TUNNEL_DONT_FRAGMENT'
> (e.g. in case DF is not set).
>
> I suspect OvS's vport-gre does exactly that, so I assume this is the
> reason why the change was suggested.
>
> Maybe we can change our criteria in the following manner:
>  
> -	if (skb_iif && proto == IPPROTO_UDP) {
> +	if (skb_iif && !(df & htons(IP_DF))) {
> 		IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
>
> This way, any tunnel explicitly providing DF is NOT allowed to
> further fragment the resulting segments (leading to tx segments being
> dropped).
> Others tunnels, that do not care (e.g. vxlan and geneve, and probably
> ovs vport-gre, or other ovs encap vports, in df_default=false mode),
> will behave same for gso and non-gso.
>
> WDYT? Am I missing something here?
>
> Thanks,
> Shmulik

I think the criteria 'skb_iif && !(df & htons(IP_DF)' is suitable.   'nopmtudisc' tunnel and ovs-gre tunnel can clear the DF through df_default=false.
In this situation both gso(final segment) or no-gso packet can be fragment(if the packet size is more than mtu).

Thanks
Wenxu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
  2016-08-15 11:16       ` Shmulik Ladkani
  2016-08-16  7:12         ` wenxu
@ 2016-08-19  7:26         ` Shmulik Ladkani
  2016-08-19  9:20           ` Hannes Frederic Sowa
  1 sibling, 1 reply; 11+ messages in thread
From: Shmulik Ladkani @ 2016-08-19  7:26 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, wenxu, kuznet, jmorris, kaber, yoshfuji, netdev,
	wenx05124561

On Mon, 15 Aug 2016 14:16:39 +0300 Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> On Fri, 12 Aug 2016 13:11:50 +0200, hannes@stressinduktion.org wrote:
> > I really would not like to see this expanded to gre and other protocols.
> > All switches drop packets where the packets are exceeding the MTU,
> > bridges and also openvswitch should behave the same.
> > 
> > Unfortunately we already had this loophole in the kernel that vxlan udp
> > output path could fragment the packet again, even in case of switches.
> > But this stopped working for GSO packets, which violates another rule in
> > the kernel, GSO should always be transparent and user space should never
> > have to care if a packet is GSO or not.
> > 
> > Because we couldn't a) roll back the change that we fragment packets in
> > UDP output paths and b) should not violate GSO transparency rule, I
> > strongly believed it would be better too only change the kernel in a way
> > that it transparently works with GSO, too. If we argue that a VTEP is
> > its own UDP endpoint which is set up after the bridge, I still can sleep
> > well. :)
> > 
> > My understanding was that GRE failed consistently, GSO as well as
> > non-GSO packets are dropped, which would be the correct behavior for me.
> > I don't want to change this. A good argument against this would be if we
> > violate the GSO transparency rule again. But when I looked into the code
> > I couldn't see that.  
> 
> I completely agree with your arguments.
> 
> I think we may run into the same GSO vs Non-GSO anomaly if one uses
> a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the
> encapsulating iphdr 'df' is derived from 'tun_flags&TUNNEL_DONT_FRAGMENT'
> (e.g. in case DF is not set).
> 
> I suspect OvS's vport-gre does exactly that, so I assume this is the
> reason why the change was suggested.
> 
> Maybe we can change our criteria in the following manner:
>  
> -	if (skb_iif && proto == IPPROTO_UDP) {
> +	if (skb_iif && !(df & htons(IP_DF))) {
> 		IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
> 
> This way, any tunnel explicitly providing DF is NOT allowed to
> further fragment the resulting segments (leading to tx segments being
> dropped).
> Others tunnels, that do not care (e.g. vxlan and geneve, and probably
> ovs vport-gre, or other ovs encap vports, in df_default=false mode),
> will behave same for gso and non-gso.
> 
> WDYT? Am I missing something here?
> 

ping..

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
  2016-08-19  7:26         ` Shmulik Ladkani
@ 2016-08-19  9:20           ` Hannes Frederic Sowa
  2016-08-19 13:40             ` Shmulik Ladkani
  0 siblings, 1 reply; 11+ messages in thread
From: Hannes Frederic Sowa @ 2016-08-19  9:20 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: David Miller, wenxu, kuznet, jmorris, kaber, yoshfuji, netdev,
	wenx05124561

On 19.08.2016 09:26, Shmulik Ladkani wrote:
> On Mon, 15 Aug 2016 14:16:39 +0300 Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
>> On Fri, 12 Aug 2016 13:11:50 +0200, hannes@stressinduktion.org wrote:
>>> I really would not like to see this expanded to gre and other protocols.
>>> All switches drop packets where the packets are exceeding the MTU,
>>> bridges and also openvswitch should behave the same.
>>>
>>> Unfortunately we already had this loophole in the kernel that vxlan udp
>>> output path could fragment the packet again, even in case of switches.
>>> But this stopped working for GSO packets, which violates another rule in
>>> the kernel, GSO should always be transparent and user space should never
>>> have to care if a packet is GSO or not.
>>>
>>> Because we couldn't a) roll back the change that we fragment packets in
>>> UDP output paths and b) should not violate GSO transparency rule, I
>>> strongly believed it would be better too only change the kernel in a way
>>> that it transparently works with GSO, too. If we argue that a VTEP is
>>> its own UDP endpoint which is set up after the bridge, I still can sleep
>>> well. :)
>>>
>>> My understanding was that GRE failed consistently, GSO as well as
>>> non-GSO packets are dropped, which would be the correct behavior for me.
>>> I don't want to change this. A good argument against this would be if we
>>> violate the GSO transparency rule again. But when I looked into the code
>>> I couldn't see that.  
>>
>> I completely agree with your arguments.
>>
>> I think we may run into the same GSO vs Non-GSO anomaly if one uses
>> a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the
>> encapsulating iphdr 'df' is derived from 'tun_flags&TUNNEL_DONT_FRAGMENT'
>> (e.g. in case DF is not set).
>>
>> I suspect OvS's vport-gre does exactly that, so I assume this is the
>> reason why the change was suggested.
>>
>> Maybe we can change our criteria in the following manner:
>>  
>> -	if (skb_iif && proto == IPPROTO_UDP) {
>> +	if (skb_iif && !(df & htons(IP_DF))) {
>> 		IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
>>
>> This way, any tunnel explicitly providing DF is NOT allowed to
>> further fragment the resulting segments (leading to tx segments being
>> dropped).
>> Others tunnels, that do not care (e.g. vxlan and geneve, and probably
>> ovs vport-gre, or other ovs encap vports, in df_default=false mode),
>> will behave same for gso and non-gso.
>>
>> WDYT? Am I missing something here?
>>
> 
> ping..

I am really not sure...

Probably we have no other choice. Bridges caring about df bit set is
rather unusual. I wonder if it might not be more sensitive to actually
add a sysctl for that or make it depending on some per-tunnel
configuration which can be updated with netlink?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
  2016-08-19  9:20           ` Hannes Frederic Sowa
@ 2016-08-19 13:40             ` Shmulik Ladkani
  0 siblings, 0 replies; 11+ messages in thread
From: Shmulik Ladkani @ 2016-08-19 13:40 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, wenxu, kuznet, jmorris, kaber, yoshfuji, netdev,
	wenx05124561

Hi,

On Fri, 19 Aug 2016 11:20:40 +0200 Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> >> Maybe we can change our criteria in the following manner:
> >>  
> >> -	if (skb_iif && proto == IPPROTO_UDP) {
> >> +	if (skb_iif && !(df & htons(IP_DF))) {
> >> 		IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
> >>
> >> This way, any tunnel explicitly providing DF is NOT allowed to
> >> further fragment the resulting segments (leading to tx segments being
> >> dropped).
> >> Others tunnels, that do not care (e.g. vxlan and geneve, and probably
> >> ovs vport-gre, or other ovs encap vports, in df_default=false mode),
> >> will behave same for gso and non-gso.
> >>
> 
> I am really not sure...
> 
> Probably we have no other choice.

Further diving into this, seems the !IP_DF approach is more correct 
then the IPPROTO_UDP approach (WRT packets/segments arriving from other
interface, that exceed egress mtu):

vxlan/geneve:
  Both set df to zero.
  !IP_DF approach acts same as IPPROTO_UDP approach

vxlan/geneve in collect_md (e.g. OvS):
  They set df according to tun_flags & TUNNEL_DONT_FRAGMENT.
  IPPROTO_UDP approach:
    IPSKB_FRAG_SEGS gets set unconditionally.
    In case TUNNEL_DONT_FRAGMENT requested, non-gso get dropped
    due to IPSTATS_MIB_FRAGFAILS, whereas gso gets segmented+fragmented (!)
  !IP_DF approach:
    Aligned, both non-gso and gso gets dropped for TUNNEL_DONT_FRAGMENT.

ip_gre in collect_md (e.g. OvS):
  Sets df according to tun_flags & TUNNEL_DONT_FRAGMENT.
  IPPROTO_UDP approach:
    IPSKB_FRAG_SEGS is never set.
    Therefore in the case were df is NOT set, non-gso are fragged and
    passed, whereas gso gets dropped (!)
  !IP_DF approach:
    Non-gso vs gso aligned.

ip_gre in nopmtudisc:
  Will pass tnl_update_pmtu checks; Then, df inherrited from inner_iph
  (or stays unset if IFLA_GRE_IGNORE_DF specified).
  IPPROTO_UDP approach:
    IPSKB_FRAG_SEGS never set.
    Therefore in the case were df is NOT set, non-gso are fragged and
    passed, whereas gso gets dropped (!)
  !IP_DF approach:
    Aligned.
    
ip_gre in fou/gue mode in nopmtudisc:
  Assuming they pass tnl_update_pmtu checks; Then, df inherrited from
  inner_iph (or stays unset if IFLA_GRE_IGNORE_DF specified).
  IPPROTO_UDP approach:
    IPSKB_FRAG_SEGS gets always set (since proto==IPPROTO_UDP).
    In the case df is set, non-gso dropped by IPSTATS_MIB_FRAGFAILS,
    whereas gso gets segmented+fragmented (!)
  !IP_DF approach: 
    Aligned.

ip_gre in pmtudisc:
  Sets df to IP_DF.
  Non-gso will fail tnl_update_pmtu checks (gso should pass).
  IPPROTO_UDP approach:
    IPSKB_FRAG_SEGS never set. This leads the gso skbs to be eventually
    dropped. okay.
  !IP_DF approach:
    IPSKB_FRAG_SEGS not set, since IP_DF is true.
    This leads to gso skbs to be eventually dropped. okay.

(truely appreciate if you can review my above analysis)

Therefore using !(df & htons(IP_DF)) actually fixes some oversights of
our former proto==IPPROTO_UDP approach.

I'll send a patch.

Thanks
Shmulik

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-08-19 13:40 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-09  7:04 [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs wenxu
2016-08-11  0:35 ` David Miller
2016-08-11 19:41   ` Shmulik Ladkani
2016-08-12  4:29     ` wenxu
     [not found]     ` <80d116d7-e61c-1bbd-64bf-e3b1f809419b@ucloud.cn>
2016-08-12  5:18       ` Shmulik Ladkani
2016-08-12 11:11     ` Hannes Frederic Sowa
2016-08-15 11:16       ` Shmulik Ladkani
2016-08-16  7:12         ` wenxu
2016-08-19  7:26         ` Shmulik Ladkani
2016-08-19  9:20           ` Hannes Frederic Sowa
2016-08-19 13:40             ` Shmulik Ladkani

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.