* [PATCH net] gso: do GSO for local skb with size bigger than MTU
@ 2014-11-28  6:33 Fan Du
  2014-11-28  7:02 ` Jason Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Fan Du @ 2014-11-28  6:33 UTC (permalink / raw)
  To: netdev; +Cc: davem, fw, Fan Du

Test scenario: two KVM guests sitting on different
hosts communicate with each other over a vxlan tunnel.

All interfaces use the default MTU of 1500 bytes. From
the guest's point of view, its skb gso_size can be as
big as 1448 bytes; however, after the guest skb goes
through vxlan encapsulation, the length of individual
segments of a gso packet can exceed the physical NIC
MTU of 1500, and such packets are lost at the receiver
side.

So it is possible in a virtualized environment for the
length of a locally created skb to exceed the underlying
MTU after encapsulation. In such a case it is reasonable
to do GSO first, then fragment any segment bigger than
the MTU where possible.
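
For example, with TCP timestamps on (the Linux default): a gso_size of
1448 plus 32 bytes of TCP header and 20 bytes of IP header fills the
inner packet to exactly 1500 bytes; vxlan encapsulation then adds
14 (inner Ethernet) + 8 (VXLAN) + 8 (UDP) + 20 (outer IP) = 50 bytes,
so each segment leaves the host at 1550 bytes, over the physical MTU.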

+---------------+ TX     RX +---------------+
|   KVM Guest   | -> ... -> |   KVM Guest   |
+-+-----------+-+           +-+-----------+-+
  |Qemu/VirtIO|               |Qemu/VirtIO|
  +-----------+               +-----------+
       |                            |
       v tap0                  tap0 v
  +-----------+               +-----------+
  | ovs bridge|               | ovs bridge|
  +-----------+               +-----------+
       | vxlan                vxlan |
       v                            v
  +-----------+               +-----------+
  |    NIC    |    <------>   |    NIC    |
  +-----------+               +-----------+

Steps to reproduce:
 1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
 2. Run iperf without -M; the connection gets stuck.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 net/ipv4/ip_output.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index bc6471d..558b5f8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 	struct sk_buff *segs;
 	int ret = 0;
 
-	/* common case: locally created skb or seglen is <= mtu */
-	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
-	      skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
+	/* Both locally created skb and forwarded skb could exceed
+	 * MTU size, so make a unified rule for them all.
+	 */
+	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
 		return ip_finish_output2(skb);
 
 	/* Slowpath -  GSO segment length is exceeding the dst MTU.
-- 
1.7.1
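
For context, the slow path right below the quoted hunk looks roughly
like this in the ip_output.c of this era (a paraphrase, not the exact
code; details vary between kernel versions): the skb is
software-segmented, and each resulting segment is pushed through
ip_fragment() individually.

	/* Slowpath -  GSO segment length is exceeding the dst MTU. */
	features = netif_skb_features(skb);
	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
	if (IS_ERR_OR_NULL(segs)) {
		kfree_skb(skb);
		return -ENOMEM;
	}

	consume_skb(skb);

	do {
		struct sk_buff *nskb = segs->next;
		int err;

		segs->next = NULL;
		/* ip_fragment() fragments the segment if allowed, or
		 * sends ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED when DF is set.
		 */
		err = ip_fragment(segs, ip_finish_output2);
		if (err && ret == 0)
			ret = err;
		segs = nskb;
	} while (segs);

	return ret;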


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-11-28  6:33 [PATCH net] gso: do GSO for local skb with size bigger than MTU Fan Du
@ 2014-11-28  7:02 ` Jason Wang
  2014-11-30 10:08   ` Du, Fan
  2014-11-30 10:26 ` Florian Westphal
  2014-12-03  3:23 ` David Miller
  2 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2014-11-28  7:02 UTC (permalink / raw)
  To: Fan Du; +Cc: netdev, davem, fw, Fan Du



On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> Test scenario: two KVM guests sitting on different
> hosts communicate with each other over a vxlan tunnel.
> 
> All interfaces use the default MTU of 1500 bytes. From
> the guest's point of view, its skb gso_size can be as
> big as 1448 bytes; however, after the guest skb goes
> through vxlan encapsulation, the length of individual
> segments of a gso packet can exceed the physical NIC
> MTU of 1500, and such packets are lost at the receiver
> side.
> 
> So it is possible in a virtualized environment for the
> length of a locally created skb to exceed the underlying
> MTU after encapsulation. In such a case it is reasonable
> to do GSO first, then fragment any segment bigger than
> the MTU where possible.
> 
> +---------------+ TX     RX +---------------+
> |   KVM Guest   | -> ... -> |   KVM Guest   |
> +-+-----------+-+           +-+-----------+-+
>   |Qemu/VirtIO|               |Qemu/VirtIO|
>   +-----------+               +-----------+
>        |                            |
>        v tap0                  tap0 v
>   +-----------+               +-----------+
>   | ovs bridge|               | ovs bridge|
>   +-----------+               +-----------+
>        | vxlan                vxlan |
>        v                            v
>   +-----------+               +-----------+
>   |    NIC    |    <------>   |    NIC    |
>   +-----------+               +-----------+
> 
> Steps to reproduce:
>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
>  2. Run iperf without -M; the connection gets stuck.

Is this issue specific to ovs or ipv4? Path MTU discovery should
help in this case I believe.
> 
> 
> Signed-off-by: Fan Du <fan.du@intel.com>
> ---
>  net/ipv4/ip_output.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index bc6471d..558b5f8 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff 
> *skb)
>  	struct sk_buff *segs;
>  	int ret = 0;
>  
> -	/* common case: locally created skb or seglen is <= mtu */
> -	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> -	      skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> +	/* Both locally created skb and forwarded skb could exceed
> +	 * MTU size, so make a unified rule for them all.
> +	 */
> +	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>  		return ip_finish_output2(skb);
>  
>  	/* Slowpath -  GSO segment length is exceeding the dst MTU.
> -- 
> 1.7.1
> 


* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-11-28  7:02 ` Jason Wang
@ 2014-11-30 10:08   ` Du, Fan
  2014-12-01 13:52     ` Thomas Graf
  2014-12-02 15:44     ` Flavio Leitner
  0 siblings, 2 replies; 57+ messages in thread
From: Du, Fan @ 2014-11-30 10:08 UTC (permalink / raw)
  To: 'Jason Wang'; +Cc: netdev, davem, fw, Du, Fan



>-----Original Message-----
>From: Jason Wang [mailto:jasowang@redhat.com]
>Sent: Friday, November 28, 2014 3:02 PM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>
>
>On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> Test scenario: two KVM guests sitting on different hosts communicate
>> with each other over a vxlan tunnel.
>>
>> All interfaces use the default MTU of 1500 bytes. From the guest's
>> point of view, its skb gso_size can be as big as 1448 bytes; however,
>> after the guest skb goes through vxlan encapsulation, the length of
>> individual segments of a gso packet can exceed the physical NIC MTU
>> of 1500, and such packets are lost at the receiver side.
>>
>> So it is possible in a virtualized environment for the length of a
>> locally created skb to exceed the underlying MTU after encapsulation.
>> In such a case it is reasonable to do GSO first, then fragment any
>> segment bigger than the MTU where possible.
>>
>> +---------------+ TX     RX +---------------+
>> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> +-+-----------+-+           +-+-----------+-+
>>   |Qemu/VirtIO|               |Qemu/VirtIO|
>>   +-----------+               +-----------+
>>        |                            |
>>        v tap0                  tap0 v
>>   +-----------+               +-----------+
>>   | ovs bridge|               | ovs bridge|
>>   +-----------+               +-----------+
>>        | vxlan                vxlan |
>>        v                            v
>>   +-----------+               +-----------+
>>   |    NIC    |    <------>   |    NIC    |
>>   +-----------+               +-----------+
>>
>> Steps to reproduce:
>>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
>>  2. Run iperf without -M; the connection gets stuck.
>
>Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
>believe.

The problem here is that the host stack pushes a locally created over-sized gso skb
down to the NIC and performs GSO there, without any further ip segmentation.

The reasonable behavior is to do gso first at the ip level; if the gso-ed skb is
bigger than the MTU and DF is set, then push an ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
message back to the sender to adjust its mtu.

Making PMTU work is another issue that I will try to address later on.

>>
>>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>> ---
>>  net/ipv4/ip_output.c |    7 ++++---
>>  1 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
>> bc6471d..558b5f8 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
>> *skb)
>>  	struct sk_buff *segs;
>>  	int ret = 0;
>>
>> -	/* common case: locally created skb or seglen is <= mtu */
>> -	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> -	      skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> +	/* Both locally created skb and forwarded skb could exceed
>> +	 * MTU size, so make a unified rule for them all.
>> +	 */
>> +	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>>  		return ip_finish_output2(skb);
>>
>>  	/* Slowpath -  GSO segment length is exceeding the dst MTU.
>> --
>> 1.7.1
>>



* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-11-28  6:33 [PATCH net] gso: do GSO for local skb with size bigger than MTU Fan Du
  2014-11-28  7:02 ` Jason Wang
@ 2014-11-30 10:26 ` Florian Westphal
  2014-11-30 10:55   ` Du, Fan
  2014-12-03  3:23 ` David Miller
  2 siblings, 1 reply; 57+ messages in thread
From: Florian Westphal @ 2014-11-30 10:26 UTC (permalink / raw)
  To: Fan Du; +Cc: netdev, davem, fw

Fan Du <fan.du@intel.com> wrote:
> Test scenario: two KVM guests sitting on different
> hosts communicate with each other over a vxlan tunnel.
> 
> All interfaces use the default MTU of 1500 bytes. From
> the guest's point of view, its skb gso_size can be as
> big as 1448 bytes; however, after the guest skb goes
> through vxlan encapsulation, the length of individual
> segments of a gso packet can exceed the physical NIC
> MTU of 1500, and such packets are lost at the receiver
> side.
> 
> So it is possible in a virtualized environment for the
> length of a locally created skb to exceed the underlying
> MTU after encapsulation. In such a case it is reasonable
> to do GSO first, then fragment any segment bigger than
> the MTU where possible.
> 
> +---------------+ TX     RX +---------------+
> |   KVM Guest   | -> ... -> |   KVM Guest   |
> +-+-----------+-+           +-+-----------+-+
>   |Qemu/VirtIO|               |Qemu/VirtIO|
>   +-----------+               +-----------+
>        |                            |
>        v tap0                  tap0 v
>   +-----------+               +-----------+
>   | ovs bridge|               | ovs bridge|
>   +-----------+               +-----------+
>        | vxlan                vxlan |
>        v                            v
>   +-----------+               +-----------+
>   |    NIC    |    <------>   |    NIC    |
>   +-----------+               +-----------+
> 
> Steps to reproduce:
>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
>  2. Run iperf without -M; the connection gets stuck.

Hmm, do we really want to support bridges containing interfaces with
different MTUs?

It seems to me the only clean solution is to set the tap0 MTU so that it
accounts for the bridge encap overhead.
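
(With VXLAN over IPv4 and standard headers that would mean, for example,
tap0 MTU = 1500 - (14 + 8 + 8 + 20) = 1450 bytes, i.e. something like
"ip link set dev tap0 mtu 1450" on the host.)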


* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-11-30 10:26 ` Florian Westphal
@ 2014-11-30 10:55   ` Du, Fan
  2014-11-30 15:11     ` Florian Westphal
  0 siblings, 1 reply; 57+ messages in thread
From: Du, Fan @ 2014-11-30 10:55 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, davem, Du, Fan



>-----Original Message-----
>From: Florian Westphal [mailto:fw@strlen.de]
>Sent: Sunday, November 30, 2014 6:27 PM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>Fan Du <fan.du@intel.com> wrote:
>> Test scenario: two KVM guests sitting on different hosts communicate
>> with each other over a vxlan tunnel.
>>
>> All interfaces use the default MTU of 1500 bytes. From the guest's
>> point of view, its skb gso_size can be as big as 1448 bytes; however,
>> after the guest skb goes through vxlan encapsulation, the length of
>> individual segments of a gso packet can exceed the physical NIC MTU
>> of 1500, and such packets are lost at the receiver side.
>>
>> So it is possible in a virtualized environment for the length of a
>> locally created skb to exceed the underlying MTU after encapsulation.
>> In such a case it is reasonable to do GSO first, then fragment any
>> segment bigger than the MTU where possible.
>>
>> +---------------+ TX     RX +---------------+
>> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> +-+-----------+-+           +-+-----------+-+
>>   |Qemu/VirtIO|               |Qemu/VirtIO|
>>   +-----------+               +-----------+
>>        |                            |
>>        v tap0                  tap0 v
>>   +-----------+               +-----------+
>>   | ovs bridge|               | ovs bridge|
>>   +-----------+               +-----------+
>>        | vxlan                vxlan |
>>        v                            v
>>   +-----------+               +-----------+
>>   |    NIC    |    <------>   |    NIC    |
>>   +-----------+               +-----------+
>>
>> Steps to reproduce:
>>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
>>  2. Run iperf without -M; the connection gets stuck.
>
>Hmm, do we really want to support bridges containing interfaces with different
>MTUs?

All interface MTUs in the test scenario are the default, 1500.

>It seems to me the only clean solution is to set the tap0 MTU so that it accounts
>for the bridge encap overhead.

This would force _ALL_ deployed instances to have their tap0 MTU changed in every cloud env.

The current behavior pushes an over-mtu-sized packet down to the NIC, which should
not happen anyway. And as I put it in another thread:
perform GSO on the skb, then try to do ip segmentation if possible. If DF is set,
send back an ICMP message. If DF is not set, the user apparently wants the stack
to do ip segmentation, and all the GSO-ed segments will be sent out correctly as
expected.


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-11-30 10:55   ` Du, Fan
@ 2014-11-30 15:11     ` Florian Westphal
  2014-12-01  6:47       ` Du, Fan
  0 siblings, 1 reply; 57+ messages in thread
From: Florian Westphal @ 2014-11-30 15:11 UTC (permalink / raw)
  To: Du, Fan; +Cc: Florian Westphal, netdev, davem

Du, Fan <fan.du@intel.com> wrote:
> All interface MTU in the test scenario is the default one, 1500.

Not really, unless I misunderstand the setup.

You have an L2 network where part of the machines are connected by an
L2 tunnel.

All machines within that network ought to assume that the MTU is equal for
all machines within the same L2 network.

> >It seems to me the only clean solution is to set the tap0 MTU so that it accounts
> >for the bridge encap overhead.
> 
> This would force _ALL_ deployed instances to have their tap0 MTU changed in every cloud env.

Yes; alternatively, employ routing, then PMTU should work.

> The current behavior pushes an over-mtu-sized packet down to the NIC, which should
> not happen anyway. And as I put it in another thread:
> perform GSO on the skb, then try to do ip segmentation if possible. If DF is set,
> send back an ICMP message. If DF is not set, the user apparently wants the stack
> to do ip segmentation, and all the GSO-ed segments will be sent out correctly as
> expected.

Well, the linux bridge implementation (especially bridge netfilter)
has allowed for a lot of layering violations, and this has usually
caused a myriad of followup kludges to make one more scenario work.

I still think that trying to make this work is a bad idea.
If hosts have different MTUs they should be in different L2 networks.

Alternatively, the tunneling implementation should be opaque and do the
needed fragmentation to provide the illusion of identical MTUs.

That said, I don't see anything wrong with the patch per se, I just
dislike the concept.


* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-11-30 15:11     ` Florian Westphal
@ 2014-12-01  6:47       ` Du, Fan
  0 siblings, 0 replies; 57+ messages in thread
From: Du, Fan @ 2014-12-01  6:47 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, davem, Du, Fan



>-----Original Message-----
>From: Florian Westphal [mailto:fw@strlen.de]
>Sent: Sunday, November 30, 2014 11:11 PM
>To: Du, Fan
>Cc: Florian Westphal; netdev@vger.kernel.org; davem@davemloft.net
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>Du, Fan <fan.du@intel.com> wrote:
>> All interface MTUs in the test scenario are the default, 1500.
>
>Not really, unless I misunderstand the setup.
>
>You have an L2 network where part of the machines are connected by an
>L2 tunnel.
>
>All machines within that network ought to assume that the MTU is equal for all
>machines within the same L2 network.

On what assumption do you base the idea that the test scenario uses different MTUs???
I think all your conclusions come from the MTU configuration. As a matter of fact,
like I stated before, ALL interface MTUs are the default 1500.

Let me elaborate on this typical (not kludged!) env a bit more:

Without a vxlan tunnel, a typical standard env:
Guest -> Qemu/VirtIO -> tap0 -> linux bridge -> NIC

No tunneling trick here, no MTU change; packets come, packets go, naked...

With a vxlan tunnel, almost the same topology as before, and really no need to change
any MTU to make the env below work:
Guest -> Qemu/VirtIO -> tap0 -> ovs bridge -> vxlan tunnel -> NIC
                                              ^^^^^^^^^^^^
                                              ^^^^^^^^^^^^
Encapsulation puts the outer L2/L3/L4/VXLAN clothes in front of the guest frame.
An over-MTU-sized packet is created locally *by default* when the guest first tries
a big MSS, *BEFORE* PMTU is able to make the guest sense that this MSS is too big.
Guess what, this over-MTU-sized packet is lost. That is the problem: not any
different MTU configuration, but that the code here rules out (based on what
fact???) any such existing possibility.

Anyway, setting up such an env to see what is really going on inside the stack is
quite easy. You could even use docker to give it a try; it has the same effect as a
KVM guest, but is easier to deploy.

Docker instance -> docker0 bridge -> vethA -> vethB -> ovs-br0 -> vxlan -> NIC

If you have any doubts about the env, please let me know.


>> >It seems to me the only clean solution is to set the tap0 MTU so that it
>> >accounts for the bridge encap overhead.
>>
>> This would force _ALL_ deployed instances to have their tap0 MTU changed in
>> every cloud env.
>
>Yes; alternatively, employ routing, then PMTU should work.
>
>> The current behavior pushes an over-mtu-sized packet down to the NIC, which
>> should not happen anyway. And as I put it in another thread:
>> perform GSO on the skb, then try to do ip segmentation if possible. If DF is
>> set, send back an ICMP message. If DF is not set, the user apparently wants
>> the stack to do ip segmentation, and all the GSO-ed segments will be sent out
>> correctly as expected.
>
>Well, the linux bridge implementation (especially bridge netfilter) has allowed
>for a lot of layering violations, and this has usually caused a myriad of followup
>kludges to make one more scenario work.
>
>I still think that trying to make this work is a bad idea.
>If hosts have different MTUs they should be in different L2 networks.
>
>Alternatively, the tunneling implementation should be opaque and do the needed
>fragmentation to provide the illusion of identical MTUs.
>
>That said, I don't see anything wrong with the patch per se, I just dislike the
>concept.


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-11-30 10:08   ` Du, Fan
@ 2014-12-01 13:52     ` Thomas Graf
       [not found]       ` <20141201135225.GA16814-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  2014-12-02 15:44     ` Flavio Leitner
  1 sibling, 1 reply; 57+ messages in thread
From: Thomas Graf @ 2014-12-01 13:52 UTC (permalink / raw)
  To: Du, Fan; +Cc: 'Jason Wang', netdev, davem, fw, dev, mst, jesse, pshelar

On 11/30/14 at 10:08am, Du, Fan wrote:
> >-----Original Message-----
> >From: Jason Wang [mailto:jasowang@redhat.com]
> >Sent: Friday, November 28, 2014 3:02 PM
> >To: Du, Fan
> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> >> Test scenario: two KVM guests sitting on different hosts communicate
> >> with each other over a vxlan tunnel.
> >>
> >> All interfaces use the default MTU of 1500 bytes. From the guest's
> >> point of view, its skb gso_size can be as big as 1448 bytes; however,
> >> after the guest skb goes through vxlan encapsulation, the length of
> >> individual segments of a gso packet can exceed the physical NIC MTU
> >> of 1500, and such packets are lost at the receiver side.
> >>
> >> So it is possible in a virtualized environment for the length of a
> >> locally created skb to exceed the underlying MTU after encapsulation.
> >> In such a case it is reasonable to do GSO first, then fragment any
> >> segment bigger than the MTU where possible.
> >>
> >> +---------------+ TX     RX +---------------+
> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> >> +-+-----------+-+           +-+-----------+-+
> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> >>   +-----------+               +-----------+
> >>        |                            |
> >>        v tap0                  tap0 v
> >>   +-----------+               +-----------+
> >>   | ovs bridge|               | ovs bridge|
> >>   +-----------+               +-----------+
> >>        | vxlan                vxlan |
> >>        v                            v
> >>   +-----------+               +-----------+
> >>   |    NIC    |    <------>   |    NIC    |
> >>   +-----------+               +-----------+
> >>
> >> Steps to reproduce:
> >>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
> >>  2. Run iperf without -M; the connection gets stuck.
> >
> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> >believe.
> 
> The problem here is that the host stack pushes a locally created over-sized gso
> skb down to the NIC and performs GSO there, without any further ip segmentation.
> 
> The reasonable behavior is to do gso first at the ip level; if the gso-ed skb is
> bigger than the MTU and DF is set, then push an ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
> message back to the sender to adjust its mtu.

Aside from this, I think Virtio should provide an MTU hint to the guest
to adjust the MTU in the vNIC, to transparently account for both encap
overhead and support for jumbo frames in the underlay without relying
on PMTU or MSS hints. I remember we talked about this a while ago with
at least Michael, but no actual code work has been done on it yet.
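
Something along these lines on the guest side, as a purely hypothetical
sketch (no such feature exists today; the feature bit and the mtu config
field are invented here for illustration):

	#define VIRTIO_NET_F_MTU_HINT	25	/* hypothetical feature bit */

	static void virtnet_apply_mtu_hint(struct virtio_device *vdev,
					   struct net_device *dev)
	{
		if (!virtio_has_feature(vdev, VIRTIO_NET_F_MTU_HINT))
			return;

		/* Host advertises the usable MTU (1500 minus encap overhead,
		 * or a jumbo value) in config space; assume a new trailing
		 * 'mtu' field in struct virtio_net_config.
		 */
		dev->mtu = virtio_cread16(vdev,
					  offsetof(struct virtio_net_config, mtu));
	}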

> Making PMTU work is another issue that I will try to address later on.

PMTU discovery was explicitly removed from the OVS datapath. Maybe
Pravin or Jesse can provide some background on that.


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]       ` <20141201135225.GA16814-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-12-01 15:06         ` Michael S. Tsirkin
  2014-12-02 15:48         ` Flavio Leitner
  1 sibling, 0 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2014-12-01 15:06 UTC (permalink / raw)
  To: Thomas Graf
Cc: dev@openvswitch.org, netdev@vger.kernel.org, 'Jason Wang',
	Du, Fan, fw@strlen.de, davem@davemloft.net

On Mon, Dec 01, 2014 at 01:52:25PM +0000, Thomas Graf wrote:
> On 11/30/14 at 10:08am, Du, Fan wrote:
> > >-----Original Message-----
> > >From: Jason Wang [mailto:jasowang@redhat.com]
> > >Sent: Friday, November 28, 2014 3:02 PM
> > >To: Du, Fan
> > >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> > >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> > >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> > >> Test scenario: two KVM guests sitting on different hosts communicate
> > >> with each other over a vxlan tunnel.
> > >>
> > >> All interfaces use the default MTU of 1500 bytes. From the guest's
> > >> point of view, its skb gso_size can be as big as 1448 bytes; however,
> > >> after the guest skb goes through vxlan encapsulation, the length of
> > >> individual segments of a gso packet can exceed the physical NIC MTU
> > >> of 1500, and such packets are lost at the receiver side.
> > >>
> > >> So it is possible in a virtualized environment for the length of a
> > >> locally created skb to exceed the underlying MTU after encapsulation.
> > >> In such a case it is reasonable to do GSO first, then fragment any
> > >> segment bigger than the MTU where possible.
> > >>
> > >> +---------------+ TX     RX +---------------+
> > >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> > >> +-+-----------+-+           +-+-----------+-+
> > >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> > >>   +-----------+               +-----------+
> > >>        |                            |
> > >>        v tap0                  tap0 v
> > >>   +-----------+               +-----------+
> > >>   | ovs bridge|               | ovs bridge|
> > >>   +-----------+               +-----------+
> > >>        | vxlan                vxlan |
> > >>        v                            v
> > >>   +-----------+               +-----------+
> > >>   |    NIC    |    <------>   |    NIC    |
> > >>   +-----------+               +-----------+
> > >>
> > >> Steps to reproduce:
> > >>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
> > >>  2. Run iperf without -M; the connection gets stuck.
> > >
> > >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> > >believe.
> > 
> > The problem here is that the host stack pushes a locally created over-sized gso
> > skb down to the NIC and performs GSO there, without any further ip segmentation.
> > 
> > The reasonable behavior is to do gso first at the ip level; if the gso-ed skb
> > is bigger than the MTU and DF is set, then push an
> > ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to the sender to adjust its mtu.

Sounds about right.

> Aside from this, I think Virtio should provide an MTU hint to the guest
> to adjust the MTU in the vNIC, to transparently account for both encap
> overhead and support for jumbo frames in the underlay without relying
> on PMTU or MSS hints. I remember we talked about this a while ago with
> at least Michael, but no actual code work has been done on it yet.

This can be an optimization but can't replace fixing PMTU.
In particular this can't help legacy guests.

> > Making PMTU work is another issue that I will try to address later on.
> 
> PMTU discovery was explicitly removed from the OVS datapath. Maybe
> Pravin or Jesse can provide some background on that.

PMTU was never working properly for OVS users.  There was an ugly hack that kind
of worked part of the time.  That was dropped, and good riddance.
Proper PMTU support for all kinds of tunneling solutions, e.g. along the
lines you outline above, belongs in the host core.

-- 
MST


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-11-30 10:08   ` Du, Fan
  2014-12-01 13:52     ` Thomas Graf
@ 2014-12-02 15:44     ` Flavio Leitner
  2014-12-02 18:06       ` Jesse Gross
  1 sibling, 1 reply; 57+ messages in thread
From: Flavio Leitner @ 2014-12-02 15:44 UTC (permalink / raw)
  To: Du, Fan; +Cc: 'Jason Wang', netdev, davem, fw

On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
> 
> 
> >-----Original Message-----
> >From: Jason Wang [mailto:jasowang@redhat.com]
> >Sent: Friday, November 28, 2014 3:02 PM
> >To: Du, Fan
> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> >
> >
> >
> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> >> Test scenario: two KVM guests sitting on different hosts communicate
> >> with each other over a vxlan tunnel.
> >>
> >> All interfaces use the default MTU of 1500 bytes. From the guest's
> >> point of view, its skb gso_size can be as big as 1448 bytes; however,
> >> after the guest skb goes through vxlan encapsulation, the length of
> >> individual segments of a gso packet can exceed the physical NIC MTU
> >> of 1500, and such packets are lost at the receiver side.
> >>
> >> So it is possible in a virtualized environment for the length of a
> >> locally created skb to exceed the underlying MTU after encapsulation.
> >> In such a case it is reasonable to do GSO first, then fragment any
> >> segment bigger than the MTU where possible.
> >>
> >> +---------------+ TX     RX +---------------+
> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> >> +-+-----------+-+           +-+-----------+-+
> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> >>   +-----------+               +-----------+
> >>        |                            |
> >>        v tap0                  tap0 v
> >>   +-----------+               +-----------+
> >>   | ovs bridge|               | ovs bridge|
> >>   +-----------+               +-----------+
> >>        | vxlan                vxlan |
> >>        v                            v
> >>   +-----------+               +-----------+
> >>   |    NIC    |    <------>   |    NIC    |
> >>   +-----------+               +-----------+
> >>
> >> Steps to reproduce:
> >>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
> >>  2. Run iperf without -M; the connection gets stuck.
> >
> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> >believe.
> 
> The problem here is that the host stack pushes a locally created over-sized gso
> skb down to the NIC and performs GSO there, without any further ip segmentation.
> 
> The reasonable behavior is to do gso first at the ip level; if the gso-ed skb is
> bigger than the MTU and DF is set, then push an ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
> message back to the sender to adjust its mtu.
> 
> Making PMTU work is another issue that I will try to address later on.
> 
> >>
> >>
> >> Signed-off-by: Fan Du <fan.du@intel.com>
> >> ---
> >>  net/ipv4/ip_output.c |    7 ++++---
> >>  1 files changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
> >> bc6471d..558b5f8 100644
> >> --- a/net/ipv4/ip_output.c
> >> +++ b/net/ipv4/ip_output.c
> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
> >> *skb)
> >>  	struct sk_buff *segs;
> >>  	int ret = 0;
> >>
> >> -	/* common case: locally created skb or seglen is <= mtu */
> >> -	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> >> -	      skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >> +	/* Both locally created skb and forwarded skb could exceed
> >> +	 * MTU size, so make a unified rule for them all.
> >> +	 */
> >> +	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >>  		return ip_finish_output2(skb);


Are you using the kernel's vxlan device or openvswitch's vxlan device?

Because for the kernel's vxlan devices the MTU accounts for the header
overhead, so I believe your patch would work.  However, the MTU is
not visible for ovs's vxlan devices, so that wouldn't work.

fbl


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]       ` <20141201135225.GA16814-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  2014-12-01 15:06         ` Michael S. Tsirkin
@ 2014-12-02 15:48         ` Flavio Leitner
  2014-12-02 17:09           ` Thomas Graf
  1 sibling, 1 reply; 57+ messages in thread
From: Flavio Leitner @ 2014-12-02 15:48 UTC (permalink / raw)
  To: Thomas Graf
Cc: dev@openvswitch.org, mst@redhat.com,
	netdev@vger.kernel.org, 'Jason Wang',
	Du, Fan, fw@strlen.de, davem@davemloft.net

On Mon, Dec 01, 2014 at 01:52:25PM +0000, Thomas Graf wrote:
> On 11/30/14 at 10:08am, Du, Fan wrote:
> > >-----Original Message-----
> > >From: Jason Wang [mailto:jasowang@redhat.com]
> > >Sent: Friday, November 28, 2014 3:02 PM
> > >To: Du, Fan
> > >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> > >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> > >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> > >> Test scenario: two KVM guests sitting on different hosts communicate
> > >> with each other over a vxlan tunnel.
> > >>
> > >> All interfaces use the default MTU of 1500 bytes. From the guest's
> > >> point of view, its skb gso_size can be as big as 1448 bytes; however,
> > >> after the guest skb goes through vxlan encapsulation, the length of
> > >> individual segments of a gso packet can exceed the physical NIC MTU
> > >> of 1500, and such packets are lost at the receiver side.
> > >>
> > >> So it is possible in a virtualized environment for the length of a
> > >> locally created skb to exceed the underlying MTU after encapsulation.
> > >> In such a case it is reasonable to do GSO first, then fragment any
> > >> segment bigger than the MTU where possible.
> > >>
> > >> +---------------+ TX     RX +---------------+
> > >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> > >> +-+-----------+-+           +-+-----------+-+
> > >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> > >>   +-----------+               +-----------+
> > >>        |                            |
> > >>        v tap0                  tap0 v
> > >>   +-----------+               +-----------+
> > >>   | ovs bridge|               | ovs bridge|
> > >>   +-----------+               +-----------+
> > >>        | vxlan                vxlan |
> > >>        v                            v
> > >>   +-----------+               +-----------+
> > >>   |    NIC    |    <------>   |    NIC    |
> > >>   +-----------+               +-----------+
> > >>
> > >> Steps to reproduce:
> > >>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
> > >>  2. Run iperf without -M; the connection gets stuck.
> > >
> > >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> > >believe.
> > 
> > The problem here is that the host stack pushes a locally created over-sized gso
> > skb down to the NIC and performs GSO there, without any further ip segmentation.
> > 
> > The reasonable behavior is to do gso first at the ip level; if the gso-ed skb
> > is bigger than the MTU and DF is set, then push an
> > ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to the sender to adjust its mtu.
> 
> Aside from this, I think Virtio should provide an MTU hint to the guest
> to adjust the MTU in the vNIC, to transparently account for both encap
> overhead and support for jumbo frames in the underlay without relying
> on PMTU or MSS hints. I remember we talked about this a while ago with
> at least Michael, but no actual code work has been done on it yet.

What about containers or any other virtualization environment that
doesn't use Virtio?

fbl


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-02 15:48         ` Flavio Leitner
@ 2014-12-02 17:09           ` Thomas Graf
       [not found]             ` <20141202170927.GA9457-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Thomas Graf @ 2014-12-02 17:09 UTC (permalink / raw)
  To: Du, Fan, 'Jason Wang',
	netdev, davem, fw, dev, mst, jesse, pshelar

On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> What about containers or any other virtualization environment that
> doesn't use Virtio?

The host can dictate the MTU in that case for both veth and OVS
internal ports, which would be the primary container plumbing techniques.

ivshmem would need its own fix.


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]             ` <20141202170927.GA9457-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-12-02 17:34               ` Michael S. Tsirkin
  2014-12-02 17:41                 ` Thomas Graf
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2014-12-02 17:34 UTC (permalink / raw)
  To: Thomas Graf
Cc: dev@openvswitch.org, netdev@vger.kernel.org, 'Jason Wang',
	Du, Fan, fw@strlen.de, davem@davemloft.net

On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> > What about containers or any other virtualization environment that
> > doesn't use Virtio?
> 
> The host can dictate the MTU in that case for both veth and OVS
> internal ports, which would be the primary container plumbing techniques.

It typically can't do this easily for VMs with emulated devices:
real ethernet uses a fixed MTU.

IMHO it's confusing to suggest MTU as a fix for this bug; it's
an unrelated optimization.
ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.


> ivshmem would need its own fix.


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-02 17:34               ` Michael S. Tsirkin
@ 2014-12-02 17:41                 ` Thomas Graf
  2014-12-02 18:12                   ` Jesse Gross
  2014-12-03  2:31                   ` Du, Fan
  0 siblings, 2 replies; 57+ messages in thread
From: Thomas Graf @ 2014-12-02 17:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Du, Fan, 'Jason Wang', netdev, davem, fw, dev, jesse, pshelar

On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> > > What about containers or any other virtualization environment that
> > > doesn't use Virtio?
> > 
> > The host can dictate the MTU in that case for both veth and OVS
> > internal ports, which would be the primary container plumbing techniques.
> 
> It typically can't do this easily for VMs with emulated devices:
> real ethernet uses a fixed MTU.
> 
> IMHO it's confusing to suggest MTU as a fix for this bug; it's
> an unrelated optimization.
> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.

PMTU discovery only resolves the issue if an actual IP stack is
running inside the VM. This may not be the case at all.

I agree that exposing an MTU towards the guest is not applicable
in all situations, in particular because it is difficult to decide
what MTU to expose. It is a relatively elegant solution in a lot
of virtualization host cases hooked up to an orchestration system
though.


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-02 15:44     ` Flavio Leitner
@ 2014-12-02 18:06       ` Jesse Gross
  2014-12-02 21:32         ` Flavio Leitner
  0 siblings, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2014-12-02 18:06 UTC (permalink / raw)
  To: Du, Fan, Jason Wang, netdev, davem, fw

On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
> On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
>>
>>
>> >-----Original Message-----
>> >From: Jason Wang [mailto:jasowang@redhat.com]
>> >Sent: Friday, November 28, 2014 3:02 PM
>> >To: Du, Fan
>> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
>> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>> >
>> >
>> >
>> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> >> Test scenario: two KVM guests sitting on different hosts communicate
>> >> with each other over a vxlan tunnel.
>> >>
>> >> All interfaces use the default MTU of 1500 bytes. From the guest's
>> >> point of view, its skb gso_size can be as big as 1448 bytes; however,
>> >> after the guest skb goes through vxlan encapsulation, the length of
>> >> individual segments of a gso packet can exceed the physical NIC MTU
>> >> of 1500, and such packets are lost at the receiver side.
>> >>
>> >> So it is possible in a virtualized environment for the length of a
>> >> locally created skb to exceed the underlying MTU after encapsulation.
>> >> In such a case it is reasonable to do GSO first, then fragment any
>> >> segment bigger than the MTU where possible.
>> >>
>> >> +---------------+ TX     RX +---------------+
>> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> >> +-+-----------+-+           +-+-----------+-+
>> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
>> >>   +-----------+               +-----------+
>> >>        |                            |
>> >>        v tap0                  tap0 v
>> >>   +-----------+               +-----------+
>> >>   | ovs bridge|               | ovs bridge|
>> >>   +-----------+               +-----------+
>> >>        | vxlan                vxlan |
>> >>        v                            v
>> >>   +-----------+               +-----------+
>> >>   |    NIC    |    <------>   |    NIC    |
>> >>   +-----------+               +-----------+
>> >>
>> >> Steps to reproduce:
>> >>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
>> >>  2. Run iperf without -M; the connection gets stuck.
>> >
>> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
>> >believe.
>>
>> The problem here is that the host stack pushes a locally created over-sized gso
>> skb down to the NIC and performs GSO there, without any further ip segmentation.
>>
>> The reasonable behavior is to do gso first at the ip level; if the gso-ed skb is
>> bigger than the MTU and DF is set, then push an ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
>> message back to the sender to adjust its mtu.
>>
>> Making PMTU work is another issue that I will try to address later on.
>>
>> >>
>> >>
>> >> Signed-off-by: Fan Du <fan.du@intel.com>
>> >> ---
>> >>  net/ipv4/ip_output.c |    7 ++++---
>> >>  1 files changed, 4 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
>> >> bc6471d..558b5f8 100644
>> >> --- a/net/ipv4/ip_output.c
>> >> +++ b/net/ipv4/ip_output.c
>> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
>> >> *skb)
>> >>    struct sk_buff *segs;
>> >>    int ret = 0;
>> >>
>> >> -  /* common case: locally created skb or seglen is <= mtu */
>> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> +  /* Both locally created skb and forwarded skb could exceed
>> >> +   * MTU size, so make a unified rule for them all.
>> >> +   */
>> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >>            return ip_finish_output2(skb);
>
>
> Are you using the kernel's vxlan device or openvswitch's vxlan device?
>
> Because for the kernel's vxlan devices the MTU accounts for the header
> overhead, so I believe your patch would work.  However, the MTU is
> not visible for ovs's vxlan devices, so that wouldn't work.

This is being called after the tunnel code, so the MTU that is being
looked at in all cases is the physical device's. Since the packet has
already been encapsulated, tunnel header overhead is already accounted
for in skb_gso_network_seglen() and this should be fine for both OVS
and non-OVS cases.
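
For reference, a sketch of the helper as it looks around this time
(paraphrased from include/linux/skbuff.h; details may differ between
versions):

	static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb)
	{
		/* Header bytes from the network header up to the transport
		 * header, plus the per-segment transport header and payload
		 * size reported by skb_gso_transport_seglen().
		 */
		unsigned int hdr_len = skb_transport_header(skb) -
				       skb_network_header(skb);

		return hdr_len + skb_gso_transport_seglen(skb);
	}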


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-02 17:41                 ` Thomas Graf
@ 2014-12-02 18:12                   ` Jesse Gross
       [not found]                     ` <CAEP_g=-86Z6pxNow-wjnbx_v9er_TSn6x5waigqVqYHa7tEQJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-12-03  2:31                   ` Du, Fan
  1 sibling, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2014-12-02 18:12 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev, davem, fw, dev,
	Pravin Shelar

On Tue, Dec 2, 2014 at 9:41 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>> > > What about containers or any other virtualization environment that
>> > > doesn't use Virtio?
>> >
>> > The host can dictate the MTU in that case for both veth or OVS
>> > internal which would be primary container plumbing techniques.
>>
>> It typically can't do this easily for VMs with emulated devices:
>> real ethernet uses a fixed MTU.
>>
>> IMHO it's confusing to suggest MTU as a fix for this bug, it's
>> an unrelated optimization.
>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>
> PMTU discovery only resolves the issue if an actual IP stack is
> running inside the VM. This may not be the case at all.

It's also only really a correct thing to do if the ICMP packet is
coming from an L3 node. If you are doing straight bridging then you
have to resort to hacks like OVS had before, which I agree are not
particularly desirable.

> I agree that exposing an MTU towards the guest is not applicable
> in all situations, in particular because it is difficult to decide
> what MTU to expose. It is a relatively elegant solution in a lot
> of virtualization host cases hooked up to an orchestration system
> though.

I also think this is the right thing to do as a common case
optimization and I know other platforms (such as Hyper-V) do it. It's
not a complete solution so we still need the original patch in this
thread to handle things transparently.


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-02 18:06       ` Jesse Gross
@ 2014-12-02 21:32         ` Flavio Leitner
  2014-12-02 21:47           ` Jesse Gross
  2014-12-03  1:58           ` Du, Fan
  0 siblings, 2 replies; 57+ messages in thread
From: Flavio Leitner @ 2014-12-02 21:32 UTC (permalink / raw)
  To: Jesse Gross; +Cc: Du, Fan, Jason Wang, netdev, davem, fw

On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:
> On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
> > On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
> >>
> >>
> >> >-----Original Message-----
> >> >From: Jason Wang [mailto:jasowang@redhat.com]
> >> >Sent: Friday, November 28, 2014 3:02 PM
> >> >To: Du, Fan
> >> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> >> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> >> >
> >> >
> >> >
> >> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> >> >> Test scenario: two KVM guests sitting on different hosts communicate
> >> >> with each other over a vxlan tunnel.
> >> >>
> >> >> All interfaces use the default MTU of 1500 bytes. From the guest's
> >> >> point of view, its skb gso_size can be as big as 1448 bytes; however,
> >> >> after the guest skb goes through vxlan encapsulation, the length of
> >> >> individual segments of a gso packet can exceed the physical NIC MTU
> >> >> of 1500, and such packets are lost at the receiver side.
> >> >>
> >> >> So it is possible in a virtualized environment for the length of a
> >> >> locally created skb to exceed the underlying MTU after encapsulation.
> >> >> In such a case it is reasonable to do GSO first, then fragment any
> >> >> segment bigger than the MTU where possible.
> >> >>
> >> >> +---------------+ TX     RX +---------------+
> >> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> >> >> +-+-----------+-+           +-+-----------+-+
> >> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> >> >>   +-----------+               +-----------+
> >> >>        |                            |
> >> >>        v tap0                  tap0 v
> >> >>   +-----------+               +-----------+
> >> >>   | ovs bridge|               | ovs bridge|
> >> >>   +-----------+               +-----------+
> >> >>        | vxlan                vxlan |
> >> >>        v                            v
> >> >>   +-----------+               +-----------+
> >> >>   |    NIC    |    <------>   |    NIC    |
> >> >>   +-----------+               +-----------+
> >> >>
> >> >> Steps to reproduce:
> >> >>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
> >> >>  2. Run iperf without -M; the connection gets stuck.
> >> >
> >> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> >> >believe.
> >>
> >> The problem here is that the host stack pushes a locally created over-sized
> >> gso skb down to the NIC and performs GSO there, without any further ip
> >> segmentation.
> >>
> >> The reasonable behavior is to do gso first at the ip level; if the gso-ed skb
> >> is bigger than the MTU and DF is set, then push an
> >> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to the sender to adjust its mtu.
> >>
> >> Making PMTU work is another issue that I will try to address later on.
> >>
> >> >>
> >> >>
> >> >> Signed-off-by: Fan Du <fan.du@intel.com>
> >> >> ---
> >> >>  net/ipv4/ip_output.c |    7 ++++---
> >> >>  1 files changed, 4 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
> >> >> bc6471d..558b5f8 100644
> >> >> --- a/net/ipv4/ip_output.c
> >> >> +++ b/net/ipv4/ip_output.c
> >> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
> >> >> *skb)
> >> >>    struct sk_buff *segs;
> >> >>    int ret = 0;
> >> >>
> >> >> -  /* common case: locally created skb or seglen is <= mtu */
> >> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> >> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >> >> +  /* Both locally created skb and forwarded skb could exceed
> >> >> +   * MTU size, so make a unified rule for them all.
> >> >> +   */
> >> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >> >>            return ip_finish_output2(skb);
> >
> >
> > Are you using the kernel's vxlan device or openvswitch's vxlan device?
> >
> > Because for the kernel's vxlan devices the MTU accounts for the header
> > overhead, so I believe your patch would work.  However, the MTU is
> > not visible for ovs's vxlan devices, so that wouldn't work.
> 
> This is being called after the tunnel code, so the MTU that is being
> looked at in all cases is the physical device's. Since the packet has
> already been encapsulated, tunnel header overhead is already accounted
> for in skb_gso_network_seglen() and this should be fine for both OVS
> and non-OVS cases.

Right, it didn't work on my first try and that explanation came to mind.

Anyway, I am testing this with containers instead of VMs, so I am using
veth and not Virtio-net.

This is the actual stack trace:

[...]
  do_output
  ovs_vport_send
  vxlan_tnl_send
  vxlan_xmit_skb
  udp_tunnel_xmit_skb
  iptunnel_xmit
   \ skb_scrub_packet => skb->ignore_df = 0;
  ip_local_out_sk
  ip_output
  ip_finish_output (_gso is inlined)
  ip_fragment

and on ip_fragment() it does:

 503         if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
 504                      (IPCB(skb)->frag_max_size &&
 505                       IPCB(skb)->frag_max_size > mtu))) {
 506                 IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 507                 icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 508                           htonl(mtu));
 509                 kfree_skb(skb);
 510                 return -EMSGSIZE;
 511         }

Since IP_DF is set and skb->ignore_df is reset to 0, in my case
the packet is dropped and an ICMP is sent back. The connection
remains stuck as before. Doesn't virtio-net set the DF bit?

Thanks,
fbl


* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-02 21:32         ` Flavio Leitner
@ 2014-12-02 21:47           ` Jesse Gross
  2014-12-03  1:58           ` Du, Fan
  1 sibling, 0 replies; 57+ messages in thread
From: Jesse Gross @ 2014-12-02 21:47 UTC (permalink / raw)
  To: Jesse Gross, Du, Fan, Jason Wang, netdev, davem, fw

On Tue, Dec 2, 2014 at 1:32 PM, Flavio Leitner <fbl@redhat.com> wrote:
> On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:
>> On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
>> > On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
>> >>
>> >>
>> >> >-----Original Message-----
>> >> >From: Jason Wang [mailto:jasowang@redhat.com]
>> >> >Sent: Friday, November 28, 2014 3:02 PM
>> >> >To: Du, Fan
>> >> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
>> >> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>> >> >
>> >> >
>> >> >
>> >> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> >> >> Test scenario: two KVM guests sitting on different hosts communicate
>> >> >> with each other over a vxlan tunnel.
>> >> >>
>> >> >> All interfaces use the default MTU of 1500 bytes. From the guest's
>> >> >> point of view, its skb gso_size can be as big as 1448 bytes; however,
>> >> >> after the guest skb goes through vxlan encapsulation, the length of
>> >> >> individual segments of a gso packet can exceed the physical NIC MTU
>> >> >> of 1500, and such packets are lost at the receiver side.
>> >> >>
>> >> >> So it is possible in a virtualized environment for the length of a
>> >> >> locally created skb to exceed the underlying MTU after encapsulation.
>> >> >> In such a case it is reasonable to do GSO first, then fragment any
>> >> >> segment bigger than the MTU where possible.
>> >> >>
>> >> >> +---------------+ TX     RX +---------------+
>> >> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> >> >> +-+-----------+-+           +-+-----------+-+
>> >> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
>> >> >>   +-----------+               +-----------+
>> >> >>        |                            |
>> >> >>        v tap0                  tap0 v
>> >> >>   +-----------+               +-----------+
>> >> >>   | ovs bridge|               | ovs bridge|
>> >> >>   +-----------+               +-----------+
>> >> >>        | vxlan                vxlan |
>> >> >>        v                            v
>> >> >>   +-----------+               +-----------+
>> >> >>   |    NIC    |    <------>   |    NIC    |
>> >> >>   +-----------+               +-----------+
>> >> >>
>> >> >> Steps to reproduce:
>> >> >>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
>> >> >>  2. Run iperf without -M; the connection gets stuck.
>> >> >
>> >> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
>> >> >believe.
>> >>
>> >> The problem here is that the host stack pushes a locally created over-sized
>> >> gso skb down to the NIC and performs GSO there, without any further ip
>> >> segmentation.
>> >>
>> >> The reasonable behavior is to do gso first at the ip level; if the gso-ed skb
>> >> is bigger than the MTU and DF is set, then push an
>> >> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to the sender to adjust its mtu.
>> >>
>> >> Making PMTU work is another issue that I will try to address later on.
>> >>
>> >> >>
>> >> >>
>> >> >> Signed-off-by: Fan Du <fan.du@intel.com>
>> >> >> ---
>> >> >>  net/ipv4/ip_output.c |    7 ++++---
>> >> >>  1 files changed, 4 insertions(+), 3 deletions(-)
>> >> >>
>> >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
>> >> >> bc6471d..558b5f8 100644
>> >> >> --- a/net/ipv4/ip_output.c
>> >> >> +++ b/net/ipv4/ip_output.c
>> >> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
>> >> >> *skb)
>> >> >>    struct sk_buff *segs;
>> >> >>    int ret = 0;
>> >> >>
>> >> >> -  /* common case: locally created skb or seglen is <= mtu */
>> >> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> >> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >> +  /* Both locally created skb and forwarded skb could exceed
>> >> >> +   * MTU size, so make a unified rule for them all.
>> >> >> +   */
>> >> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >>            return ip_finish_output2(skb);
>> >
>> >
>> > Are you using kernel's vxlan device or openvswitch's vxlan device?
>> >
>> > Because for kernel's vxlan devices the MTU accounts for the header
>> > overhead so I believe your patch would work.  However, the MTU is
>> > not visible for the ovs's vxlan devices, so that wouldn't work.
>>
>> This is being called after the tunnel code, so the MTU that is being
>> looked at in all cases is the physical device's. Since the packet has
>> already been encapsulated, tunnel header overhead is already accounted
>> for in skb_gso_network_seglen() and this should be fine for both OVS
>> and non-OVS cases.
>
> Right, it didn't work on my first try and that explanation came to mind.
>
> Anyway, I am testing this with containers instead of VMs, so I am using
> veth and not Virtio-net.
>
> This is the actual stack trace:
>
> [...]
>   do_output
>   ovs_vport_send
>   vxlan_tnl_send
>   vxlan_xmit_skb
>   udp_tunnel_xmit_skb
>   iptunnel_xmit
>    \ skb_scrub_packet => skb->ignore_df = 0;
>   ip_local_out_sk
>   ip_output
>   ip_finish_output (_gso is inlined)
>   ip_fragment
>
> and on ip_fragment() it does:
>
>  503         if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
>  504                      (IPCB(skb)->frag_max_size &&
>  505                       IPCB(skb)->frag_max_size > mtu))) {
>  506                 IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
>  507                 icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>  508                           htonl(mtu));
>  509                 kfree_skb(skb);
>  510                 return -EMSGSIZE;
>  511         }
>
> Since IP_DF is set and skb->ignore_df is reset to 0, in my case
> the packet is dropped and an ICMP is sent back. The connection
> remains stuck as before. Doesn't virtio-net set DF bit?

I see now - I agree it's not clear how the original patch worked. The
DF bit is actually set by the encapsulator so it should be the same
regardless of how the packet got there (in OVS it is on by default).
It seems like skb_scrub_packet() shouldn't clear skb->ignore_df (or if
it should in other cases then it seems we need an option to not do it
for tunnels).
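
For reference, skb_scrub_packet() in this era looks roughly like this
(simplified from memory, so the exact field list may differ), with the
clear in question marked - one option would be to lift that line out of
the scrub and into the callers that actually want it:

void skb_scrub_packet(struct sk_buff *skb, bool xnet)
{
	skb->tstamp.tv64 = 0;
	skb->pkt_type = PACKET_HOST;
	skb->skb_iif = 0;
	skb->ignore_df = 0;	/* <-- undoes what the tunnel path set */
	skb_dst_drop(skb);
	skb->mark = 0;
	secpath_reset(skb);
	nf_reset(skb);
}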

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-02 21:32         ` Flavio Leitner
  2014-12-02 21:47           ` Jesse Gross
@ 2014-12-03  1:58           ` Du, Fan
  1 sibling, 0 replies; 57+ messages in thread
From: Du, Fan @ 2014-12-03  1:58 UTC (permalink / raw)
  To: Flavio Leitner, Jesse Gross; +Cc: Jason Wang, netdev, davem, fw, Du, Fan



>-----Original Message-----
>From: Flavio Leitner [mailto:fbl@redhat.com]
>Sent: Wednesday, December 3, 2014 5:33 AM
>To: Jesse Gross
>Cc: Du, Fan; Jason Wang; netdev@vger.kernel.org; davem@davemloft.net;
>fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:
>> On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
>> > On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
>> >>
>> >>
>> >> >-----Original Message-----
>> >> >From: Jason Wang [mailto:jasowang@redhat.com]
>> >> >Sent: Friday, November 28, 2014 3:02 PM
>> >> >To: Du, Fan
>> >> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du,
>> >> >Fan
>> >> >Subject: Re: [PATCH net] gso: do GSO for local skb with size
>> >> >bigger than MTU
>> >> >
>> >> >
>> >> >
>> >> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> >> >> Test scenario: two KVM guests sitting in different hosts
>> >> >> communicate to each other with a vxlan tunnel.
>> >> >>
>> >> >> All interface MTU is default 1500 Bytes, from guest point of
>> >> >> view, its skb gso_size could be as bigger as 1448Bytes, however
>> >> >> after guest skb goes through vxlan encapuslation, individual
>> >> >> segments length of a gso packet could exceed physical NIC MTU
>> >> >> 1500, which will be lost at recevier side.
>> >> >>
>> >> >> So it's possible in virtualized environment, locally created skb
>> >> >> len after encapslation could be bigger than underlayer MTU. In
>> >> >> such case, it's reasonable to do GSO first, then fragment any
>> >> >> packet bigger than MTU as possible.
>> >> >>
>> >> >> +---------------+ TX     RX +---------------+
>> >> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> >> >> +-+-----------+-+           +-+-----------+-+
>> >> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
>> >> >>   +-----------+               +-----------+
>> >> >>        |                            |
>> >> >>        v tap0                  tap0 v
>> >> >>   +-----------+               +-----------+
>> >> >>   | ovs bridge|               | ovs bridge|
>> >> >>   +-----------+               +-----------+
>> >> >>        | vxlan                vxlan |
>> >> >>        v                            v
>> >> >>   +-----------+               +-----------+
>> >> >>   |    NIC    |    <------>   |    NIC    |
>> >> >>   +-----------+               +-----------+
>> >> >>
>> >> >> Steps to reproduce:
>> >> >>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>> >> >>  2. Runing iperf without -M, communication will stuck.
>> >> >
>> >> >Is this issue specific to ovs or ipv4? Path MTU discovery should
>> >> >help in this case I believe.
>> >>
>> >> Problem here is host stack push local over-sized gso skb down to
>> >> NIC, and perform GSO there without any further ip segmentation.
>> >>
>> >> Reasonable behavior is do gso first at ip level, if gso-ed skb is
>> >> bigger than MTU && df is set, Then push
>ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust
>mtu.
>> >>
>> >> For PMTU to work, that's another issue I will try to address later on.
>> >>
>> >> >>
>> >> >>
>> >> >> Signed-off-by: Fan Du <fan.du@intel.com>
>> >> >> ---
>> >> >>  net/ipv4/ip_output.c |    7 ++++---
>> >> >>  1 files changed, 4 insertions(+), 3 deletions(-)
>> >> >>
>> >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
>> >> >> bc6471d..558b5f8 100644
>> >> >> --- a/net/ipv4/ip_output.c
>> >> >> +++ b/net/ipv4/ip_output.c
>> >> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct
>> >> >> sk_buff
>> >> >> *skb)
>> >> >>    struct sk_buff *segs;
>> >> >>    int ret = 0;
>> >> >>
>> >> >> -  /* common case: locally created skb or seglen is <= mtu */
>> >> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> >> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >> +  /* Both locally created skb and forwarded skb could exceed
>> >> >> +   * MTU size, so make a unified rule for them all.
>> >> >> +   */
>> >> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >>            return ip_finish_output2(skb);
>> >
>> >
>> > Are you using kernel's vxlan device or openvswitch's vxlan device?
>> >
>> > Because for kernel's vxlan devices the MTU accounts for the header
>> > overhead so I believe your patch would work.  However, the MTU is
>> > not visible for the ovs's vxlan devices, so that wouldn't work.
>>
>> This is being called after the tunnel code, so the MTU that is being
>> looked at in all cases is the physical device's. Since the packet has
>> already been encapsulated, tunnel header overhead is already accounted
>> for in skb_gso_network_seglen() and this should be fine for both OVS
>> and non-OVS cases.
>
>Right, it didn't work on my first try and that explanation came to mind.
>
>Anyway, I am testing this with containers instead of VMs, so I am using veth and
>not Virtio-net.
>
>This is the actual stack trace:
>
>[...]
>  do_output
>  ovs_vport_send
>  vxlan_tnl_send
>  vxlan_xmit_skb
>  udp_tunnel_xmit_skb
>  iptunnel_xmit
>   \ skb_scrub_packet => skb->ignore_df = 0;
>  ip_local_out_sk
>  ip_output
>  ip_finish_output (_gso is inlined)
>  ip_fragment
>
>and on ip_fragment() it does:
>
> 503         if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
> 504                      (IPCB(skb)->frag_max_size &&
> 505                       IPCB(skb)->frag_max_size > mtu))) {
> 506                 IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
> 507                 icmp_send(skb, ICMP_DEST_UNREACH,
>ICMP_FRAG_NEEDED,
> 508                           htonl(mtu));
> 509                 kfree_skb(skb);
> 510                 return -EMSGSIZE;
> 511         }
>
>Since IP_DF is set and skb->ignore_df is reset to 0, in my case the packet is
>dropped and an ICMP is sent back. The connection remains stuck as before.
>Doesn't virtio-net set DF bit?

Thanks for giving it a try and seeing what really happens.

You're almost there! ip_fragment() honors IP_DF, and that bit is taken
care of by the vxlan interface. In a practical environment, the vxlan
interface should take a conservative attitude and allow fragmentation,
by appending "options: df_default=false" when creating the vxlan interface.
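
For context, the outer DF bit in this era's encap path boils down to a
tunnel flag; roughly (a simplified sketch, not the exact OVS code, and
the vxlan_xmit_skb() parameter order is from memory):

/* With df_default=false the DONT_FRAGMENT tunnel flag is not set, so
 * the outer IP header goes out with DF clear and ip_fragment() is
 * allowed to fragment the encapsulated skb.
 */
__be16 df = (tun_flags & TUNNEL_DONT_FRAGMENT) ? htons(IP_DF) : 0;

err = vxlan_xmit_skb(vs, rt, skb, saddr, daddr, tos, ttl, df,
		     src_port, dst_port, vni, false);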

Why allow fragmentation? Because a Guest or Container may send
over-MTU-sized packets downwards, and the host is expected to be
prepared for that. This is just what happens in real-world cloud
environments.


>Thanks,
>fbl

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-02 17:41                 ` Thomas Graf
  2014-12-02 18:12                   ` Jesse Gross
@ 2014-12-03  2:31                   ` Du, Fan
  2015-01-05  6:02                     ` Fan Du
  1 sibling, 1 reply; 57+ messages in thread
From: Du, Fan @ 2014-12-03  2:31 UTC (permalink / raw)
  To: Thomas Graf, Michael S. Tsirkin
  Cc: 'Jason Wang', netdev, davem, fw, dev, jesse, pshelar, Du, Fan



>-----Original Message-----
>From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>Sent: Wednesday, December 3, 2014 1:42 AM
>To: Michael S. Tsirkin
>Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>> > > What about containers or any other virtualization environment that
>> > > doesn't use Virtio?
>> >
>> > The host can dictate the MTU in that case for both veth or OVS
>> > internal which would be primary container plumbing techniques.
>>
>> It typically can't do this easily for VMs with emulated devices:
>> real ethernet uses a fixed MTU.
>>
>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>> unrelated optimization.
>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>
>PMTU discovery only resolves the issue if an actual IP stack is running inside the
>VM. This may not be the case at all.
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some thoughts here:

Looking at it the other way, this is indeed a case where the host stack
should forge an ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message from the
_inner_ skb's network and transport headers, for whatever type of
encapsulation is in use, and then push that packet upward to the
Guest/Container, making it look as if the intermediate node or the peer
had sent it. PMTU could then be expected to work correctly, and the
same behavior could be shared by every other encapsulation technology
that suffers from this.
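
As a rough illustration of the idea (a hedged sketch only - tunnel_hlen
here is a stand-in for the encapsulation overhead; the routed ip_tunnel
code already does something very similar when updating PMTU):

int mtu = dst_mtu(skb_dst(skb)) - tunnel_hlen;

/* If the inner packet cannot fit once encapsulated and DF is set,
 * answer on the tunnel's behalf; the guest's PMTU cache then
 * converges without any ICMP ever hitting the wire.
 */
if (skb->protocol == htons(ETH_P_IP) &&
    (ip_hdr(skb)->frag_off & htons(IP_DF)) &&
    skb->len > mtu) {
	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
		  htonl(mtu));
	return -EMSGSIZE;
}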


>I agree that exposing an MTU towards the guest is not applicable in all situations,
>in particular because it is difficult to decide what MTU to expose. It is a relatively
>elegant solution in a lot of virtualization host cases hooked up to an orchestration
>system though.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-11-28  6:33 [PATCH net] gso: do GSO for local skb with size bigger than MTU Fan Du
  2014-11-28  7:02 ` Jason Wang
  2014-11-30 10:26 ` Florian Westphal
@ 2014-12-03  3:23 ` David Miller
  2014-12-03  3:32   ` Du, Fan
  2 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2014-12-03  3:23 UTC (permalink / raw)
  To: fan.du; +Cc: netdev, fw

From: Fan Du <fan.du@intel.com>
Date: Fri, 28 Nov 2014 14:33:05 +0800

> Test scenario: two KVM guests sitting in different
> hosts communicate to each other with a vxlan tunnel.
> 
> All interface MTU is default 1500 Bytes, from guest point
> of view, its skb gso_size could be as bigger as 1448Bytes,
> however after guest skb goes through vxlan encapuslation,
> individual segments length of a gso packet could exceed
> physical NIC MTU 1500, which will be lost at recevier side.
> 
> So it's possible in virtualized environment, locally created
> skb len after encapslation could be bigger than underlayer
> MTU. In such case, it's reasonable to do GSO first,
> then fragment any packet bigger than MTU as possible.
> 
> +---------------+ TX     RX +---------------+
> |   KVM Guest   | -> ... -> |   KVM Guest   |
> +-+-----------+-+           +-+-----------+-+
>   |Qemu/VirtIO|               |Qemu/VirtIO|
>   +-----------+               +-----------+
>        |                            |
>        v tap0                  tap0 v
>   +-----------+               +-----------+
>   | ovs bridge|               | ovs bridge|
>   +-----------+               +-----------+
>        | vxlan                vxlan |
>        v                            v
>   +-----------+               +-----------+
>   |    NIC    |    <------>   |    NIC    |
>   +-----------+               +-----------+
> 
> Steps to reproduce:
>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>  2. Runing iperf without -M, communication will stuck.
> 
> Signed-off-by: Fan Du <fan.du@intel.com>

I really don't like this at all.

If guest sees a 1500 byte MTU, that's it's link layer MTU and it had
better be able to send 1500 byte packets onto the "wire".

If you cannot properly propagate the vxlan encapsulation overhead back
into the guest's MTU you must hide this problem from the rest of our
stack somehow.

Nothing we create inside the host should need the change that you
are making.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03  3:23 ` David Miller
@ 2014-12-03  3:32   ` Du, Fan
  2014-12-03  4:35     ` David Miller
  0 siblings, 1 reply; 57+ messages in thread
From: Du, Fan @ 2014-12-03  3:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, fw, Du, Fan



>-----Original Message-----
>From: David Miller [mailto:davem@davemloft.net]
>Sent: Wednesday, December 3, 2014 11:23 AM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>From: Fan Du <fan.du@intel.com>
>Date: Fri, 28 Nov 2014 14:33:05 +0800
>
>> Test scenario: two KVM guests sitting in different hosts communicate
>> to each other with a vxlan tunnel.
>>
>> All interface MTU is default 1500 Bytes, from guest point of view, its
>> skb gso_size could be as bigger as 1448Bytes, however after guest skb
>> goes through vxlan encapuslation, individual segments length of a gso
>> packet could exceed physical NIC MTU 1500, which will be lost at
>> recevier side.
>>
>> So it's possible in virtualized environment, locally created skb len
>> after encapslation could be bigger than underlayer MTU. In such case,
>> it's reasonable to do GSO first, then fragment any packet bigger than
>> MTU as possible.
>>
>> +---------------+ TX     RX +---------------+
>> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> +-+-----------+-+           +-+-----------+-+
>>   |Qemu/VirtIO|               |Qemu/VirtIO|
>>   +-----------+               +-----------+
>>        |                            |
>>        v tap0                  tap0 v
>>   +-----------+               +-----------+
>>   | ovs bridge|               | ovs bridge|
>>   +-----------+               +-----------+
>>        | vxlan                vxlan |
>>        v                            v
>>   +-----------+               +-----------+
>>   |    NIC    |    <------>   |    NIC    |
>>   +-----------+               +-----------+
>>
>> Steps to reproduce:
>>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>>  2. Runing iperf without -M, communication will stuck.
>>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>
>I really don't like this at all.
>
>If guest sees a 1500 byte MTU, that's it's link layer MTU and it had better be able to
>send 1500 byte packets onto the "wire".

This patch makes it happen exactly as you put it.

>If you cannot properly propagate the vxlan encapsulation overhead back into the
>guest's MTU you must hide this problem from the rest of our stack somehow.

Again, this patch hides the problem so that the Guest feels it can send packets with an MTU of 1500 bytes.

>Nothing we create inside the host should need the change that you are making.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03  3:32   ` Du, Fan
@ 2014-12-03  4:35     ` David Miller
  2014-12-03  4:50       ` Du, Fan
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2014-12-03  4:35 UTC (permalink / raw)
  To: fan.du; +Cc: netdev, fw

From: "Du, Fan" <fan.du@intel.com>
Date: Wed, 3 Dec 2014 03:32:46 +0000

>>If guest sees a 1500 byte MTU, that's it's link layer MTU and it had better be able to
>>send 1500 byte packets onto the "wire".
> 
> This patch makes it happens exactly as you putted.
> 
>>If you cannot properly propagate the vxlan encapsulation overhead back into the
>>guest's MTU you must hide this problem from the rest of our stack somehow.
> 
> Again, this patch hide this problem to make Guest feel it can send packet with MTU as 1500 bytes.

I said make the guest see the real MTU, not hide the real MTU by
fragmenting or spitting ICMP PMTU messages back.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03  4:35     ` David Miller
@ 2014-12-03  4:50       ` Du, Fan
  2014-12-03  5:14         ` David Miller
  0 siblings, 1 reply; 57+ messages in thread
From: Du, Fan @ 2014-12-03  4:50 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, fw, Du, Fan



>-----Original Message-----
>From: David Miller [mailto:davem@davemloft.net]
>Sent: Wednesday, December 3, 2014 12:35 PM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>From: "Du, Fan" <fan.du@intel.com>
>Date: Wed, 3 Dec 2014 03:32:46 +0000
>
>>>If guest sees a 1500 byte MTU, that's it's link layer MTU and it had
>>>better be able to send 1500 byte packets onto the "wire".
>>
>> This patch makes it happens exactly as you putted.
>>
>>>If you cannot properly propagate the vxlan encapsulation overhead back
>>>into the guest's MTU you must hide this problem from the rest of our stack
>somehow.
>>
>> Again, this patch hide this problem to make Guest feel it can send packet with
>MTU as 1500 bytes.
>
>I said make the guest see the real MTU, not hide the real MTU by fragmenting or
>spitting ICMP PMTU messages back.

Do you have any better idea for achieving what you said, besides this
patch's approach, that avoids both fragmentation and ICMP messages at
the same time and caters for all kinds of tunnel tech?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03  4:50       ` Du, Fan
@ 2014-12-03  5:14         ` David Miller
  2014-12-03  6:53           ` Du, Fan
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2014-12-03  5:14 UTC (permalink / raw)
  To: fan.du; +Cc: netdev, fw

From: "Du, Fan" <fan.du@intel.com>
Date: Wed, 3 Dec 2014 04:50:21 +0000

> Do you have any better idea to achieve what you said besides this patch approach
> without both fragmentation and ICMP message at the same time to cater for all kinds
> tunnel tech?

I am not obligated to figure out for you how to design a correctly
implemented patch.

But I am obligated to keep a bad change from going into the tree and
that is what I am doing.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03  5:14         ` David Miller
@ 2014-12-03  6:53           ` Du, Fan
  0 siblings, 0 replies; 57+ messages in thread
From: Du, Fan @ 2014-12-03  6:53 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, fw, Du, Fan



>-----Original Message-----
>From: David Miller [mailto:davem@davemloft.net]
>Sent: Wednesday, December 3, 2014 1:15 PM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>From: "Du, Fan" <fan.du@intel.com>
>Date: Wed, 3 Dec 2014 04:50:21 +0000
>
>> Do you have any better idea to achieve what you said besides this
>> patch approach without both fragmentation and ICMP message at the same
>> time to cater for all kinds tunnel tech?
>
>I am not obligated to figure out for you how to design a correctly implemented
>patch.
>
>But I am obligated to keep a bad change from going into the tree and that is what I
>am doing.

"bad" is not depending whether you say it or not, but what the real world needs and what
proper solution could be provided at the time being.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]                     ` <CAEP_g=-86Z6pxNow-wjnbx_v9er_TSn6x5waigqVqYHa7tEQJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-12-03  9:03                       ` Michael S. Tsirkin
  2014-12-03 18:07                         ` Jesse Gross
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2014-12-03  9:03 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	Jason Wang, Du, Fan, fw-HFFVJYpyMKqzQB+pC5nmwQ,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Tue, Dec 02, 2014 at 10:12:04AM -0800, Jesse Gross wrote:
> On Tue, Dec 2, 2014 at 9:41 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
> >> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> >> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> >> > > What about containers or any other virtualization environment that
> >> > > doesn't use Virtio?
> >> >
> >> > The host can dictate the MTU in that case for both veth or OVS
> >> > internal which would be primary container plumbing techniques.
> >>
> >> It typically can't do this easily for VMs with emulated devices:
> >> real ethernet uses a fixed MTU.
> >>
> >> IMHO it's confusing to suggest MTU as a fix for this bug, it's
> >> an unrelated optimization.
> >> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
> >
> > PMTU discovery only resolves the issue if an actual IP stack is
> > running inside the VM. This may not be the case at all.
> 
> It's also only really a correct thing to do if the ICMP packet is
> coming from an L3 node. If you are doing straight bridging then you
> have to resort to hacks like OVS had before, which I agree are not
> particularly desirable.

The issue seems to be that, fundamentally, this is bridging interfaces
with variable MTUs (even if the MTU values on the devices don't let us
figure this out) - that is already not straight bridging, and I would
argue sending ICMPs back is the right thing to do.


> > I agree that exposing an MTU towards the guest is not applicable
> > in all situations, in particular because it is difficult to decide
> > what MTU to expose. It is a relatively elegant solution in a lot
> > of virtualization host cases hooked up to an orchestration system
> > though.
> 
> I also think this is the right thing to do as a common case
> optimization and I know other platforms (such as Hyper-V) do it. It's
> not a complete solution so we still need the original patch in this
> thread to handle things transparently.

Well, as I believe David (and independently Jason) is saying, it looks like
the ICMPs we are sending back after applying the original patch have the
wrong MTU.

And if I understand what David is saying here, IP is also the wrong place to
do it.

-- 
MST

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03  9:03                       ` Michael S. Tsirkin
@ 2014-12-03 18:07                         ` Jesse Gross
       [not found]                           ` <CAEP_g=9C+D3gbjJ4n1t6xuyjqEAMYi4ZfqPoe92UAoQJH-UsKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2014-12-03 18:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Thomas Graf, Du, Fan, Jason Wang, netdev, davem, fw, dev, Pravin Shelar

On Wed, Dec 3, 2014 at 1:03 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Dec 02, 2014 at 10:12:04AM -0800, Jesse Gross wrote:
>> On Tue, Dec 2, 2014 at 9:41 AM, Thomas Graf <tgraf@suug.ch> wrote:
>> > On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>> >> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>> >> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>> >> > > What about containers or any other virtualization environment that
>> >> > > doesn't use Virtio?
>> >> >
>> >> > The host can dictate the MTU in that case for both veth or OVS
>> >> > internal which would be primary container plumbing techniques.
>> >>
>> >> It typically can't do this easily for VMs with emulated devices:
>> >> real ethernet uses a fixed MTU.
>> >>
>> >> IMHO it's confusing to suggest MTU as a fix for this bug, it's
>> >> an unrelated optimization.
>> >> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>> >
>> > PMTU discovery only resolves the issue if an actual IP stack is
>> > running inside the VM. This may not be the case at all.
>>
>> It's also only really a correct thing to do if the ICMP packet is
>> coming from an L3 node. If you are doing straight bridging then you
>> have to resort to hacks like OVS had before, which I agree are not
>> particularly desirable.
>
> The issue seems to be that fundamentally, this is
> bridging interfaces with variable MTUs (even if MTU values
> on devices don't let us figure this out)-
> that is already not straight bridging, and
> I would argue sending ICMPs back is the right thing to do.

How do you deal with the fact that there is no host IP stack inside
the tunnel? And isn't this exactly the same as the former OVS
implementation that you said you didn't like?

>> > I agree that exposing an MTU towards the guest is not applicable
>> > in all situations, in particular because it is difficult to decide
>> > what MTU to expose. It is a relatively elegant solution in a lot
>> > of virtualization host cases hooked up to an orchestration system
>> > though.
>>
>> I also think this is the right thing to do as a common case
>> optimization and I know other platforms (such as Hyper-V) do it. It's
>> not a complete solution so we still need the original patch in this
>> thread to handle things transparently.
>
> Well, as I believe David (and independently Jason) is saying, it looks like
> the ICMPs we are sending back after applying the original patch have the
> wrong MTU.

The problem is actually that the ICMP messages won't even go to the
sending VM because the host IP stack and the VM are isolated from each
other and there is no route.

> And if I understand what David is saying here, IP is also the wrong place to
> do it.

ICMP can't be the complete solution in any case because it only works
for IP traffic. I think there are only two full solutions: find a way
to adjust the guest MTU to the minimum MTU that its traffic could hit
in an L2 domain, or fragmentation. ICMP could be a possible
optimization in the fragmentation case.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]                           ` <CAEP_g=9C+D3gbjJ4n1t6xuyjqEAMYi4ZfqPoe92UAoQJH-UsKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-12-03 18:38                             ` Michael S. Tsirkin
  2014-12-03 18:56                               ` Rick Jones
  2014-12-03 19:38                               ` Jesse Gross
  0 siblings, 2 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2014-12-03 18:38 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	Jason Wang, Du, Fan, fw-HFFVJYpyMKqzQB+pC5nmwQ,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Wed, Dec 03, 2014 at 10:07:42AM -0800, Jesse Gross wrote:
> On Wed, Dec 3, 2014 at 1:03 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Dec 02, 2014 at 10:12:04AM -0800, Jesse Gross wrote:
> >> On Tue, Dec 2, 2014 at 9:41 AM, Thomas Graf <tgraf@suug.ch> wrote:
> >> > On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
> >> >> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> >> >> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> >> >> > > What about containers or any other virtualization environment that
> >> >> > > doesn't use Virtio?
> >> >> >
> >> >> > The host can dictate the MTU in that case for both veth or OVS
> >> >> > internal which would be primary container plumbing techniques.
> >> >>
> >> >> It typically can't do this easily for VMs with emulated devices:
> >> >> real ethernet uses a fixed MTU.
> >> >>
> >> >> IMHO it's confusing to suggest MTU as a fix for this bug, it's
> >> >> an unrelated optimization.
> >> >> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
> >> >
> >> > PMTU discovery only resolves the issue if an actual IP stack is
> >> > running inside the VM. This may not be the case at all.
> >>
> >> It's also only really a correct thing to do if the ICMP packet is
> >> coming from an L3 node. If you are doing straight bridging then you
> >> have to resort to hacks like OVS had before, which I agree are not
> >> particularly desirable.
> >
> > The issue seems to be that fundamentally, this is
> > bridging interfaces with variable MTUs (even if MTU values
> > on devices don't let us figure this out)-
> > that is already not straight bridging, and
> > I would argue sending ICMPs back is the right thing to do.
> 
> How do you deal with the fact that there is no host IP stack inside
> the tunnel? And isn't this exactly the same as the former OVS
> implementation that you said you didn't like?

I was talking about the high-level requirement, not the implementation
here. I agree it's not at all trivial; we need to propagate this across
tunnels.

But let's agree on what we are trying to accomplish first.


> >> > I agree that exposing an MTU towards the guest is not applicable
> >> > in all situations, in particular because it is difficult to decide
> >> > what MTU to expose. It is a relatively elegant solution in a lot
> >> > of virtualization host cases hooked up to an orchestration system
> >> > though.
> >>
> >> I also think this is the right thing to do as a common case
> >> optimization and I know other platforms (such as Hyper-V) do it. It's
> >> not a complete solution so we still need the original patch in this
> >> thread to handle things transparently.
> >
> > Well, as I believe David (and independently Jason) is saying, it looks like
> > the ICMPs we are sending back after applying the original patch have the
> > wrong MTU.
> 
> The problem is actually that the ICMP messages won't even go to the
> sending VM because the host IP stack and the VM are isolated from each
> other and there is no route.

Exactly.
But all this is talking about implementation.

Let's agree on what we want to do first.

And in my mind, we typically want the originator to adjust its PMTU,
just for a given destination.
Sending an ICMP to the originating VM will typically accomplish this.




> > And if I understand what David is saying here, IP is also the wrong place to
> > do it.
> 
> ICMP can't be the complete solution in any case because it only works
> for IP traffic.

Let's be specific please.  What protocols do you most care about? IPX?

> I think there are only two full solutions: find a way
> to adjust the guest MTU to the minimum MTU that its traffic could hit
> in an L2 domain or fragmentation. ICMP could be a possible
> optimization in the fragmentation case.

Both approaches seem strange. You are sending 1 packet an hour to
some destination behind 100 tunnels. Why would you want to
cut down your MTU for all packets? On the other hand,
doubling the number of packets because your MTU is off
by a couple of bytes will hurt performance significantly.

Still, if you want to cut down the MTU within the guest,
that's only an ifconfig away.
Most people would not want to bother; I think it's a good
idea to make PMTU work properly for them.

-- 
MST

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03 18:38                             ` Michael S. Tsirkin
@ 2014-12-03 18:56                               ` Rick Jones
       [not found]                                 ` <547F5CC2.8000908-VXdhtT5mjnY@public.gmane.org>
  2014-12-03 19:38                               ` Jesse Gross
  1 sibling, 1 reply; 57+ messages in thread
From: Rick Jones @ 2014-12-03 18:56 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jesse Gross
  Cc: Thomas Graf, Du, Fan, Jason Wang, netdev, davem, fw, dev, Pravin Shelar

Trying to "fake-out" an ICMP message to paper-over "devices" in the 
"middle" of same Layer2 network having different MTUs from the ends goes 
back to at least the days when people started joining FDDI networks to 
Ethernet networks with bridges rather than routers.  Perhaps back even 
further.  It had problems them (including not all traffic being IP), I 
doubt today would be any different.

All devices in a Layer2 network must have the same MTU. (*)

The only exception to that which does not lead one down a rathole is 
that you can have the MTU at the "edges" of the Layer 2 network be 
smaller than the MTU in the "middle."  And then only if the middle 
"never" tries to talk itself to the edges directly.

rick jones

(*) Or at least any communicating device must be kept ignorant of what 
one does to have a smaller MTU in there somewhere.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03 18:38                             ` Michael S. Tsirkin
  2014-12-03 18:56                               ` Rick Jones
@ 2014-12-03 19:38                               ` Jesse Gross
  2014-12-03 22:02                                 ` Thomas Graf
  1 sibling, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2014-12-03 19:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Thomas Graf, Du, Fan, Jason Wang, netdev, davem, fw, dev, Pravin Shelar

On Wed, Dec 3, 2014 at 10:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Dec 03, 2014 at 10:07:42AM -0800, Jesse Gross wrote:.
>> ICMP can't be the complete solution in any case because it only works
>> for IP traffic.
>
> Let's be specific please.  What protocols do you most care about? IPX?
>
>> I think there are only two full solutions: find a way
>> to adjust the guest MTU to the minimum MTU that its traffic could hit
>> in an L2 domain or fragmentation. ICMP could be a possible
>> optimization in the fragmentation case.
>
> Both approaches seem strange. You are sending 1 packet an hour to
> some destination behind 100 tunnels. Why would you want to
> cut down your MTU for all packets? On the other hand,
> doubling the amount of packets because your MTU is off
> by a couple of bytes will hurt performance significantly.
>
> Still, if you want to cut down the MTU within guest,
> that's only an ifconfig away.
> Most people would not want to bother, I think it's a good
> idea to make PMTU work properly for them.

I care about correctness first, which means that an Ethernet link
being exposed to the guest should behave like Ethernet. So, yes, IPX
should work if somebody chooses to do that.

Your comments are about performance optimization. That's fine but
without a correct base to start from it seems like putting the cart
before the horse and is hard to reason about.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03 19:38                               ` Jesse Gross
@ 2014-12-03 22:02                                 ` Thomas Graf
       [not found]                                   ` <20141203220244.GA8822-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  2014-12-03 22:51                                   ` Jesse Gross
  0 siblings, 2 replies; 57+ messages in thread
From: Thomas Graf @ 2014-12-03 22:02 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev, davem, fw, dev,
	Pravin Shelar

On 12/03/14 at 11:38am, Jesse Gross wrote:
> On Wed, Dec 3, 2014 at 10:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > Both approaches seem strange. You are sending 1 packet an hour to
> > some destination behind 100 tunnels. Why would you want to
> > cut down your MTU for all packets? On the other hand,
> > doubling the amount of packets because your MTU is off
> > by a couple of bytes will hurt performance significantly.
> >
> > Still, if you want to cut down the MTU within guest,
> > that's only an ifconfig away.
> > Most people would not want to bother, I think it's a good
> > idea to make PMTU work properly for them.
> 
> I care about correctness first, which means that an Ethernet link
> being exposed to the guest should behave like Ethernet. So, yes, IPX
> should work if somebody chooses to do that.
> 
> Your comments are about performance optimization. That's fine but
> without a correct base to start from it seems like putting the cart
> before the horse and is hard to reason about.

I agree with Jesse in particular about correctness, but Michael has a
point (which I think nobody objects to), which is that it may not always
make sense to force the MTU onto the guest. It clearly makes sense for
the edge server connected to an overlay, but it may not be ideal if
WAN traffic is VXLAN encapped and local DC traffic is put onto a VLAN.

That said, I think it is fair to assume that the host knows what role
it plays and can be configured accordingly, i.e. a Netlink API which
exposes the encap overhead so libvirt can max() over it and force it
onto the guest, or something along those lines.
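
A purely hypothetical shape for such an API, just to make the idea
concrete (the attribute below does not exist anywhere):

/* Hypothetical rtnetlink attribute: worst-case encap overhead of a
 * netdev in bytes, e.g. 50 for VXLAN over IPv4 (14 Ethernet + 20 IP +
 * 8 UDP + 8 VXLAN).  An orchestrator like libvirt could then set
 * guest_mtu = phys_mtu - max(overhead over all egress paths).
 */
IFLA_ENCAP_OVERHEAD,	/* u32, read-only */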

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]                                   ` <20141203220244.GA8822-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-12-03 22:50                                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2014-12-03 22:50 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	Jason Wang, Du, Fan, fw-HFFVJYpyMKqzQB+pC5nmwQ,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Wed, Dec 03, 2014 at 10:02:44PM +0000, Thomas Graf wrote:
> On 12/03/14 at 11:38am, Jesse Gross wrote:
> > On Wed, Dec 3, 2014 at 10:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > Both approaches seem strange. You are sending 1 packet an hour to
> > > some destination behind 100 tunnels. Why would you want to
> > > cut down your MTU for all packets? On the other hand,
> > > doubling the amount of packets because your MTU is off
> > > by a couple of bytes will hurt performance significantly.
> > >
> > > Still, if you want to cut down the MTU within guest,
> > > that's only an ifconfig away.
> > > Most people would not want to bother, I think it's a good
> > > idea to make PMTU work properly for them.
> > 
> > I care about correctness first, which means that an Ethernet link
> > being exposed to the guest should behave like Ethernet. So, yes, IPX
> > should work if somebody chooses to do that.
> > 
> > Your comments are about performance optimization. That's fine but
> > without a correct base to start from it seems like putting the cart
> > before the horse and is hard to reason about.
> 
> I agree with Jesse in particular about correctnes but Michael has a
> point (which I thing nobod objects to) which is that it may not always
> make sense to force the MTU onto the guest. It clearly makes sense for
> the edge server connected to an overlay but it may not be ideal if
> WAN traffic is VXLAN encapped and local DC traffic is put onto a VLAN.

And it's not like tweaking local MTU for one interface will
magically fix everything.

> That said, I think it is fair to assume that the host knows what role
> it plays and can be configured accordingly, i.e. a Netlink API which
> exposes the encap overhead so libvirt can max() over it force it onto
> the guest or something along those lines.

I'd say let's try to at least fix IP traffic properly.

-- 
MST

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03 22:02                                 ` Thomas Graf
       [not found]                                   ` <20141203220244.GA8822-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-12-03 22:51                                   ` Jesse Gross
  2014-12-03 23:05                                     ` Thomas Graf
  2014-12-04  7:48                                     ` Du Fan
  1 sibling, 2 replies; 57+ messages in thread
From: Jesse Gross @ 2014-12-03 22:51 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev, davem, fw, dev,
	Pravin Shelar

On Wed, Dec 3, 2014 at 2:02 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/03/14 at 11:38am, Jesse Gross wrote:
>> On Wed, Dec 3, 2014 at 10:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > Both approaches seem strange. You are sending 1 packet an hour to
>> > some destination behind 100 tunnels. Why would you want to
>> > cut down your MTU for all packets? On the other hand,
>> > doubling the amount of packets because your MTU is off
>> > by a couple of bytes will hurt performance significantly.
>> >
>> > Still, if you want to cut down the MTU within guest,
>> > that's only an ifconfig away.
>> > Most people would not want to bother, I think it's a good
>> > idea to make PMTU work properly for them.
>>
>> I care about correctness first, which means that an Ethernet link
>> being exposed to the guest should behave like Ethernet. So, yes, IPX
>> should work if somebody chooses to do that.
>>
>> Your comments are about performance optimization. That's fine but
>> without a correct base to start from it seems like putting the cart
>> before the horse and is hard to reason about.
>
> I agree with Jesse in particular about correctnes but Michael has a
> point (which I thing nobod objects to) which is that it may not always
> make sense to force the MTU onto the guest. It clearly makes sense for
> the edge server connected to an overlay but it may not be ideal if
> WAN traffic is VXLAN encapped and local DC traffic is put onto a VLAN.

The question is whether you would do this in a single L2 segment. It's
possible, of course, but probably not a great idea and I'm not sure
that it's really worth optimizing for. We do have one existing example
of this type of MTU reduction - the bridge device when you attach
multiple devices with varying MTUs (including a VXLAN device). In that
case, the bridge device is effectively acting as a connection point,
similar to virtio in a VM.

My proposal would be something like this:
 * For L2, reduce the VM MTU to the lowest common denominator on the segment.
 * For L3, use path MTU discovery or fragment inner packet (i.e.
normal routing behavior).
 * As a last resort (such as if using an old version of virtio in the
guest), fragment the tunnel packet.
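
Expressed as decision logic, to make the order of fallbacks concrete
(a sketch only; every helper named below is hypothetical):

if (path_is_pure_l2(skb)) {
	/* Nothing to do at xmit time: the guest MTU was already
	 * lowered to the segment minimum when the port was
	 * provisioned. */
} else if (inner_is_ipv4(skb) && inner_df_set(skb)) {
	send_frag_needed(skb, inner_mtu);	/* normal PMTUD */
} else if (inner_is_ipv4(skb)) {
	fragment_inner_packet(skb, inner_mtu);
} else {
	fragment_tunnel_packet(skb);		/* last resort */
}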

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03 22:51                                   ` Jesse Gross
@ 2014-12-03 23:05                                     ` Thomas Graf
       [not found]                                       ` <20141203230551.GC8822-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  2014-12-04  7:48                                     ` Du Fan
  1 sibling, 1 reply; 57+ messages in thread
From: Thomas Graf @ 2014-12-03 23:05 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev, davem, fw, dev,
	Pravin Shelar

On 12/03/14 at 02:51pm, Jesse Gross wrote:
> My proposal would be something like this:
>  * For L2, reduce the VM MTU to the lowest common denominator on the segment.
>  * For L3, use path MTU discovery or fragment inner packet (i.e.
> normal routing behavior).
>  * As a last resort (such as if using an old version of virtio in the
> guest), fragment the tunnel packet.

That's what I had in mind as well, although using a differentiator bit
which indicates to the output path whether the packet is to be
considered switched or routed, and thus whether to send ICMPs. The bit
would be set per flow, thus allowing arbitrary granularity of behaviour.

I haven't fully thought this through yet though.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]                                       ` <20141203230551.GC8822-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-12-04  0:54                                         ` Jesse Gross
  2014-12-04  1:15                                           ` Thomas Graf
  0 siblings, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2014-12-04  0:54 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Michael S. Tsirkin,
	netdev-u79uwXL29TY76Z2rM5mHXA, Jason Wang, Du, Fan,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Wed, Dec 3, 2014 at 3:05 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/03/14 at 02:51pm, Jesse Gross wrote:
>> My proposal would be something like this:
>>  * For L2, reduce the VM MTU to the lowest common denominator on the segment.
>>  * For L3, use path MTU discovery or fragment inner packet (i.e.
>> normal routing behavior).
>>  * As a last resort (such as if using an old version of virtio in the
>> guest), fragment the tunnel packet.
>
> That's what I had in mind as well although using a differentiator bit
> which indicates to the output path whether the packet is to be
> considered switched or routed and thus send ICMPs. The bit would be set
> per flow, thus allowing arbitary granularity of behaviour.

I don't think that we actually need a bit. I would expect ICMP
generation to be coupled with routing (although this requires being
able to know what the ultimate MTU is at the time of routing the inner
packet). If that's the case, you just need to steer between L2 and L3
processing in the same way that you would today, and ICMP would just
happen in the right cases.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-04  0:54                                         ` Jesse Gross
@ 2014-12-04  1:15                                           ` Thomas Graf
  2014-12-04  1:51                                             ` Jesse Gross
  0 siblings, 1 reply; 57+ messages in thread
From: Thomas Graf @ 2014-12-04  1:15 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev, davem, fw, dev,
	Pravin Shelar

On 12/03/14 at 04:54pm, Jesse Gross wrote:
> I don't think that we actually need a bit. I would expect that ICMP
> generation to be coupled with routing (although this requires being
> able to know what the ultimate MTU is at the time of routing the inner
> packet). If that's the case, you just need to steer between L2 and L3
> processing in the same way that you would today and ICMP would just
> come in the right cases.

I think the MTU awareness is solvable, but how do you steer between
L2 and L3? How do you differentiate between an L3 ACL rule in L2 mode
and an actual L3-based forward? dec_ttl? This is what drove me to
the user-controlled bit, and it became appealing as it allows
enabling/disabling PMTU per guest or even per flow/route.
^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-04  1:15                                           ` Thomas Graf
@ 2014-12-04  1:51                                             ` Jesse Gross
  2014-12-04  9:26                                               ` Thomas Graf
  0 siblings, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2014-12-04  1:51 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev, davem, fw, dev,
	Pravin Shelar

On Wed, Dec 3, 2014 at 5:15 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/03/14 at 04:54pm, Jesse Gross wrote:
>> I don't think that we actually need a bit. I would expect that ICMP
>> generation to be coupled with routing (although this requires being
>> able to know what the ultimate MTU is at the time of routing the inner
>> packet). If that's the case, you just need to steer between L2 and L3
>> processing in the same way that you would today and ICMP would just
>> come in the right cases.
>
> I think the MTU awareness is solveable but how do you steer between
> L2 and L3? How do you differentiate between an L3 ACL rule in L2 mode
> and an actual L3 based forward? dec_ttl? This is what drove me to
> the user controlled bit and it became appealing as it allows to
> enable/disable PMTU per guest or even per flow/route.

I think it depends on where you put the PMTU check. If routing is
happening in OVS, where it is decomposed into several discrete actions
like set MAC and decrement TTL, then perhaps there is another explicit
action to check the MTU. If it is happening in the context of the IP
stack, then ICMP generation occurs automatically, and you get that
if you write a flow to send a packet there. In each case, it seems
like a flow would be steering you by way of an action to do routing, so
you would have fine-grained control. I don't see this as conflicting
with L3 ACLs in an L2 context, in the same way that you don't have to
automatically decrement the TTL.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03 22:51                                   ` Jesse Gross
  2014-12-03 23:05                                     ` Thomas Graf
@ 2014-12-04  7:48                                     ` Du Fan
  2014-12-04 23:23                                       ` Jesse Gross
  1 sibling, 1 reply; 57+ messages in thread
From: Du Fan @ 2014-12-04  7:48 UTC (permalink / raw)
  To: Jesse Gross, Pravin Shelar
  Cc: Thomas Graf, Michael S. Tsirkin, Du, Fan, Jason Wang, netdev,
	davem, fw, dev

On 2014-12-04 06:51, Jesse Gross wrote:
> My proposal would be something like this:
>   * For L2, reduce the VM MTU to the lowest common denominator on the segment.
>   * For L3, use path MTU discovery or fragment inner packet (i.e.
> normal routing behavior).
>   * As a last resort (such as if using an old version of virtio in the
> guest), fragment the tunnel packet.

After some investigation of the OpenvSwitch package, it seems that
before commit 06021dcb ("datapath: compat: Fix compilation 3.11") the
OpenvSwitch package was doing GSO on its own.

rpl_ip_local_out
   -> tnl_skb_gso_segment
       ^^^^^^^^^^^^^^^
          GSO is performed in the function above
     -> ip_local_out
           .
           .
         -> ip_finish_output

Which means that when an over-MTU-sized skb enters ip_local_out, it is
no longer a GSO-type skb, so the stack performs IP fragmentation and
sends the fragments out. So over-MTU-sized skbs did travel through the
stack to the outside.
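
The compat path is worth spelling out; very roughly (a simplified
sketch from memory, not the exact pre-3.11 compat code, mirroring the
segment-then-send loop the stack itself uses elsewhere):

static int compat_gso_out(struct sk_buff *skb)
{
	struct sk_buff *segs = skb_gso_segment(skb, 0);
	int err = 0;

	if (IS_ERR(segs))
		return PTR_ERR(segs);
	consume_skb(skb);

	do {
		struct sk_buff *next = segs->next;

		segs->next = NULL;
		/* Each segment is a plain skb now, so
		 * ip_finish_output() can IP-fragment it. */
		err = ip_local_out(segs);
		segs = next;
	} while (segs);

	return err;
}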

Why was this OpenvSwitch-level GSO operation dropped after 3.10?



-- 
No zuo no die but I have to try.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-04  1:51                                             ` Jesse Gross
@ 2014-12-04  9:26                                               ` Thomas Graf
  2014-12-04 23:19                                                 ` Jesse Gross
  0 siblings, 1 reply; 57+ messages in thread
From: Thomas Graf @ 2014-12-04  9:26 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev, davem, fw, dev,
	Pravin Shelar

On 12/03/14 at 05:51pm, Jesse Gross wrote:
> I think it depends on where you put the PMTU check. If routing is
> happening in OVS where it is decomposed in several discrete actions
> like set MAC and decrement TTL then perhaps there is another explicit
> action to check the MTU. If it is happening in the context of the IP
> stack, then ICMP generation occurs automatically and if you get that
> if you write a flow to send a packet there. In each case, it seems
> like a flow would be steering you by way of an action to do routing so
> you would have fine grained control. I don't see this as conflicting
> with L3 ACLs in an L2 context in the same way that you don't have to
> automatically decrement the TTL.

OK, I was under the impression that you would detect L3 behaviour
desire in OVS without an explicit action. Hence the worry on wrong
classification for L3 ACLs.

Whether we mark the packet and defer the MTU check or we check right
in the action is an implementation detail in the end. I think we
agree on concept level that we need an API to control behaviour.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]                                 ` <547F5CC2.8000908-VXdhtT5mjnY@public.gmane.org>
@ 2014-12-04 10:17                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2014-12-04 10:17 UTC (permalink / raw)
  To: Rick Jones
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	Jason Wang, Du, Fan, fw-HFFVJYpyMKqzQB+pC5nmwQ,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Wed, Dec 03, 2014 at 10:56:02AM -0800, Rick Jones wrote:
> Trying to "fake-out" an ICMP message to paper-over "devices" in the "middle"
> of same Layer2 network having different MTUs from the ends goes back to at
> least the days when people started joining FDDI networks to Ethernet
> networks with bridges rather than routers.  Perhaps back even further.  It
> had problems them (including not all traffic being IP), I doubt today would
> be any different.
> 
> All devices in a Layer2 network must have the same MTU. (*)
> 
> The only exception to that which does not lead one down a rathole is that
> you can have the MTU at the "edges" of the Layer 2 network be smaller than
> the MTU in the "middle."  And then only if the middle "never" tries to talk
> itself to the edges directly.
> 
> rick jones
> 
> (*) Or at least any communicating device must be kept ignorant of what one
> does to have a smaller MTU in there somewhere.

What I've been pointing out is that just because something is within a VM,
we can't assume it's an edge.

-- 
MST

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-04  9:26                                               ` Thomas Graf
@ 2014-12-04 23:19                                                 ` Jesse Gross
  0 siblings, 0 replies; 57+ messages in thread
From: Jesse Gross @ 2014-12-04 23:19 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev, davem, fw, dev,
	Pravin Shelar

On Thu, Dec 4, 2014 at 1:26 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/03/14 at 05:51pm, Jesse Gross wrote:
>> I think it depends on where you put the PMTU check. If routing is
>> happening in OVS where it is decomposed in several discrete actions
>> like set MAC and decrement TTL then perhaps there is another explicit
>> action to check the MTU. If it is happening in the context of the IP
>> stack, then ICMP generation occurs automatically and you get that
>> if you write a flow to send a packet there. In each case, it seems
>> like a flow would be steering you by way of an action to do routing, so
>> you would have fine-grained control. I don't see this as conflicting
>> with L3 ACLs in an L2 context in the same way that you don't have to
>> automatically decrement the TTL.
>
> OK, I was under the impression that you would detect the desire for
> L3 behaviour in OVS without an explicit action. Hence the worry about
> wrong classification for L3 ACLs.
>
> Whether we mark the packet and defer the MTU check, or we check right
> in the action, is an implementation detail in the end. I think we
> agree at the concept level that we need an API to control this behaviour.

Yes, I agree.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-04  7:48                                     ` Du Fan
@ 2014-12-04 23:23                                       ` Jesse Gross
  2014-12-05  0:25                                         ` Du Fan
  0 siblings, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2014-12-04 23:23 UTC (permalink / raw)
  To: Du Fan
  Cc: Pravin Shelar, Thomas Graf, Michael S. Tsirkin, Du, Fan,
	Jason Wang, netdev, davem, fw, dev

On Wed, Dec 3, 2014 at 11:48 PM, Du Fan <fengyuleidian0615@gmail.com> wrote:
> On 2014/12/04 06:51, Jesse Gross wrote:
>>
>> My proposal would be something like this:
>>   * For L2, reduce the VM MTU to the lowest common denominator on the
>> segment.
>>   * For L3, use path MTU discovery or fragment inner packet (i.e.
>> normal routing behavior).
>>   * As a last resort (such as if using an old version of virtio in the
>> guest), fragment the tunnel packet.
>
>
> After some investigation of the OpenvSwitch package, it seems that before
> commit 06021dcb ("datapath: compat: Fix compilation 3.11") the OpenvSwitch
> package was doing GSO on its own.
>
> rpl_ip_local_out
>   -> tnl_skb_gso_segment
>       ^^^^^^^^^^^^^^^
>          Perform GSO  in above function
>     -> ip_local_out
>           .
>           .
>         -> ip_finish_output
>
> Which means that when an over-MTU-sized skb enters ip_local_out, it's not a
> GSO-type skb, so the stack performs IP fragmentation and sends the fragments
> out. So over-MTU-sized skbs did travel through the stack to the outside.
>
> Why was such OpenvSwitch-level GSO operation dropped after 3.10?

The change in 3.11 was that the tunnel infrastructure used by OVS was
upstreamed and shared by all implementations. It's not right to
perform GSO in OVS itself as it prevents the logic from being used by
other components. Breaking up the packet in OVS also eliminates some
of the benefits of GSO by shortening the optimized path and prevents
offloading to hardware.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-04 23:23                                       ` Jesse Gross
@ 2014-12-05  0:25                                         ` Du Fan
  0 siblings, 0 replies; 57+ messages in thread
From: Du Fan @ 2014-12-05  0:25 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Pravin Shelar, Thomas Graf, Michael S. Tsirkin, Du, Fan,
	Jason Wang, netdev, davem, fw, dev


On 2014/12/5 7:23, Jesse Gross wrote:
> On Wed, Dec 3, 2014 at 11:48 PM, Du Fan <fengyuleidian0615@gmail.com> wrote:
>> On 2014/12/04 06:51, Jesse Gross wrote:
>>> My proposal would be something like this:
>>>    * For L2, reduce the VM MTU to the lowest common denominator on the
>>> segment.
>>>    * For L3, use path MTU discovery or fragment inner packet (i.e.
>>> normal routing behavior).
>>>    * As a last resort (such as if using an old version of virtio in the
>>> guest), fragment the tunnel packet.
>>
>> After some investigation on OpenvSwitch package, it seems before this
>> commit: 06021dcb "datapath: compat: Fix compilation 3.11" OpenvSwitch
>> package is doing GSO on its own.
>>
>> rpl_ip_local_out
>>    -> tnl_skb_gso_segment
>>        ^^^^^^^^^^^^^^^
>>           Perform GSO  in above function
>>      -> ip_local_out
>>            .
>>            .
>>          -> ip_finish_output
>>
>> Which means, when over-MTU-sized skb enter ip_local_out, it's not a gso
>> type skb, so the stack perform ip fragmentation, and send them out.
>> So, over-MTU-sized skb did travel through stack into outside.
>>
>> Why not dropping such OpenvSwitch level GSO operation after 3.10?
> The change in 3.11 was that the tunnel infrastructure used by OVS was
> upstreamed and shared by all implementations. It's not right to
> perform GSO in OVS itself as it prevents the logic from being used by
> other components. Breaking up the packet in OVS also eliminates some
> of the benefits of GSO by shortening the optimized path and prevents
> offloading to hardware.
Thanks for your explanation, I understand the background better now.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2014-12-03  2:31                   ` Du, Fan
@ 2015-01-05  6:02                     ` Fan Du
       [not found]                       ` <54AA2912.6090903-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Fan Du @ 2015-01-05  6:02 UTC (permalink / raw)
  To: Du, Fan, Thomas Graf, davem, jesse
  Cc: Michael S. Tsirkin, 'Jason Wang', netdev, fw, dev, pshelar

On 2014/12/03 10:31, Du, Fan wrote:
>
>
>> -----Original Message-----
>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>> Sent: Wednesday, December 3, 2014 1:42 AM
>> To: Michael S. Tsirkin
>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>>
>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>> What about containers or any other virtualization environment that
>>>>> doesn't use Virtio?
>>>>
>>>> The host can dictate the MTU in that case for both veth or OVS
>>>> internal which would be primary container plumbing techniques.
>>>
>>> It typically can't do this easily for VMs with emulated devices:
>>> real ethernet uses a fixed MTU.
>>>
>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>> unrelated optimization.
>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>
>> PMTU discovery only resolves the issue if an actual IP stack is running inside the
>> VM. This may not be the case at all.
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Some thoughts here:
>
> Think of it the other way: this is indeed where the host stack should forge an ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
> message from the _inner_ skb network and transport headers, apply whatever type of encapsulation is in use,
> and thereafter push such a packet upward to the Guest/Container, making them believe the intermediate node
> or the peer sent the message. PMTU should then be expected to work correctly.
> And such behavior should be shared by all other encapsulation techs if they suffer the same problem.

Hi David, Jesse and Thomas

As discussed here: https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4, and
quoting Jesse:
My proposal would be something like this:
  * For L2, reduce the VM MTU to the lowest common denominator on the segment.
  * For L3, use path MTU discovery or fragment inner packet (i.e.
normal routing behavior).
  * As a last resort (such as if using an old version of virtio in the
guest), fragment the tunnel packet.


For L2, it's an administrative action.
For L3, the PMTU approach looks better, because once the sender is alerted to the reduced MTU,
the packet size after encapsulation will not exceed the physical MTU, so no additional
fragmentation effort is needed.
For "As a last resort... fragment the tunnel packet", the original patch:
https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job, but it seems it
was not welcomed.

The raw patch below adopts the PMTU approach, please review! Any kind of comments/suggestions
is welcome.

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index e9f81d4..4d1b221 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1771,6 +1771,134 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
  		tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
  		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);

+		if (skb_is_gso(skb)) {
+			unsigned int inner_l234_hdrlen;
+			unsigned int outer_l34_hdrlen;
+			unsigned int gso_seglen;
+			struct net_device *phy_dev = rt->dst.dev;
+
+			inner_l234_hdrlen = skb_transport_header(skb) - skb_mac_header(skb);
+			if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
+				inner_l234_hdrlen += tcp_hdrlen(skb);
+			if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP)
+				inner_l234_hdrlen += sizeof(struct udphdr);
+
+			outer_l34_hdrlen = sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct vxlanhdr);
+			/* gso_seglen is the on-wire length of one GSO segment after
+			 * encapsulation; if it exceeds the physical netdev MTU, alert
+			 * the sender so it lowers its path MTU.
+			 */
+			gso_seglen = outer_l34_hdrlen + inner_l234_hdrlen + skb_shinfo(skb)->gso_size;
+			if (gso_seglen > phy_dev->mtu) {
+				struct sk_buff *reply;
+				struct ethhdr *orig_eth;
+				struct ethhdr *new_eth;
+				struct ethhdr *tnl_eth;
+				struct iphdr *orig_ip;
+				struct iphdr *new_ip;
+				struct iphdr *tnl_ip;
+				struct icmphdr *new_icmp;
+				unsigned int room;
+				unsigned int data_len;
+				unsigned int reply_l234_hdrlen;
+				unsigned int vxlan_tnl_hdrlen;
+				struct vxlanhdr *vxh;
+				struct udphdr *uh;
+				__wsum csum;
+
+				/* How much room to store the original message */
+				room = (skb->len > 576) ? 576 : skb->len;
+				room -= sizeof(struct iphdr) + sizeof(struct icmphdr);
+
+				/* Ethernet payload len */
+				data_len = skb->len - skb_network_offset(skb);
+				if (data_len > room)
+					data_len = room;
+
+				reply_l234_hdrlen = LL_RESERVED_SPACE(phy_dev) + phy_dev->needed_tailroom +
+									sizeof(struct iphdr) + sizeof(struct icmphdr);
+				vxlan_tnl_hdrlen = LL_RESERVED_SPACE(phy_dev) + phy_dev->needed_tailroom +
+									sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct vxlanhdr);
+
+				reply = alloc_skb(vxlan_tnl_hdrlen + reply_l234_hdrlen + data_len, GFP_ATOMIC);
+				/* On allocation failure just skip the ICMP hint */
+				if (unlikely(!reply))
+					goto xmit;
+				reply->dev = phy_dev;
+				skb_reserve(reply, vxlan_tnl_hdrlen + reply_l234_hdrlen);
+
+				new_icmp = (struct icmphdr *)__skb_push(reply, sizeof(struct icmphdr));
+				new_icmp->type = ICMP_DEST_UNREACH;
+				new_icmp->code = ICMP_FRAG_NEEDED;
+				new_icmp->un.frag.mtu = htons(phy_dev->mtu - outer_l34_hdrlen);
+				new_icmp->checksum = 0;
+
+				new_ip = (struct iphdr *)__skb_push(reply, sizeof(struct iphdr));
+				orig_ip = ip_hdr(skb);
+				new_ip->ihl = 5;
+				new_ip->version = 4;
+				new_ip->ttl = 32;
+				new_ip->tos = 1;
+				new_ip->protocol = IPPROTO_ICMP;
+				new_ip->saddr = orig_ip->daddr;
+				new_ip->daddr = orig_ip->saddr;
+				new_ip->frag_off = 0;
+				new_ip->tot_len = htons(sizeof(struct iphdr) + sizeof(struct icmphdr) + data_len);
+				ip_send_check(new_ip);
+
+				new_eth = (struct ethhdr *)__skb_push(reply, sizeof(struct ethhdr));
+				orig_eth = eth_hdr(skb);
+				ether_addr_copy(new_eth->h_dest, orig_eth->h_source);
+				ether_addr_copy(new_eth->h_source, orig_eth->h_dest);
+				new_eth->h_proto = htons(ETH_P_IP);
+				reply->ip_summed = CHECKSUM_UNNECESSARY;
+				reply->pkt_type = PACKET_HOST;
+				reply->protocol = htons(ETH_P_IP);
+				memcpy(skb_put(reply, data_len), skb_network_header(skb), data_len);
+				new_icmp->checksum = csum_fold(csum_partial(new_icmp, sizeof(struct icmphdr) + data_len, 0));
+
+				/* vxlan encapsulation */
+				vxh = (struct vxlanhdr *)__skb_push(reply, sizeof(*vxh));
+				vxh->vx_flags = htonl(VXLAN_FLAGS);
+				vxh->vx_vni = htonl(vni << 8);
+
+				__skb_push(reply, sizeof(*uh));
+				skb_reset_transport_header(reply);
+				uh = udp_hdr(reply);
+				uh->dest = dst_port;
+				uh->source = src_port;
+				uh->len = htons(reply->len);
+				uh->check = 0;
+				csum = skb_checksum(reply, 0, reply->len, 0);
+				uh->check = udp_v4_check(reply->len, fl4.saddr, dst->sin.sin_addr.s_addr, csum);
+
+				tnl_ip = (struct iphdr *)__skb_push(reply, sizeof(struct iphdr));
+				skb_reset_network_header(reply);
+				tnl_ip->ihl = 5;
+				tnl_ip->version = 4;
+				tnl_ip->ttl = 32;
+				tnl_ip->tos = 1;
+				tnl_ip->protocol = IPPROTO_UDP;
+				tnl_ip->saddr = dst->sin.sin_addr.s_addr;
+				tnl_ip->daddr = fl4.saddr;
+				tnl_ip->frag_off = 0;
+				tnl_ip->tot_len = htons(reply->len);
+				ip_send_check(tnl_ip);
+
+				/* fill with a placeholder mac header */
+				tnl_eth = (struct ethhdr *)__skb_push(reply, sizeof(struct ethhdr));
+				skb_reset_mac_header(reply);
+				orig_eth = eth_hdr(skb);
+				ether_addr_copy(tnl_eth->h_dest, orig_eth->h_source);
+				ether_addr_copy(tnl_eth->h_source, orig_eth->h_dest);
+				tnl_eth->h_proto = htons(ETH_P_IP);
+				__skb_pull(reply, skb_network_offset(reply));
+
+				/* push the encapsulated ICMP message back to the sender */
+				netif_rx_ni(reply);
+			}
+		}
+xmit:
  		err = vxlan_xmit_skb(vxlan->vn_sock, rt, skb,
  				     fl4.saddr, dst->sin.sin_addr.s_addr,
  				     tos, ttl, df, src_port, dst_port,



-- 
No zuo no die but I have to try.

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]                       ` <54AA2912.6090903-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2015-01-05 17:58                         ` Jesse Gross
  2015-01-06  9:34                           ` Fan Du
  0 siblings, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2015-01-05 17:58 UTC (permalink / raw)
  To: Fan Du
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Michael S. Tsirkin,
	netdev-u79uwXL29TY76Z2rM5mHXA, Jason Wang, Du, Fan,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Mon, Jan 5, 2015 at 1:02 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> On 2014/12/03 10:31, Du, Fan wrote:
>
>>
>>
>>> -----Original Message-----
>>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>>> Sent: Wednesday, December 3, 2014 1:42 AM
>>> To: Michael S. Tsirkin
>>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than
>>> MTU
>>>
>>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>>>
>>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>>>
>>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>>>
>>>>>> What about containers or any other virtualization environment that
>>>>>> doesn't use Virtio?
>>>>>
>>>>>
>>>>> The host can dictate the MTU in that case for both veth or OVS
>>>>> internal which would be primary container plumbing techniques.
>>>>
>>>>
>>>> It typically can't do this easily for VMs with emulated devices:
>>>> real ethernet uses a fixed MTU.
>>>>
>>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>>> unrelated optimization.
>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>>
>>>
>>> PMTU discovery only resolves the issue if an actual IP stack is running
>>> inside the
>>> VM. This may not be the case at all.
>>
>>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> Some thoughts here:
>>
>> Think otherwise, this is indeed what host stack should forge a
>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
>> message with _inner_ skb network and transport header, do whatever type of
>> encapsulation,
>> and thereafter push such packet upward to Guest/Container, which make them
>> feel, the intermediate node
>> or the peer send such message. PMTU should be expected to work correct.
>> And such behavior should be shared by all other encapsulation tech if they
>> are also suffered.
>
>
> Hi David, Jesse and Thomas
>
> As discussed here:
> https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4, and
> quoting Jesse:
> My proposal would be something like this:
>  * For L2, reduce the VM MTU to the lowest common denominator on the
> segment.
>  * For L3, use path MTU discovery or fragment inner packet (i.e.
> normal routing behavior).
>  * As a last resort (such as if using an old version of virtio in the
> guest), fragment the tunnel packet.
>
>
> For L2, it's an administrative action.
> For L3, the PMTU approach looks better, because once the sender is alerted
> to the reduced MTU,
> the packet size after encapsulation will not exceed the physical MTU, so no
> additional fragmentation
> effort is needed.
> For "As a last resort... fragment the tunnel packet", the original patch:
> https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job, but
> it seems it
> was not welcomed.

This needs to be properly integrated into IP processing if it is to
work correctly. One of the reasons for only doing path MTU discovery
for L3 is that it operates seamlessly as part of normal operation -
there is no need to forge addresses or potentially generate ICMP when
on an L2 network. However, this ignores the IP handling that is going
on (note that in OVS it is possible for L3 to be implemented as a set
of flows coming from a controller).

It also should not be VXLAN specific or duplicate VXLAN encapsulation
code. As this is happening before encapsulation, the generated ICMP
does not need to be encapsulated either if it is created in the right
location.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2015-01-05 17:58                         ` Jesse Gross
@ 2015-01-06  9:34                           ` Fan Du
  2015-01-06 19:11                             ` Jesse Gross
  0 siblings, 1 reply; 57+ messages in thread
From: Fan Du @ 2015-01-06  9:34 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Du, Fan, Thomas Graf, davem, Michael S. Tsirkin, Jason Wang,
	netdev, fw, dev, pshelar


On 2015/1/6 1:58, Jesse Gross wrote:
> On Mon, Jan 5, 2015 at 1:02 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
>> On 2014/12/03 10:31, Du, Fan wrote:
>>
>>>
>>>> -----Original Message-----
>>>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>>>> Sent: Wednesday, December 3, 2014 1:42 AM
>>>> To: Michael S. Tsirkin
>>>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>>>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>>>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than
>>>> MTU
>>>>
>>>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>>>> What about containers or any other virtualization environment that
>>>>>>> doesn't use Virtio?
>>>>>>
>>>>>> The host can dictate the MTU in that case for both veth or OVS
>>>>>> internal which would be primary container plumbing techniques.
>>>>>
>>>>> It typically can't do this easily for VMs with emulated devices:
>>>>> real ethernet uses a fixed MTU.
>>>>>
>>>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>>>> unrelated optimization.
>>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>>>
>>>> PMTU discovery only resolves the issue if an actual IP stack is running
>>>> inside the
>>>> VM. This may not be the case at all.
>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Some thoughts here:
>>>
>>> Think otherwise, this is indeed what host stack should forge a
>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
>>> message with _inner_ skb network and transport header, do whatever type of
>>> encapsulation,
>>> and thereafter push such packet upward to Guest/Container, which make them
>>> feel, the intermediate node
>>> or the peer send such message. PMTU should be expected to work correct.
>>> And such behavior should be shared by all other encapsulation tech if they
>>> are also suffered.
>>
>> Hi David, Jesse and Thomas
>>
>> As discussed in here:
>> https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4 and
>> quotes from Jesse:
>> My proposal would be something like this:
>>   * For L2, reduce the VM MTU to the lowest common denominator on the
>> segment.
>>   * For L3, use path MTU discovery or fragment inner packet (i.e.
>> normal routing behavior).
>>   * As a last resort (such as if using an old version of virtio in the
>> guest), fragment the tunnel packet.
>>
>>
>> For L2, it's a administrative action
>> For L3, PMTU approach looks better, because once the sender is alerted the
>> reduced MTU,
>> packet size after encapsulation will not exceed physical MTU, so no
>> additional fragments
>> efforts needed.
>> For "As a last resort... fragment the tunnel packet", the original patch:
>> https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job, but
>> seems it's
>> not welcomed.
> This needs to be properly integrated into IP processing if it is to
> work correctly.
Do you mean the original patch in this thread? Yes, it works correctly
in my cloud env. If you have any other concerns, please let me know. :)
> One of the reasons for only doing path MTU discovery
> for L3 is that it operates seamlessly as part of normal operation -
> there is no need to forge addresses or potentially generate ICMP when
> on an L2 network. However, this ignores the IP handling that is going
> on (note that in OVS it is possible for L3 to be implemented as a set
> of flows coming from a controller).
>
> It also should not be VXLAN specific or duplicate VXLAN encapsulation
> code. As this is happening before encapsulation, the generated ICMP
> does not need to be encapsulated either if it is created in the right
> location.
Yes, I agree. GRE shares the same issue, judging from the code flow.
Pushing the ICMP msg back without encapsulation, and without circulating it down
to the physical device, is possible. The "right location" as far as I know
could only be in ovs_vport_send. In addition this probably requires a wrapper
route lookup operation for GRE/VXLAN; after getting the underlay device MTU
from the routing information, calculating the reduced MTU becomes feasible.
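
Something along these lines, as a rough sketch only (tnl_reduced_mtu is
a made-up name, and the headroom shown is the VXLAN case):

	/* Given the route the encapsulated packet would take, work out
	 * the largest inner MTU that still fits the underlay device.
	 */
	static unsigned int tnl_reduced_mtu(const struct rtable *rt)
	{
		return rt->dst.dev->mtu - (sizeof(struct iphdr) +
					   sizeof(struct udphdr) +
					   sizeof(struct vxlanhdr));
	}

GRE would subtract its own header overhead instead.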

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2015-01-06  9:34                           ` Fan Du
@ 2015-01-06 19:11                             ` Jesse Gross
       [not found]                               ` <CAEP_g=8bCR=PeSoi09jLWLtNUrxhzx45h1Wm=9D=R57AqUac2w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2015-01-06 19:11 UTC (permalink / raw)
  To: Fan Du
  Cc: Du, Fan, Thomas Graf, davem, Michael S. Tsirkin, Jason Wang,
	netdev, fw, dev, pshelar

On Tue, Jan 6, 2015 at 4:34 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
>
> On 2015/1/6 1:58, Jesse Gross wrote:
>>
>> On Mon, Jan 5, 2015 at 1:02 AM, Fan Du <fengyuleidian0615@gmail.com>
>> wrote:
>>>
>>> On 2014/12/03 10:31, Du, Fan wrote:
>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>>>>> Sent: Wednesday, December 3, 2014 1:42 AM
>>>>> To: Michael S. Tsirkin
>>>>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>>>>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>>>>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger
>>>>> than
>>>>> MTU
>>>>>
>>>>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>>>>>
>>>>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>>>>>
>>>>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>>>>>
>>>>>>>> What about containers or any other virtualization environment that
>>>>>>>> doesn't use Virtio?
>>>>>>>
>>>>>>>
>>>>>>> The host can dictate the MTU in that case for both veth or OVS
>>>>>>> internal which would be primary container plumbing techniques.
>>>>>>
>>>>>>
>>>>>> It typically can't do this easily for VMs with emulated devices:
>>>>>> real ethernet uses a fixed MTU.
>>>>>>
>>>>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>>>>> unrelated optimization.
>>>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>>>>
>>>>>
>>>>> PMTU discovery only resolves the issue if an actual IP stack is running
>>>>> inside the
>>>>> VM. This may not be the case at all.
>>>>
>>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>
>>>> Some thoughts here:
>>>>
>>>> Think otherwise, this is indeed what host stack should forge a
>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
>>>> message with _inner_ skb network and transport header, do whatever type
>>>> of
>>>> encapsulation,
>>>> and thereafter push such packet upward to Guest/Container, which make
>>>> them
>>>> feel, the intermediate node
>>>> or the peer send such message. PMTU should be expected to work correct.
>>>> And such behavior should be shared by all other encapsulation tech if
>>>> they
>>>> are also suffered.
>>>
>>>
>>> Hi David, Jesse and Thomas
>>>
>>> As discussed in here:
>>> https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4 and
>>> quotes from Jesse:
>>> My proposal would be something like this:
>>>   * For L2, reduce the VM MTU to the lowest common denominator on the
>>> segment.
>>>   * For L3, use path MTU discovery or fragment inner packet (i.e.
>>> normal routing behavior).
>>>   * As a last resort (such as if using an old version of virtio in the
>>> guest), fragment the tunnel packet.
>>>
>>>
>>> For L2, it's a administrative action
>>> For L3, PMTU approach looks better, because once the sender is alerted
>>> the
>>> reduced MTU,
>>> packet size after encapsulation will not exceed physical MTU, so no
>>> additional fragments
>>> efforts needed.
>>> For "As a last resort... fragment the tunnel packet", the original patch:
>>> https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job,
>>> but
>>> seems it's
>>> not welcomed.
>>
>> This needs to be properly integrated into IP processing if it is to
>> work correctly.
>
> Do you mean the original patch in this thread? Yes, it works correctly
> in my cloud env. If you have any other concerns, please let me know. :)

Ok...but that doesn't actually address the points that I made.

>> One of the reasons for only doing path MTU discovery
>> for L3 is that it operates seamlessly as part of normal operation -
>> there is no need to forge addresses or potentially generate ICMP when
>> on an L2 network. However, this ignores the IP handling that is going
>> on (note that in OVS it is possible for L3 to be implemented as a set
>> of flows coming from a controller).
>>
>> It also should not be VXLAN specific or duplicate VXLAN encapsulation
>> code. As this is happening before encapsulation, the generated ICMP
>> does not need to be encapsulated either if it is created in the right
>> location.
>
> Yes, I agree. GRE shares the same issue, judging from the code flow.
> Pushing the ICMP msg back without encapsulation, and without circulating it
> down
> to the physical device, is possible. The "right location" as far as I know
> could only be in ovs_vport_send. In addition this probably requires a
> wrapper
> route lookup operation for GRE/VXLAN; after getting the underlay device MTU
> from the routing information, calculating the reduced MTU becomes feasible.

As I said, it needs to be integrated into L3 processing. In OVS this
would mean adding some primitives to the kernel and then exposing the
functionality upwards into userspace/controller.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]                               ` <CAEP_g=8bCR=PeSoi09jLWLtNUrxhzx45h1Wm=9D=R57AqUac2w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-07  5:58                                 ` Fan Du
  2015-01-07 20:52                                   ` Jesse Gross
  0 siblings, 1 reply; 57+ messages in thread
From: Fan Du @ 2015-01-07  5:58 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Michael S. Tsirkin,
	netdev-u79uwXL29TY76Z2rM5mHXA, Jason Wang, Du, Fan,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 2015/01/07 03:11, Jesse Gross wrote:
>>> One of the reasons for only doing path MTU discovery
>>> >>for L3 is that it operates seamlessly as part of normal operation -
>>> >>there is no need to forge addresses or potentially generate ICMP when
>>> >>on an L2 network. However, this ignores the IP handling that is going
>>> >>on (note that in OVS it is possible for L3 to be implemented as a set
>>> >>of flows coming from a controller).
>>> >>
>>> >>It also should not be VXLAN specific or duplicate VXLAN encapsulation
>>> >>code. As this is happening before encapsulation, the generated ICMP
>>> >>does not need to be encapsulated either if it is created in the right
>>> >>location.
>> >
>> >Yes, I agree. GRE share the same issue from the code flow.
>> >Pushing back ICMP msg back without encapsulation without circulating down
>> >to physical device is possible. The "right location" as far as I know
>> >could only be in ovs_vport_send. In addition this probably requires wrapper
>> >route looking up operation for GRE/VXLAN, after get the under layer device
>> >MTU
>> >from the routing information, then calculate reduced MTU becomes feasible.
> As I said, it needs to be integrated into L3 processing. In OVS this
> would mean adding some primitives to the kernel and then exposing the
> functionality upwards into userspace/controller.

I'm a bit confused by the "L3 processing" you mentioned here... SORRY
Apparently I'm not seeing the whole picture you pointed out. Could you please
elaborate on "L3 processing" a bit more? Docs/code/or other useful links appreciated.

My understanding is:
the controller sets the forwarding rules in the kernel datapath, and any flow not matching
the rules is thrown up to the controller by an upcall. Once the rule decision is made
by the controller, the flow's packet is pushed back down to the datapath to be forwarded
again according to the new rule.

So I'm not sure whether pushing the over-MTU-sized packet, or pushing the forged ICMP
without encapsulation, up to the controller is required by the current ovs implementation.
By doing so, such an over-MTU-sized packet is treated as an event for the controller to
take care of.



-- 
No zuo no die but I have to try.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2015-01-07  5:58                                 ` Fan Du
@ 2015-01-07 20:52                                   ` Jesse Gross
       [not found]                                     ` <CAEP_g=8EBeQUFkRRsG3sznYryd+LE9qJKWQXfS==HG2HDO=UKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2015-01-07 20:52 UTC (permalink / raw)
  To: Fan Du
  Cc: Du, Fan, Thomas Graf, davem, Michael S. Tsirkin, Jason Wang,
	netdev, fw, dev, pshelar

On Tue, Jan 6, 2015 at 9:58 PM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> On 2015/01/07 03:11, Jesse Gross wrote:
>>>>
>>>> One of the reasons for only doing path MTU discovery
>>>> >>for L3 is that it operates seamlessly as part of normal operation -
>>>> >>there is no need to forge addresses or potentially generate ICMP when
>>>> >>on an L2 network. However, this ignores the IP handling that is going
>>>> >>on (note that in OVS it is possible for L3 to be implemented as a set
>>>> >>of flows coming from a controller).
>>>> >>
>>>> >>It also should not be VXLAN specific or duplicate VXLAN encapsulation
>>>> >>code. As this is happening before encapsulation, the generated ICMP
>>>> >>does not need to be encapsulated either if it is created in the right
>>>> >>location.
>>>
>>> >
>>> >Yes, I agree. GRE share the same issue from the code flow.
>>> >Pushing back ICMP msg back without encapsulation without circulating
>>> > down
>>> >to physical device is possible. The "right location" as far as I know
>>> >could only be in ovs_vport_send. In addition this probably requires
>>> > wrapper
>>> >route looking up operation for GRE/VXLAN, after get the under layer
>>> > device
>>> >MTU
>>> >from the routing information, then calculate reduced MTU becomes
>>> > feasible.
>>
>> As I said, it needs to be integrated into L3 processing. In OVS this
>> would mean adding some primitives to the kernel and then exposing the
>> functionality upwards into userspace/controller.
>
>
> I'm a bit confused by the "L3 processing" you mentioned here... SORRY
> Apparently I'm not seeing the whole picture you pointed out. Could you
> please
> elaborate on "L3 processing" a bit more? Docs/code/or other useful links
> appreciated.

L3 processing is anywhere that routing takes place - i.e. where you
would decrement the TTL and change the MAC addresses. Part of routing
is dealing with differing MTUs, so that needs to be integrated into
the same logic.

> My understanding is:
> the controller sets the forwarding rules in the kernel datapath, and any
> flow not matching the rules is thrown up to the controller by an upcall.
> Once the rule decision is made by the controller, the flow's packet is
> pushed back down to the datapath to be forwarded again according to the
> new rule.
>
> So I'm not sure whether pushing the over-MTU-sized packet, or pushing the
> forged ICMP without encapsulation, up to the controller is required by the
> current ovs implementation. By doing so, such an over-MTU-sized packet is
> treated as an event for the controller to take care of.

If flows are implementing routing (again, they are doing things like
decrementing the TTL) then it is necessary for them to also handle
this situation using some potentially new primitives (like a size
check). Otherwise you end up with issues like the ones that I
mentioned above like needing to forge addresses because you don't know
what the correct ones are. If the flows aren't doing things to
implement routing, then you really have a flat L2 network and you
shouldn't be doing this type of behavior at all as I described in the
original plan.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]                                     ` <CAEP_g=8EBeQUFkRRsG3sznYryd+LE9qJKWQXfS==HG2HDO=UKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-08  9:39                                       ` Fan Du
  2015-01-08 19:55                                         ` Jesse Gross
  0 siblings, 1 reply; 57+ messages in thread
From: Fan Du @ 2015-01-08  9:39 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Michael S. Tsirkin,
	netdev-u79uwXL29TY76Z2rM5mHXA, Jason Wang, Du, Fan,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 2015/01/08 04:52, Jesse Gross wrote:
>> My understanding is:
>> >controller sets the forwarding rules into kernel datapath, any flow not
>> >matching
>> >with the rules are threw to controller by upcall. Once the rule decision is
>> >made
>> >by controller, then, this flow packet is pushed down to datapath to be
>> >forwarded
>> >again according to the new rule.
>> >
>> >So I'm not sure whether pushing the over-MTU-sized packet or pushing the
>> >forged ICMP
>> >without encapsulation to controller is required by current ovs
>> >implementation. By doing
>> >so, such over-MTU-sized packet is treated as a event for the controller to
>> >be take
>> >care of.
> If flows are implementing routing (again, they are doing things like
> decrementing the TTL) then it is necessary for them to also handle
> this situation using some potentially new primitives (like a size
> check). Otherwise you end up with issues like the ones that I
> mentioned above like needing to forge addresses because you don't know
> what the correct ones are.

Thanks for explaining, Jesse!

Btw, I don't get the point about "forging addresses": building the ICMP message
from the Guest packet doesn't require forging addresses when the ICMP message is
not encapsulated with outer headers.

> If the flows aren't doing things to
> implement routing, then you really have a flat L2 network and you
> shouldn't be doing this type of behavior at all as I described in the
> original plan.

For the flows-implementing-routing scenario:
First of all, an over-MTU-sized packet can only be detected once the flow
has been consulted (each port could implement a 'check' hook to do this),
just before sending to the actual port.

Then, pushing the over-MTU-sized packet back to the controller, it's the controller
who will decide whether to build an ICMP message, or whatever routing behaviour
it may take, and send it back with the port information. This ICMP message will
travel back to the Guest.

Why does the flow have to use a primitive like "check size"? "check size"
will only take effect after do_output. I'm not very clear on this approach.

And not every scenario involves flows with routing behaviour; one might just
set up a vxlan tunnel and attach a KVM guest or Docker onto it for playing or
developing. This wouldn't necessarily require the user to set additional
specific flows to make over-MTU-sized packets pass through the tunnel
correctly. In such a scenario, I think the original patch in this thread to
fragment the tunnel packet is still needed, OR we could work out a generic
component to build ICMP for all tunnel types at the L2 level. Either of those
would act as a backup plan, since there is no such specific flow by default.

If I missed something obvious, please let me know.

-- 
No zuo no die but I have to try.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2015-01-08  9:39                                       ` Fan Du
@ 2015-01-08 19:55                                         ` Jesse Gross
       [not found]                                           ` <CAEP_g=9hh+MG7AWEnct7CwRqp=ZghpbkDeQ5BhGQktDgMST1jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-09  5:48                                           ` Fan Du
  0 siblings, 2 replies; 57+ messages in thread
From: Jesse Gross @ 2015-01-08 19:55 UTC (permalink / raw)
  To: Fan Du
  Cc: Du, Fan, Thomas Graf, davem, Michael S. Tsirkin, Jason Wang,
	netdev, fw, dev, pshelar

On Thu, Jan 8, 2015 at 1:39 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> On 2015/01/08 04:52, Jesse Gross wrote:
>>>
>>> My understanding is:
>>> >controller sets the forwarding rules into kernel datapath, any flow not
>>> >matching
>>> >with the rules are threw to controller by upcall. Once the rule decision
>>> > is
>>> >made
>>> >by controller, then, this flow packet is pushed down to datapath to be
>>> >forwarded
>>> >again according to the new rule.
>>> >
>>> >So I'm not sure whether pushing the over-MTU-sized packet or pushing the
>>> >forged ICMP
>>> >without encapsulation to controller is required by current ovs
>>> >implementation. By doing
>>> >so, such over-MTU-sized packet is treated as a event for the controller
>>> > to
>>> >be take
>>> >care of.
>>
>> If flows are implementing routing (again, they are doing things like
>> decrementing the TTL) then it is necessary for them to also handle
>> this situation using some potentially new primitives (like a size
>> check). Otherwise you end up with issues like the ones that I
>> mentioned above like needing to forge addresses because you don't know
>> what the correct ones are.
>
>
> Thanks for explaining, Jesse!
>
> Btw, I don't get the point about "forging addresses": building the ICMP
> message from the Guest packet doesn't require forging addresses when the
> ICMP message is not encapsulated with outer headers.

Your patch has things like this (for the inner IP header):

+                               new_ip->saddr = orig_ip->daddr;
+                               new_ip->daddr = orig_ip->saddr;

These addresses are owned by the endpoints, not the host generating
the ICMP message, so I would consider that to be forging
addresses.

>> If the flows aren't doing things to
>>
>> implement routing, then you really have a flat L2 network and you
>> shouldn't be doing this type of behavior at all as I described in the
>> original plan.
>
>
> For the flows-implementing-routing scenario:
> First of all, an over-MTU-sized packet can only be detected once the flow
> has been consulted (each port could implement a 'check' hook to do this),
> just before sending to the actual port.
>
> Then, pushing the over-MTU-sized packet back to the controller, it's the
> controller
> who will decide whether to build an ICMP message, or whatever routing
> behaviour
> it may take, and send it back with the port information. This ICMP message
> will
> travel back to the Guest.
>
> Why does the flow have to use a primitive like "check size"? "check size"
> will only take effect after do_output. I'm not very clear on this
> approach.

Checking the size obviously needs to be an action that would take
place before outputting in order for it to have any effect. Attaching
a check to a port does not fit in very well with the other primitives
of OVS, so I think an action is the obvious place to put it.

> And not every scenario involves flows with routing behaviour; one might
> just set up a vxlan tunnel and attach a KVM guest or Docker onto it for
> playing or developing. This wouldn't necessarily require the user to set
> additional specific flows to make over-MTU-sized packets pass through the
> tunnel correctly. In such a scenario, I think the original patch in this
> thread to fragment the tunnel packet is still needed, OR we could work out
> a generic component to build ICMP for all tunnel types at the L2 level.
> Either of those would act as a backup plan, since there is no such
> specific flow by default.

In these cases, we should find a way to adjust the MTU, preferably
automatically using virtio.
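
As a rough sketch of the guest side of such an automatic adjustment,
assuming a host-advertised MTU field in the virtio-net config space
(VIRTIO_NET_F_MTU and the config field are assumptions here, not
something the current device exposes):

	/* During probe: clamp the guest MTU to what the host says still
	 * fits after whatever encapsulation the host performs.
	 */
	if (virtio_has_feature(vdev, VIRTIO_NET_F_MTU))
		dev->mtu = virtio_cread16(vdev,
				offsetof(struct virtio_net_config, mtu));

The host would populate that field from the underlay MTU minus the
tunnel overhead.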

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
       [not found]                                           ` <CAEP_g=9hh+MG7AWEnct7CwRqp=ZghpbkDeQ5BhGQktDgMST1jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-09  5:42                                             ` Fan Du
  2015-01-12 18:48                                               ` Jesse Gross
  0 siblings, 1 reply; 57+ messages in thread
From: Fan Du @ 2015-01-09  5:42 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Michael S. Tsirkin,
	netdev-u79uwXL29TY76Z2rM5mHXA, Jason Wang, Du, Fan,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 2015/01/09 03:55, Jesse Gross wrote:
> On Thu, Jan 8, 2015 at 1:39 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> On 2015/01/08 04:52, Jesse Gross wrote:
>>>>
>>>> My understanding is:
>>>>> controller sets the forwarding rules into kernel datapath, any flow not
>>>>> matching
>>>>> with the rules are threw to controller by upcall. Once the rule decision
>>>>> is
>>>>> made
>>>>> by controller, then, this flow packet is pushed down to datapath to be
>>>>> forwarded
>>>>> again according to the new rule.
>>>>>
>>>>> So I'm not sure whether pushing the over-MTU-sized packet or pushing the
>>>>> forged ICMP
>>>>> without encapsulation to controller is required by current ovs
>>>>> implementation. By doing
>>>>> so, such over-MTU-sized packet is treated as a event for the controller
>>>>> to
>>>>> be take
>>>>> care of.
>>>
>>> If flows are implementing routing (again, they are doing things like
>>> decrementing the TTL) then it is necessary for them to also handle
>>> this situation using some potentially new primitives (like a size
>>> check). Otherwise you end up with issues like the ones that I
>>> mentioned above like needing to forge addresses because you don't know
>>> what the correct ones are.
>>
>>
>> Thanks for explaining, Jesse!
>>
>> btw, I don't get it about "to forge addresses", building ICMP message
>> with Guest packet doesn't require to forge address when not encapsulating
>> ICMP message with outer headers.
>
> Your patch has things like this (for the inner IP header):
>
> +                               new_ip->saddr = orig_ip->daddr;
> +                               new_ip->daddr = orig_ip->saddr;
>
> These addresses are owned by the endpoints, not the host generating
> the ICMP message, so I would consider that to be forging
> addresses.
>
>> If the flows aren't doing things to
>>>
>>> implement routing, then you really have a flat L2 network and you
>>> shouldn't be doing this type of behavior at all as I described in the
>>> original plan.
>>
>>
>> For flows implementing routing scenario:
>> First of all, over-MTU-sized packet could only be detected once the flow
>> as been consulted(each port could implement a 'check' hook to do this),
>> and just before send to the actual port.
>>
>> Then pushing the over-MTU-sized packet back to controller, it's the
>> controller
>> who will will decide whether to build ICMP message, or whatever routing
>> behaviour
>> it may take. And sent it back with the port information. This ICMP message
>> will
>> travel back to Guest.
>>
>> Why does the flow has to use primitive like a "check size"? "check size"
>> will only take effect after do_output. I'm not very clear with this
>> approach.
>
> Checking the size obviously needs to be an action that would take
> place before outputting in order for it to have any effect. Attaching
> a check to a port does not fit in very well with the other primitives
> of OVS, so I think an action is the obvious place to put it.

If a flow is defined as:

	CHECK_SIZE -> OUTPUT

Then traversing the actions at CHECK_SIZE needs to find the exact OUTPUT port,
and thus get its underlay encapsulation method as well as a valid route for the
physical NIC MTU; with that information it can calculate whether GSOed packets
exceed the physical MTU. This looks feasible at first glance. After this,
it's the controller's responsibility to handle such an event.

If CHECK_SIZE returns positive (an over-MTU-sized packet shows up), then call
output_userspace to push this packet upwards.

I'm not sure this vague idea is the expected behaviour required by "L3 processing".
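
In datapath terms I imagine something like this in do_execute_actions()
(a sketch only: OVS_ACTION_ATTR_CHECK_SIZE is a made-up attribute, and
the upcall call is simplified):

	case OVS_ACTION_ATTR_CHECK_SIZE: {
		/* Attribute carries the largest on-wire segment length
		 * the underlay can take for this output path.
		 */
		u16 max_len = nla_get_u16(a);

		if (skb_is_gso(skb) &&
		    skb_gso_network_seglen(skb) > max_len)
			/* Punt to the controller instead of outputting */
			return output_userspace(dp, skb, key, a);
		break;
	}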

>> And not all scenario involving flow with routing behaviour, just set up a
>> vxlan tunnel, and attach KVM guest or Docker onto it for playing or
>> developing.
>> This wouldn't necessarily require user to set additional specific flows to
>> make
>> over-MTU-sized packet pass through the tunnel correctly. In such scenario, I
>> think the original patch in this thread to fragment tunnel packet is still
>> needed
>> OR workout a generic component to build ICMP for all type tunnel in L2
>> level.
>> Both of those will act as a backup plan as there is no such specific flow as
>> default.
>
> In these cases, we should find a way to adjust the MTU, preferably
> automatically using virtio.
>


-- 
No zuo no die but I have to try.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2015-01-08 19:55                                         ` Jesse Gross
       [not found]                                           ` <CAEP_g=9hh+MG7AWEnct7CwRqp=ZghpbkDeQ5BhGQktDgMST1jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-09  5:48                                           ` Fan Du
  2015-01-12 18:55                                             ` Jesse Gross
  1 sibling, 1 reply; 57+ messages in thread
From: Fan Du @ 2015-01-09  5:48 UTC (permalink / raw)
  To: Jesse Gross, Michael S. Tsirkin
  Cc: Du, Fan, Thomas Graf, davem, Jason Wang, netdev, fw, dev, pshelar

On 2015/01/09 03:55, Jesse Gross wrote:
> On Thu, Jan 8, 2015 at 1:39 AM, Fan Du<fengyuleidian0615@gmail.com>  wrote:
>> >On 2015/01/08 04:52, Jesse Gross wrote:
>>>> >>>
>>>> >>>My understanding is:
>>>>> >>> >controller sets the forwarding rules into kernel datapath, any flow not
>>>>> >>> >matching
>>>>> >>> >with the rules are threw to controller by upcall. Once the rule decision
>>>>> >>> >is
>>>>> >>> >made
>>>>> >>> >by controller, then, this flow packet is pushed down to datapath to be
>>>>> >>> >forwarded
>>>>> >>> >again according to the new rule.
>>>>> >>> >
>>>>> >>> >So I'm not sure whether pushing the over-MTU-sized packet or pushing the
>>>>> >>> >forged ICMP
>>>>> >>> >without encapsulation to controller is required by current ovs
>>>>> >>> >implementation. By doing
>>>>> >>> >so, such over-MTU-sized packet is treated as a event for the controller
>>>>> >>> >to
>>>>> >>> >be take
>>>>> >>> >care of.
>>> >>
>>> >>If flows are implementing routing (again, they are doing things like
>>> >>decrementing the TTL) then it is necessary for them to also handle
>>> >>this situation using some potentially new primitives (like a size
>>> >>check). Otherwise you end up with issues like the ones that I
>>> >>mentioned above like needing to forge addresses because you don't know
>>> >>what the correct ones are.
>> >
>> >
>> >Thanks for explaining, Jesse!
>> >
>> >btw, I don't get it about "to forge addresses", building ICMP message
>> >with Guest packet doesn't require to forge address when not encapsulating
>> >ICMP message with outer headers.
> Your patch has things like this (for the inner IP header):
>
> +                               new_ip->saddr = orig_ip->daddr;
> +                               new_ip->daddr = orig_ip->saddr;
>
> These addresses are owned by the endpoints, not the host generating
> the ICMP message, so I would consider that to be forging
> addresses.
>
>> >If the flows aren't doing things to
>>> >>
>>> >>implement routing, then you really have a flat L2 network and you
>>> >>shouldn't be doing this type of behavior at all as I described in the
>>> >>original plan.
>> >
>> >
>> >For flows implementing routing scenario:
>> >First of all, over-MTU-sized packet could only be detected once the flow
>> >as been consulted(each port could implement a 'check' hook to do this),
>> >and just before send to the actual port.
>> >
>> >Then pushing the over-MTU-sized packet back to controller, it's the
>> >controller
>> >who will will decide whether to build ICMP message, or whatever routing
>> >behaviour
>> >it may take. And sent it back with the port information. This ICMP message
>> >will
>> >travel back to Guest.
>> >
>> >Why does the flow has to use primitive like a "check size"? "check size"
>> >will only take effect after do_output. I'm not very clear with this
>> >approach.
> Checking the size obviously needs to be an action that would take
> place before outputting in order for it to have any effect. Attaching
> a check to a port does not fit in very well with the other primitives
> of OVS, so I think an action is the obvious place to put it.
>
>> >And not all scenario involving flow with routing behaviour, just set up a
>> >vxlan tunnel, and attach KVM guest or Docker onto it for playing or
>> >developing.
>> >This wouldn't necessarily require user to set additional specific flows to
>> >make
>> >over-MTU-sized packet pass through the tunnel correctly. In such scenario, I
>> >think the original patch in this thread to fragment tunnel packet is still
>> >needed
>> >OR workout a generic component to build ICMP for all type tunnel in L2
>> >level.
>> >Both of those will act as a backup plan as there is no such specific flow as
>> >default.
> In these cases, we should find a way to adjust the MTU, preferably
> automatically using virtio.

I'm going to argue this a bit more here.

virtio_net poses no limit on its simulated net device; the MTU can actually fall
anywhere between 68 and 65535. Most importantly, virtio_net just simulates a NIC;
it can't assume/presume there is an encapsulating port downstream of it.
How should virtio automatically adjust the guest MTU above it?



-- 
No zuo no die but I have to try.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2015-01-09  5:42                                             ` Fan Du
@ 2015-01-12 18:48                                               ` Jesse Gross
  0 siblings, 0 replies; 57+ messages in thread
From: Jesse Gross @ 2015-01-12 18:48 UTC (permalink / raw)
  To: Fan Du
  Cc: Du, Fan, Thomas Graf, davem, Michael S. Tsirkin, Jason Wang,
	netdev, fw, dev, pshelar

On Thu, Jan 8, 2015 at 9:42 PM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> On 2015/01/09 03:55, Jesse Gross wrote:
>
>> On Thu, Jan 8, 2015 at 1:39 AM, Fan Du <fengyuleidian0615@gmail.com>
>> wrote:
>>>
>>> On 2015/01/08 04:52, Jesse Gross wrote:
>>>>>
>>>>>
>>>>> My understanding is:
>>>>>>
>>>>>> controller sets the forwarding rules into kernel datapath, any flow
>>>>>> not
>>>>>> matching
>>>>>> with the rules are threw to controller by upcall. Once the rule
>>>>>> decision
>>>>>> is
>>>>>> made
>>>>>> by controller, then, this flow packet is pushed down to datapath to be
>>>>>> forwarded
>>>>>> again according to the new rule.
>>>>>>
>>>>>> So I'm not sure whether pushing the over-MTU-sized packet or pushing
>>>>>> the
>>>>>> forged ICMP
>>>>>> without encapsulation to controller is required by current ovs
>>>>>> implementation. By doing
>>>>>> so, such over-MTU-sized packet is treated as a event for the
>>>>>> controller
>>>>>> to
>>>>>> be take
>>>>>> care of.
>>>>
>>>>
>>>> If flows are implementing routing (again, they are doing things like
>>>> decrementing the TTL) then it is necessary for them to also handle
>>>> this situation using some potentially new primitives (like a size
>>>> check). Otherwise you end up with issues like the ones that I
>>>> mentioned above like needing to forge addresses because you don't know
>>>> what the correct ones are.
>>>
>>>
>>>
>>> Thanks for explaining, Jesse!
>>>
>>> btw, I don't get it about "to forge addresses", building ICMP message
>>> with Guest packet doesn't require to forge address when not encapsulating
>>> ICMP message with outer headers.
>>
>>
>> Your patch has things like this (for the inner IP header):
>>
>> +                               new_ip->saddr = orig_ip->daddr;
>> +                               new_ip->daddr = orig_ip->saddr;
>>
>> These addresses are owned by the endpoints, not the host generating
>> the ICMP message, so I would consider that to be forging
>> addresses.
>>
>>> If the flows aren't doing things to
>>>>
>>>>
>>>> implement routing, then you really have a flat L2 network and you
>>>> shouldn't be doing this type of behavior at all as I described in the
>>>> original plan.
>>>
>>>
>>>
>>> For flows implementing routing scenario:
>>> First of all, over-MTU-sized packet could only be detected once the flow
>>> as been consulted(each port could implement a 'check' hook to do this),
>>> and just before send to the actual port.
>>>
>>> Then pushing the over-MTU-sized packet back to controller, it's the
>>> controller
>>> who will will decide whether to build ICMP message, or whatever routing
>>> behaviour
>>> it may take. And sent it back with the port information. This ICMP
>>> message
>>> will
>>> travel back to Guest.
>>>
>>> Why does the flow has to use primitive like a "check size"? "check size"
>>> will only take effect after do_output. I'm not very clear with this
>>> approach.
>>
>>
>> Checking the size obviously needs to be an action that would take
>> place before outputting in order for it to have any effect. Attaching
>> a check to a port does not fit in very well with the other primitives
>> of OVS, so I think an action is the obvious place to put it.
>
>
> If a flow is defined as:
>
>         CHECK_SIZE -> OUTPUT
>
> Then traversing the actions at CHECK_SIZE needs to find the exact OUTPUT
> port, and thus get its underlay encapsulation method as well as a valid
> route for the physical NIC MTU; with that information it can calculate
> whether GSOed packets exceed the physical MTU. This looks feasible at
> first glance. After this, it's the controller's responsibility to handle
> such an event.
>
> If CHECK_SIZE returns positive (an over-MTU-sized packet shows up), then
> call output_userspace to push this packet upwards.
>
> I'm not sure this vague idea is the expected behaviour required by "L3
> processing".

Yes, I think that's the right path.
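
To make the CHECK_SIZE comparison concrete, here is a minimal sketch in
kernel-style C. The helper name is hypothetical; only skb_is_gso() and
skb_gso_network_seglen() are real kernel APIs, where such an action
would hook into the OVS action list is left open, and L2 vs. L3 header
accounting is glossed over:

    #include <linux/skbuff.h>

    /* Hypothetical CHECK_SIZE predicate -- not actual OVS code.
     * Returns nonzero when the packet (for GSO packets: each segment
     * that segmentation would produce) exceeds the egress MTU, i.e.
     * when the action should upcall to the controller instead of
     * executing OUTPUT.
     */
    static int check_size_exceeds_mtu(const struct sk_buff *skb,
                                      unsigned int mtu)
    {
            unsigned int len;

            if (skb_is_gso(skb))
                    /* Network-layer length of one resulting segment:
                     * L3/L4 headers plus gso_size bytes of payload.
                     */
                    len = skb_gso_network_seglen(skb);
            else
                    len = skb->len;

            return len > mtu;
    }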

^ permalink raw reply	[flat|nested] 57+ messages in thread
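
For reference on the "forged ICMP" being debated: on the plain IPv4
forwarding path the kernel reports an over-MTU, DF-set packet with a
single icmp_send() call (compare ip_forward() in net/ipv4/ip_forward.c).
A sketch of that reply, which is what OVS would be synthesizing on
behalf of addresses it does not own:

    #include <linux/icmp.h>
    #include <net/icmp.h>

    /* Sketch: the standard "Fragmentation Needed" reply used by the
     * IPv4 forwarding path; skb is the over-MTU packet and mtu the
     * egress MTU being exceeded.
     */
    static void sketch_send_frag_needed(struct sk_buff *skb,
                                        unsigned int mtu)
    {
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                      htonl(mtu));
    }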

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2015-01-09  5:48                                           ` Fan Du
@ 2015-01-12 18:55                                             ` Jesse Gross
  2015-01-13 16:58                                               ` Thomas Graf
  0 siblings, 1 reply; 57+ messages in thread
From: Jesse Gross @ 2015-01-12 18:55 UTC (permalink / raw)
  To: Fan Du
  Cc: Michael S. Tsirkin, Du, Fan, Thomas Graf, davem, Jason Wang,
	netdev, fw, dev, pshelar

On Thu, Jan 8, 2015 at 9:48 PM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> On 2015-01-09 03:55, Jesse Gross wrote:
>>
>> On Thu, Jan 8, 2015 at 1:39 AM, Fan Du <fengyuleidian0615@gmail.com>
>> wrote:
>>>
>>> On 2015-01-08 04:52, Jesse Gross wrote:
>>>>>
>>>>> My understanding is:
>>>>>
>>>>>> The controller sets the forwarding rules in the kernel datapath;
>>>>>> any flow not matching the rules is thrown to the controller by an
>>>>>> upcall. Once the rule decision is made by the controller, the
>>>>>> flow's packet is pushed back down to the datapath to be forwarded
>>>>>> again according to the new rule.
>>>>>>
>>>>>> So I'm not sure whether pushing the over-MTU-sized packet, or
>>>>>> pushing the forged ICMP without encapsulation, to the controller
>>>>>> is required by the current ovs implementation. By doing so, such
>>>>>> an over-MTU-sized packet is treated as an event for the
>>>>>> controller to take care of.
>>>>
>>>> If flows are implementing routing (again, they are doing things like
>>>> decrementing the TTL) then it is necessary for them to also handle
>>>> this situation using some potentially new primitives (like a size
>>>> check). Otherwise you end up with issues like the ones that I
>>>> mentioned above, like needing to forge addresses because you don't
>>>> know what the correct ones are.
>>>
>>> Thanks for explaining, Jesse!
>>>
>>> btw, I don't follow the point about "forging addresses": building the
>>> ICMP message from the guest packet doesn't require forging addresses
>>> when the ICMP message is not encapsulated with outer headers.
>>
>> Your patch has things like this (for the inner IP header):
>>
>> +                               new_ip->saddr = orig_ip->daddr;
>> +                               new_ip->daddr = orig_ip->saddr;
>>
>> These addresses are owned by the endpoints, not the host generating
>> the ICMP message, so I would consider that to be forging addresses.
>>
>>>> If the flows aren't doing things to
>>>> implement routing, then you really have a flat L2 network and you
>>>> shouldn't be doing this type of behavior at all as I described in the
>>>> original plan.
>>>
>>> For the flows-implementing-routing scenario:
>>> First of all, an over-MTU-sized packet can only be detected once the
>>> flow has been consulted (each port could implement a 'check' hook to
>>> do this), just before sending to the actual port.
>>>
>>> Then the over-MTU-sized packet is pushed back to the controller; it's
>>> the controller that will decide whether to build an ICMP message, or
>>> whatever other routing behaviour to take, and send it back with the
>>> port information. This ICMP message will travel back to the guest.
>>>
>>> Why does the flow have to use a primitive like "check size"? "check
>>> size" will only take effect after do_output. I'm not very clear on
>>> this approach.
>>
>> Checking the size obviously needs to be an action that would take
>> place before outputting in order for it to have any effect. Attaching
>> a check to a port does not fit in very well with the other primitives
>> of OVS, so I think an action is the obvious place to put it.
>>
>>> And not every scenario involves flows with routing behaviour; one may
>>> just set up a vxlan tunnel and attach a KVM guest or Docker container
>>> onto it for playing or developing. This wouldn't necessarily require
>>> the user to set additional specific flows to make over-MTU-sized
>>> packets pass through the tunnel correctly. In such a scenario, I
>>> think the original patch in this thread to fragment the tunnel packet
>>> is still needed, OR we work out a generic component to build ICMP for
>>> all tunnel types at the L2 level. Either of those would act as a
>>> backup plan, as no such specific flow exists by default.
>>
>> In these cases, we should find a way to adjust the MTU, preferably
>> automatically using virtio.
>
> I'm going to argue this a bit more here.
>
> virtio_net poses no limit on its simulated net device; the MTU can
> actually fall anywhere between 68 and 65535. Most importantly,
> virtio_net just simulates a NIC; it can't assume/presume there is an
> encapsulating port downstream of it. How should virtio automatically
> adjust the guest MTU above it?

There are at least two parts to this:
 * Calculating the right MTU for the guest device.
 * Transferring the MTU from the host to the guest.

The first would presumably involve exposing some kind of API that the
component that knows the right value could program. In this case, that
component could be OVS, using the same type of information that you
just described in the earlier post about L3. The API could simply be to
set the MTU of the device in the host and have this mirrored to the
guest.

The second part I guess is probably a fairly straightforward extension
to virtio but I don't know the details.
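
As it happens, this is essentially the shape the later VIRTIO_NET_F_MTU
extension took: the host advertises an MTU in config space and the
guest driver adopts it. A guest-side sketch under that assumption (the
feature bit and the mtu field of struct virtio_net_config did not yet
exist at the time of this thread):

    #include <linux/netdevice.h>
    #include <linux/virtio_config.h>
    #include <linux/virtio_net.h>

    /* Sketch: adopt an MTU advertised by the host through virtio
     * config space, ignoring it when absent or below the IPv4 minimum.
     */
    static void sketch_adopt_host_mtu(struct virtio_device *vdev,
                                      struct net_device *dev)
    {
            u16 mtu;

            if (!virtio_has_feature(vdev, VIRTIO_NET_F_MTU))
                    return;

            mtu = virtio_cread16(vdev,
                            offsetof(struct virtio_net_config, mtu));
            if (mtu >= 68)          /* RFC 791 minimum MTU */
                    dev->mtu = mtu;
    }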

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
  2015-01-12 18:55                                             ` Jesse Gross
@ 2015-01-13 16:58                                               ` Thomas Graf
  0 siblings, 0 replies; 57+ messages in thread
From: Thomas Graf @ 2015-01-13 16:58 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Fan Du, Michael S. Tsirkin, Du, Fan, davem, Jason Wang, netdev,
	fw, dev, pshelar, fuscof

On 01/12/15 at 10:55am, Jesse Gross wrote:
> There are at least two parts to this:
>  * Calculating the right MTU for the guest device.
>  * Transferring the MTU from the host to the guest.
> 
> The first would presumably involve exposing some kind of API that the
> component that knows the right value could program. In this case, that
> component could be OVS, using the same type of information that you
> just described in the earlier post about L3. The API could simply be
> to set the MTU of the device in the host and have this mirrored to the
> guest.
> 
> The second part I guess is probably a fairly straightforward extension
> to virtio but I don't know the details.

Francesco Fusco wrote code to do exactly this. Maybe he still has
it somewhere.

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2015-01-13 16:58 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-28  6:33 [PATCH net] gso: do GSO for local skb with size bigger than MTU Fan Du
2014-11-28  7:02 ` Jason Wang
2014-11-30 10:08   ` Du, Fan
2014-12-01 13:52     ` Thomas Graf
     [not found]       ` <20141201135225.GA16814-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-12-01 15:06         ` Michael S. Tsirkin
2014-12-02 15:48         ` Flavio Leitner
2014-12-02 17:09           ` Thomas Graf
     [not found]             ` <20141202170927.GA9457-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-12-02 17:34               ` Michael S. Tsirkin
2014-12-02 17:41                 ` Thomas Graf
2014-12-02 18:12                   ` Jesse Gross
     [not found]                     ` <CAEP_g=-86Z6pxNow-wjnbx_v9er_TSn6x5waigqVqYHa7tEQJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-12-03  9:03                       ` Michael S. Tsirkin
2014-12-03 18:07                         ` Jesse Gross
     [not found]                           ` <CAEP_g=9C+D3gbjJ4n1t6xuyjqEAMYi4ZfqPoe92UAoQJH-UsKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-12-03 18:38                             ` Michael S. Tsirkin
2014-12-03 18:56                               ` Rick Jones
     [not found]                                 ` <547F5CC2.8000908-VXdhtT5mjnY@public.gmane.org>
2014-12-04 10:17                                   ` Michael S. Tsirkin
2014-12-03 19:38                               ` Jesse Gross
2014-12-03 22:02                                 ` Thomas Graf
     [not found]                                   ` <20141203220244.GA8822-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-12-03 22:50                                     ` Michael S. Tsirkin
2014-12-03 22:51                                   ` Jesse Gross
2014-12-03 23:05                                     ` Thomas Graf
     [not found]                                       ` <20141203230551.GC8822-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-12-04  0:54                                         ` Jesse Gross
2014-12-04  1:15                                           ` Thomas Graf
2014-12-04  1:51                                             ` Jesse Gross
2014-12-04  9:26                                               ` Thomas Graf
2014-12-04 23:19                                                 ` Jesse Gross
2014-12-04  7:48                                     ` Du Fan
2014-12-04 23:23                                       ` Jesse Gross
2014-12-05  0:25                                         ` Du Fan
2014-12-03  2:31                   ` Du, Fan
2015-01-05  6:02                     ` Fan Du
     [not found]                       ` <54AA2912.6090903-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-01-05 17:58                         ` Jesse Gross
2015-01-06  9:34                           ` Fan Du
2015-01-06 19:11                             ` Jesse Gross
     [not found]                               ` <CAEP_g=8bCR=PeSoi09jLWLtNUrxhzx45h1Wm=9D=R57AqUac2w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-07  5:58                                 ` Fan Du
2015-01-07 20:52                                   ` Jesse Gross
     [not found]                                     ` <CAEP_g=8EBeQUFkRRsG3sznYryd+LE9qJKWQXfS==HG2HDO=UKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-08  9:39                                       ` Fan Du
2015-01-08 19:55                                         ` Jesse Gross
     [not found]                                           ` <CAEP_g=9hh+MG7AWEnct7CwRqp=ZghpbkDeQ5BhGQktDgMST1jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-09  5:42                                             ` Fan Du
2015-01-12 18:48                                               ` Jesse Gross
2015-01-09  5:48                                           ` Fan Du
2015-01-12 18:55                                             ` Jesse Gross
2015-01-13 16:58                                               ` Thomas Graf
2014-12-02 15:44     ` Flavio Leitner
2014-12-02 18:06       ` Jesse Gross
2014-12-02 21:32         ` Flavio Leitner
2014-12-02 21:47           ` Jesse Gross
2014-12-03  1:58           ` Du, Fan
2014-11-30 10:26 ` Florian Westphal
2014-11-30 10:55   ` Du, Fan
2014-11-30 15:11     ` Florian Westphal
2014-12-01  6:47       ` Du, Fan
2014-12-03  3:23 ` David Miller
2014-12-03  3:32   ` Du, Fan
2014-12-03  4:35     ` David Miller
2014-12-03  4:50       ` Du, Fan
2014-12-03  5:14         ` David Miller
2014-12-03  6:53           ` Du, Fan
