From: Jesse Gross <jesse@nicira.com>
To: Jesse Gross <jesse@nicira.com>, "Du, Fan" <fan.du@intel.com>,
	Jason Wang <jasowang@redhat.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"fw@strlen.de" <fw@strlen.de>
Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
Date: Tue, 2 Dec 2014 13:47:07 -0800
Message-ID: <CAEP_g=_oyUkM=XK0cqLaKa2mS759hrh_ksd8GxbCJJLce-c_DQ@mail.gmail.com>
In-Reply-To: <20141202213232.GC5344@t520.home>

On Tue, Dec 2, 2014 at 1:32 PM, Flavio Leitner <fbl@redhat.com> wrote:
> On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:
>> On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
>> > On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
>> >>
>> >>
>> >> >-----Original Message-----
>> >> >From: Jason Wang [mailto:jasowang@redhat.com]
>> >> >Sent: Friday, November 28, 2014 3:02 PM
>> >> >To: Du, Fan
>> >> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
>> >> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>> >> >
>> >> >
>> >> >
>> >> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> >> >> Test scenario: two KVM guests sitting in different hosts communicate
>> >> >> to each other with a vxlan tunnel.
>> >> >>
>> >> >> All interface MTUs are the default 1500 bytes. From the guest's point
>> >> >> of view, its skb gso_size can be as big as 1448 bytes. However, after
>> >> >> the guest skb goes through vxlan encapsulation, individual segments of
>> >> >> a gso packet can exceed the physical NIC MTU of 1500, and they will be
>> >> >> lost at the receiver side.
>> >> >>
>> >> >> So it's possible in a virtualized environment that a locally created
>> >> >> skb is bigger than the underlay MTU after encapsulation. In such a
>> >> >> case, it's reasonable to do GSO first, then fragment any packet bigger
>> >> >> than the MTU where possible.
>> >> >>
>> >> >> +---------------+ TX     RX +---------------+
>> >> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> >> >> +-+-----------+-+           +-+-----------+-+
>> >> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
>> >> >>   +-----------+               +-----------+
>> >> >>        |                            |
>> >> >>        v tap0                  tap0 v
>> >> >>   +-----------+               +-----------+
>> >> >>   | ovs bridge|               | ovs bridge|
>> >> >>   +-----------+               +-----------+
>> >> >>        | vxlan                vxlan |
>> >> >>        v                            v
>> >> >>   +-----------+               +-----------+
>> >> >>   |    NIC    |    <------>   |    NIC    |
>> >> >>   +-----------+               +-----------+
>> >> >>
>> >> >> Steps to reproduce:
>> >> >>  1. Use the kernel's built-in openvswitch module to set up an ovs bridge.
>> >> >>  2. Run iperf without -M; the connection gets stuck.
>> >> >
>> >> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
>> >> >believe.
>> >>
>> >> The problem here is that the host stack pushes a local over-sized gso skb down to the NIC
>> >> and performs GSO there without any further IP segmentation.
>> >>
>> >> Reasonable behavior is to do gso first at the IP level; if the gso-ed skb is bigger than the
>> >> MTU && df is set, then push an ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to the sender to adjust the mtu.
>> >>
>> >> For PMTU to work, that's another issue I will try to address later on.
>> >>
>> >> >>
>> >> >>
>> >> >> Signed-off-by: Fan Du <fan.du@intel.com>
>> >> >> ---
>> >> >>  net/ipv4/ip_output.c |    7 ++++---
>> >> >>  1 files changed, 4 insertions(+), 3 deletions(-)
>> >> >>
>> >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> >> >> index bc6471d..558b5f8 100644
>> >> >> --- a/net/ipv4/ip_output.c
>> >> >> +++ b/net/ipv4/ip_output.c
>> >> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>> >> >>    struct sk_buff *segs;
>> >> >>    int ret = 0;
>> >> >>
>> >> >> -  /* common case: locally created skb or seglen is <= mtu */
>> >> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> >> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >> +  /* Both locally created skb and forwarded skb could exceed
>> >> >> +   * MTU size, so make a unified rule for them all.
>> >> >> +   */
>> >> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >>            return ip_finish_output2(skb);
>> >
>> >
>> > Are you using kernel's vxlan device or openvswitch's vxlan device?
>> >
>> > Because for kernel's vxlan devices the MTU accounts for the header
>> > overhead so I believe your patch would work.  However, the MTU is
>> > not visible for the ovs's vxlan devices, so that wouldn't work.
>>
>> This is being called after the tunnel code, so the MTU that is being
>> looked at in all cases is the physical device's. Since the packet has
>> already been encapsulated, tunnel header overhead is already accounted
>> for in skb_gso_network_seglen() and this should be fine for both OVS
>> and non-OVS cases.
>
> Right, it didn't work on my first try and that explanation came to mind.
>
> Anyway, I am testing this with containers instead of VMs, so I am using
> veth and not Virtio-net.
>
> This is the actual stack trace:
>
> [...]
>   do_output
>   ovs_vport_send
>   vxlan_tnl_send
>   vxlan_xmit_skb
>   udp_tunnel_xmit_skb
>   iptunnel_xmit
>    \ skb_scrub_packet => skb->ignore_df = 0;
>   ip_local_out_sk
>   ip_output
>   ip_finish_output (_gso is inlined)
>   ip_fragment
>
> and on ip_fragment() it does:
>
>         if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
>                      (IPCB(skb)->frag_max_size &&
>                       IPCB(skb)->frag_max_size > mtu))) {
>                 IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
>                 icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>                           htonl(mtu));
>                 kfree_skb(skb);
>                 return -EMSGSIZE;
>         }
>
> Since IP_DF is set and skb->ignore_df is reset to 0, in my case
> the packet is dropped and an ICMP is sent back. The connection
> remains stuck as before. Doesn't virtio-net set DF bit?

I see now - I agree it's not clear how the original patch worked. The
DF bit is actually set by the encapsulator so it should be the same
regardless of how the packet got there (in OVS it is on by default).
It seems like skb_scrub_packet() shouldn't clear skb->ignore_df (or if
it should in other cases then it seems we need an option to not do it
for tunnels).
