* A question on the design of OVS GRE tunnel
From: Cong Wang @ 2013-07-08  9:51 UTC
  To: Jesse Gross, Pravin Shelar; +Cc: netdev, Thomas Graf

Hi Jesse and Pravin,

I have a question on the design of the OVS GRE tunnel: why doesn't the
OVS GRE tunnel register a netdev? I understand that GRE can function
without registering a netdev; a GRE vport alone is sufficient and
probably even simpler.

However, I noticed a problem with this design:

I saw very bad performance with the _default_ OVS GRE setup. After
digging into it a bit, the cause is clearly that the OVS GRE tunnel
adds an outer IP header and a GRE header to every packet passed to it,
which can produce a packet longer than the MTU of the uplink; as a
result, after the packet goes through OVS, it has to be fragmented by
the IP layer before going onto the wire.
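
To make the arithmetic concrete (an illustration only, assuming plain
GRE without the key/csum options, and remembering that OVS
encapsulates the inner Ethernet frame):

/* Illustration, not OVS code: per-packet GRE encapsulation overhead. */
#define INNER_ETH_HLEN	14	/* inner Ethernet header (OVS tunnels L2) */
#define GRE_BASE_HLEN	4	/* GRE header without key/csum/seq options */
#define OUTER_IP_HLEN	20	/* outer IPv4 header */

/* A 1500-byte IP packet from the guest becomes
 * 1500 + 14 + 4 + 20 = 1538 bytes on the uplink, which exceeds its
 * 1500-byte MTU, so the IP layer fragments it. */
static inline int gre_encap_len(int inner_ip_len)
{
	return inner_ip_len + INNER_ETH_HLEN + GRE_BASE_HLEN + OUTER_IP_HLEN;
}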

Of course, we can work around this problem by:

1) lowering the MTU of the first net device to reserve some room for
the GRE header (see the example command below)

2) passing vnet_hdr=on to KVM guests so that packets going out are
still GSO packets even on the host (I have never tried this, I only
inferred it by reading the code)
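
For 1), with the 38 bytes of overhead above, the command would be
something like this (eth0 is just a placeholder; reserve more room if
GRE keys or checksums are enabled):

	ip link set dev eth0 mtu 1462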

Do we have to live with this? Can we solve this problem at the OVS
layer, so that we avoid IP fragmentation no matter how large the
packets are? One solution in my mind is to register a netdev for the
OVS GRE tunnel too, so that we could probably reuse GRO cells; packets
could then be merged before OVS processes them and segmented again
before going onto the wire (rough sketch below). But I could easily be
missing something here. ;)
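
Here is a rough sketch of what I mean (hypothetical code, modeled on
how the kernel GRE netdev uses gro_cells; the ovs_gre_* names are made
up for illustration):

/* Hypothetical sketch: a netdev-backed GRE vport could feed received
 * packets into per-cpu GRO cells, so that MTU-sized inner packets are
 * merged into large GSO skbs before OVS runs its flow lookup on them.
 */
#include <net/gro_cells.h>

struct ovs_gre_dev {			/* made-up private struct */
	struct gro_cells gro_cells;
};

static int ovs_gre_dev_init(struct net_device *dev)
{
	struct ovs_gre_dev *t = netdev_priv(dev);

	/* one GRO cell per cpu, tied to this netdev */
	return gro_cells_init(&t->gro_cells, dev);
}

static void ovs_gre_rx(struct net_device *dev, struct sk_buff *skb)
{
	struct ovs_gre_dev *t = netdev_priv(dev);

	/* instead of delivering the skb directly, queue it to GRO so
	 * consecutive segments of the same flow can be coalesced */
	gro_cells_receive(&t->gro_cells, skb);
}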

What do you think? Any better idea to fix this?

Thanks!

* Re: A question on the design of OVS GRE tunnel
From: Pravin Shelar @ 2013-07-08 16:28 UTC
  To: Cong Wang; +Cc: Jesse Gross, netdev, Thomas Graf

On Mon, Jul 8, 2013 at 2:51 AM, Cong Wang <amwang@redhat.com> wrote:
> Hi Jesse and Pravin,
>
> I have a question on the design of the OVS GRE tunnel: why doesn't
> the OVS GRE tunnel register a netdev? I understand that GRE can
> function without registering a netdev; a GRE vport alone is
> sufficient and probably even simpler.
>
The kernel GRE device has GRE parameters and state associated with it,
whereas the OVS GRE vport is completely stateless. The OVS GRE state
lives in user space, which makes the kernel module a lot simpler.
Therefore I doubt it would be easy or simpler to use a netdev at this
point.

> However, I noticed a problem with this design:
>
> I saw very bad performance with the _default_ OVS GRE setup. After
> digging into it a bit, the cause is clearly that the OVS GRE tunnel
> adds an outer IP header and a GRE header to every packet passed to
> it, which can produce a packet longer than the MTU of the uplink; as
> a result, after the packet goes through OVS, it has to be fragmented
> by the IP layer before going onto the wire.
>
I do not understand what you mean; GRE packets greater than the MTU
must be fragmented before being sent on the wire, and that is done by
the GRE GSO code.

> Of course, we can work around this problem by:
>
> 1) lowering the MTU of the first net device to reserve some room for
> the GRE header
>
> 2) passing vnet_hdr=on to KVM guests so that packets going out are
> still GSO packets even on the host (I have never tried this, I only
> inferred it by reading the code)
>
> Do we have to live with this? Can we solve this problem at the OVS
> layer, so that we avoid IP fragmentation no matter how large the
> packets are? One solution in my mind is to register a netdev for the
> OVS GRE tunnel too, so that we could probably reuse GRO cells;
> packets could then be merged before OVS processes them and segmented
> again before going onto the wire. But I could easily be missing
> something here. ;)
>
> What do you think? Any better idea to fix this?
>
> Thanks!
>

* Re: A question on the design of OVS GRE tunnel
From: Cong Wang @ 2013-07-09  2:41 UTC
  To: Pravin Shelar; +Cc: Jesse Gross, netdev, Thomas Graf

On Mon, 2013-07-08 at 09:28 -0700, Pravin Shelar wrote:
> On Mon, Jul 8, 2013 at 2:51 AM, Cong Wang <amwang@redhat.com> wrote:
> > Hi Jesse and Pravin,
> >
> > I have a question on the design of the OVS GRE tunnel: why doesn't
> > the OVS GRE tunnel register a netdev? I understand that GRE can
> > function without registering a netdev; a GRE vport alone is
> > sufficient and probably even simpler.
> >
> The kernel GRE device has GRE parameters and state associated with
> it, whereas the OVS GRE vport is completely stateless. The OVS GRE
> state lives in user space, which makes the kernel module a lot
> simpler. Therefore I doubt it would be easy or simpler to use a
> netdev at this point.

Understood; from the user's point of view it is simpler. At least no
one can assign an IP address to it.

> 
> > However, I noticed a problem with this design:
> >
> > I saw very bad performance with the _default_ OVS GRE setup. After
> > digging into it a bit, the cause is clearly that the OVS GRE tunnel
> > adds an outer IP header and a GRE header to every packet passed to
> > it, which can produce a packet longer than the MTU of the uplink;
> > as a result, after the packet goes through OVS, it has to be
> > fragmented by the IP layer before going onto the wire.
> >
> I do not understand what you mean; GRE packets greater than the MTU
> must be fragmented before being sent on the wire, and that is done
> by the GRE GSO code.
> 

Well, I said fragment, not segment. This is exactly why performance
is so bad.

In my _default_ setup, every net device on the path has MTU=1500, so
the packets coming out of a KVM guest can have length=1500. After they
go through the OVS GRE tunnel, their length becomes 1538 because of
the added GRE header and outer IP header.

After that, since the packets are not GSO (unless you pass vnet_hdr=on
to the KVM guest), the 1538-byte packets will be _fragmented_ by the
IP layer, because the destination uplink also has MTU=1500. This is
why I proposed reusing the GRO cell to merge the packets, which
requires a netdev...

This is the problem.

* Re: A question on the design of OVS GRE tunnel
From: Jesse Gross @ 2013-07-09  6:26 UTC
  To: Cong Wang; +Cc: netdev, dev@openvswitch.org

On Mon, Jul 8, 2013 at 7:41 PM, Cong Wang <amwang@redhat.com> wrote:
> On Mon, 2013-07-08 at 09:28 -0700, Pravin Shelar wrote:
>> On Mon, Jul 8, 2013 at 2:51 AM, Cong Wang <amwang@redhat.com> wrote:
>> > However, I noticed a problem with this design:
>> >
>> > I saw very bad performance with the _default_ OVS GRE setup.
>> > After digging into it a bit, the cause is clearly that the OVS GRE
>> > tunnel adds an outer IP header and a GRE header to every packet
>> > passed to it, which can produce a packet longer than the MTU of
>> > the uplink; as a result, after the packet goes through OVS, it has
>> > to be fragmented by the IP layer before going onto the wire.
>> >
>> I do not understand what you mean; GRE packets greater than the MTU
>> must be fragmented before being sent on the wire, and that is done
>> by the GRE GSO code.
>>
>
> Well, I said fragment, not segment. This is exactly why performance
> is so bad.
>
> In my _default_ setup, every net device on the path has MTU=1500, so
> the packets coming out of a KVM guest can have length=1500. After
> they go through the OVS GRE tunnel, their length becomes 1538 because
> of the added GRE header and outer IP header.
>
> After that, since the packets are not GSO (unless you pass
> vnet_hdr=on to the KVM guest), the 1538-byte packets will be
> _fragmented_ by the IP layer, because the destination uplink also has
> MTU=1500. This is why I proposed reusing the GRO cell to merge the
> packets, which requires a netdev...

Large packets coming from a modern KVM guest will use TSO because this
is a huge performance win regardless of whether any tunneling is used.
It doesn't make any sense for the guest IP stack to take a stream of
packets, split them apart, merge them in the hypervisor stack, and
split them again before transmission. Any packets potentially worth
merging will almost certainly have originated as a single buffer in
the guest, so we should keep them together all the way from the guest
to the GSO/TSO layer.

The real problem is that the requested MSS size is not correct. In the
"best" situation we would first segment the packet to the requested
size, add the tunnel headers, and then fragment. However, it looks to
me like the original size is being carried all the way to the GSO
code, which will then generate packets that are greater than the MTU.
Both of these can likely be improved upon by either convincing the
guest to automatically use a lower MSS or adjusting it ourselves.
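
(For example, inside the guest one could clamp the route MTU, e.g.
"ip route change default via <gw> dev eth0 mtu lock 1462", where 1462
is 1500 minus the 38 bytes of GRE+IP overhead, and <gw>/eth0 are
placeholders; TCP then derives a correspondingly lower MSS from it.
Only an illustration of the knob, not a complete answer.)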

* Re: A question on the design of OVS GRE tunnel
From: Cong Wang @ 2013-07-10  3:34 UTC
  To: Jesse Gross; +Cc: netdev, dev@openvswitch.org

On Mon, 2013-07-08 at 23:26 -0700, Jesse Gross wrote:
> On Mon, Jul 8, 2013 at 7:41 PM, Cong Wang <amwang@redhat.com> wrote:
> > On Mon, 2013-07-08 at 09:28 -0700, Pravin Shelar wrote:
> >> On Mon, Jul 8, 2013 at 2:51 AM, Cong Wang <amwang@redhat.com> wrote:
> >> > However, I noticed a problem with this design:
> >> >
> >> > I saw very bad performance with the _default_ OVS GRE setup.
> >> > After digging into it a bit, the cause is clearly that the OVS
> >> > GRE tunnel adds an outer IP header and a GRE header to every
> >> > packet passed to it, which can produce a packet longer than the
> >> > MTU of the uplink; as a result, after the packet goes through
> >> > OVS, it has to be fragmented by the IP layer before going onto
> >> > the wire.
> >> >
> >> I do not understand what you mean; GRE packets greater than the
> >> MTU must be fragmented before being sent on the wire, and that is
> >> done by the GRE GSO code.
> >>
> >
> > Well, I said fragment, not segment. This is exactly why
> > performance is so bad.
> >
> > In my _default_ setup, every net device on the path has MTU=1500,
> > so the packets coming out of a KVM guest can have length=1500.
> > After they go through the OVS GRE tunnel, their length becomes 1538
> > because of the added GRE header and outer IP header.
> >
> > After that, since the packets are not GSO (unless you pass
> > vnet_hdr=on to the KVM guest), the 1538-byte packets will be
> > _fragmented_ by the IP layer, because the destination uplink also
> > has MTU=1500. This is why I proposed reusing the GRO cell to merge
> > the packets, which requires a netdev...
> 
> Large packets coming from a modern KVM guest will use TSO because this
> is a huge performance win regardless of whether any tunneling is used.
> It doesn't make any sense for the guest IP stack to take a stream of
> packets, split them apart, merge them in the hypervisor stack, and
> split them again before transmission. Any packets potentially worth
> merging will almost certainly have originated as a single buffer in
> the guest, so we should keep them together all the way from the guest
> to the GSO/TSO layer.
> 
> The real problem is that the requested MSS size is not correct. In the
> "best" situation we would first segment the packet to the requested
> size, add the tunnel headers, and then fragment. However, it looks to
> me like the original size is being carried all the way to the GSO
> code, which will then generate packets that are greater than the MTU.
> Both of these can likely be improved upon by either convincing the
> guest to automatically use a lower MSS or adjusting it ourselves.

Yeah, unfortunately this is not easy to discover; people need some
knowledge and some time to find the problem and "fix" it. This is why
I think we should find a way to fix it if possible, or at least
document it.

Thanks!
