[PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
@ 2016-01-06 13:33 David Wragg
       [not found] ` <1452087186-12926-1-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
  2016-01-06 13:33 ` [PATCH net 2/2] " David Wragg
  0 siblings, 2 replies; 28+ messages in thread
From: David Wragg @ 2016-01-06 13:33 UTC (permalink / raw)
  To: netdev, dev; +Cc: David Wragg

Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of
any size, constrained only by the ability to transmit the resulting
UDP packets.  4.3 introduced vxlan netdevs corresponding to vxlan
vports.  These netdevs have an MTU, which limits the size of a packet
that can be successfully vxlan-encapsulated.  The default value for
this MTU is 1500, which is awkwardly small, and leads to a conspicuous
change in behaviour for userspace.

These two patches set the MTU on openvswitch-crated vxlan devices to
be 65465 (the maximum IP packet size minus the vxlan-on-IPv6
overhead), effectively restoring the behaviour prior to 4.3.  In order
to accomplish this, the first patch removes the MTU constraint of 1500
for vxlan netdevs without an underlying device.

David Wragg (2):
  vxlan: Relax the MTU constraint on vxlan devices
  vxlan: Set a large MTU on ovs-created vxlan devices

 drivers/net/vxlan.c           | 38 +++++++++++++++++++++++++-------------
 net/openvswitch/vport-vxlan.c |  2 ++
 2 files changed, 27 insertions(+), 13 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices
       [not found] ` <1452087186-12926-1-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
@ 2016-01-06 13:33   ` David Wragg
       [not found]     ` <1452087186-12926-2-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
  2016-01-06 20:59   ` [PATCH net 0/2] vxlan: Set a large MTU on ovs-created " David Miller
  1 sibling, 1 reply; 28+ messages in thread
From: David Wragg @ 2016-01-06 13:33 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, dev-yBygre7rU0TnMu66kgdUjQ; +Cc: David Wragg

Allow the MTU of vxlan devices without an underlying device to be set to
larger values (up to a maximum based on IP packet limits and vxlan
overhead).

Previously, their MTUs could not be set to higher than the conventional
ethernet value of 1500.  This is a very arbitrary value in the context
of vxlan, and prevented such vxlan devices from being able to take
advantage of jumbo frames etc.

The default MTU remains 1500, for compatibility.

Signed-off-by: David Wragg <david@weave.works>
---
 drivers/net/vxlan.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ba363ce..96d1c55 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2359,21 +2359,19 @@ static void vxlan_set_multicast_list(struct net_device *dev)
 {
 }
 
-static int vxlan_change_mtu(struct net_device *dev, int new_mtu)
+static int __vxlan_change_mtu(struct net_device *dev,
+			      struct net_device *lowerdev,
+			      struct vxlan_rdst *dst, int new_mtu)
 {
-	struct vxlan_dev *vxlan = netdev_priv(dev);
-	struct vxlan_rdst *dst = &vxlan->default_dst;
-	struct net_device *lowerdev;
-	int max_mtu;
+	int max_mtu = 65535;
 
-	lowerdev = __dev_get_by_index(vxlan->net, dst->remote_ifindex);
-	if (lowerdev == NULL)
-		return eth_change_mtu(dev, new_mtu);
+	if (lowerdev)
+		max_mtu = lowerdev->mtu;
 
 	if (dst->remote_ip.sa.sa_family == AF_INET6)
-		max_mtu = lowerdev->mtu - VXLAN6_HEADROOM;
+		max_mtu -= VXLAN6_HEADROOM;
 	else
-		max_mtu = lowerdev->mtu - VXLAN_HEADROOM;
+		max_mtu -= VXLAN_HEADROOM;
 
 	if (new_mtu < 68 || new_mtu > max_mtu)
 		return -EINVAL;
@@ -2382,6 +2380,15 @@ static int vxlan_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+static int vxlan_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_rdst *dst = &vxlan->default_dst;
+	struct net_device *lowerdev = __dev_get_by_index(vxlan->net,
+							 dst->remote_ifindex);
+	return __vxlan_change_mtu(dev, lowerdev, dst, new_mtu);
+}
+
 static int egress_ipv4_tun_info(struct net_device *dev, struct sk_buff *skb,
 				struct ip_tunnel_info *info,
 				__be16 sport, __be16 dport)
-- 
2.5.0

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net 2/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-06 13:33 [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices David Wragg
       [not found] ` <1452087186-12926-1-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
@ 2016-01-06 13:33 ` David Wragg
       [not found]   ` <1452087186-12926-3-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: David Wragg @ 2016-01-06 13:33 UTC (permalink / raw)
  To: netdev, dev; +Cc: David Wragg

Prior to 4.3, vxlan vports could transmit vxlan packets of any size,
constrained only by the ability to transmit the resulting UDP packets.
4.3 introduced vxlan netdevs corresponding to vxlan vports.  These
netdevs have an MTU, which limits the size of a packet that can be
successfully vxlan-encapsulated.  The default value for this MTU is
1500, which is awkwardly small, and leads to a conspicuous change in
behaviour for userspace.

This sets the MTU on openvswitch-created vxlan devices to be 65465
(the maximum IP packet size minus the vxlan-on-IPv6 overhead),
effectively restoring the behaviour prior to 4.3.  Although the
vxlan_config struct already had a mtu field for this,
vxlan_dev_configure mostly ignored it; that is also addressed here.

Signed-off-by: David Wragg <david@weave.works>
---
 drivers/net/vxlan.c           | 11 ++++++++---
 net/openvswitch/vport-vxlan.c |  2 ++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 96d1c55..a15d300 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2764,6 +2764,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
 	int err;
 	bool use_ipv6 = false;
 	__be16 default_port = vxlan->cfg.dst_port;
+	struct net_device *lowerdev = NULL;
 
 	vxlan->net = src_net;
 
@@ -2784,9 +2785,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
 	}
 
 	if (conf->remote_ifindex) {
-		struct net_device *lowerdev
-			 = __dev_get_by_index(src_net, conf->remote_ifindex);
-
+		lowerdev = __dev_get_by_index(src_net, conf->remote_ifindex);
 		dst->remote_ifindex = conf->remote_ifindex;
 
 		if (!lowerdev) {
@@ -2810,6 +2809,12 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
 		needed_headroom = lowerdev->hard_header_len;
 	}
 
+	if (conf->mtu) {
+		err = __vxlan_change_mtu(dev, lowerdev, dst, conf->mtu);
+		if (err)
+			return err;
+	}
+
 	if (use_ipv6 || conf->flags & VXLAN_F_COLLECT_METADATA)
 		needed_headroom += VXLAN6_HEADROOM;
 	else
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index 1605691..a97279f 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -91,6 +91,8 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
 	struct vxlan_config conf = {
 		.no_share = true,
 		.flags = VXLAN_F_COLLECT_METADATA,
+		/* The maximum VXLAN payload to fit in an IPv6 packet */
+		.mtu = 65535 - VXLAN6_HEADROOM,
 	};
 
 	if (!options) {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
       [not found] ` <1452087186-12926-1-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
  2016-01-06 13:33   ` [PATCH net 1/2] vxlan: Relax the MTU constraint on " David Wragg
@ 2016-01-06 20:59   ` David Miller
       [not found]     ` <20160106.155950.1007160228570301281.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: David Miller @ 2016-01-06 20:59 UTC (permalink / raw)
  To: david-1SEAoVOfG6VEzL6FDj/jAg
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA

From: David Wragg <david@weave.works>
Date: Wed,  6 Jan 2016 13:33:04 +0000

> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of
> any size, constrained only by the ability to transmit the resulting
> UDP packets.  4.3 introduced vxlan netdevs corresponding to vxlan
> vports.  These netdevs have an MTU, which limits the size of a packet
> that can be successfully vxlan-encapsulated.  The default value for
> this MTU is 1500, which is awkwardly small, and leads to a conspicuous
> change in behaviour for userspace.
> 
> These two patches set the MTU on openvswitch-crated vxlan devices to
> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6
> overhead), effectively restoring the behaviour prior to 4.3.  In order
> to accomplish this, the first patch removes the MTU constraint of 1500
> for vxlan netdevs without an underlying device.

Is this really the right thing to do?  Won't we get a lot of fragmentation
by using such a large MTU, especially since you're making it the default
for OVS setups?

Things like path MTU discovery hinge strongly upon accurate MTU settings.
Otherwise they won't function properly.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
       [not found]     ` <20160106.155950.1007160228570301281.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
@ 2016-01-06 22:53       ` Jesse Gross
  2016-01-06 23:25       ` David Wragg
  1 sibling, 0 replies; 28+ messages in thread
From: Jesse Gross @ 2016-01-06 22:53 UTC (permalink / raw)
  To: David Miller
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, david-1SEAoVOfG6VEzL6FDj/jAg,
	Linux Kernel Network Developers

On Wed, Jan 6, 2016 at 12:59 PM, David Miller <davem@davemloft.net> wrote:
> From: David Wragg <david@weave.works>
> Date: Wed,  6 Jan 2016 13:33:04 +0000
>
>> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of
>> any size, constrained only by the ability to transmit the resulting
>> UDP packets.  4.3 introduced vxlan netdevs corresponding to vxlan
>> vports.  These netdevs have an MTU, which limits the size of a packet
>> that can be successfully vxlan-encapsulated.  The default value for
>> this MTU is 1500, which is awkwardly small, and leads to a conspicuous
>> change in behaviour for userspace.
>>
>> These two patches set the MTU on openvswitch-crated vxlan devices to
>> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6
>> overhead), effectively restoring the behaviour prior to 4.3.  In order
>> to accomplish this, the first patch removes the MTU constraint of 1500
>> for vxlan netdevs without an underlying device.
>
> Is this really the right thing to do?  Won't we get a lot of fragmentation
> by using such a large MTU, especially since you're making it the default
> for OVS setups?
>
> Things like path MTU discovery hinge strongly upon accurate MTU settings.
> Otherwise they won't function properly.

At a minimum, I don't think this should be VXLAN specific. But I agree
that I'm not sure this is the right thing to do.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
       [not found]     ` <20160106.155950.1007160228570301281.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
  2016-01-06 22:53       ` Jesse Gross
@ 2016-01-06 23:25       ` David Wragg
  2016-01-06 23:57         ` [ovs-dev] " Jesse Gross
  2016-01-07 21:47         ` David Miller
  1 sibling, 2 replies; 28+ messages in thread
From: David Wragg @ 2016-01-06 23:25 UTC (permalink / raw)
  To: David Miller; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA

David Miller <davem@davemloft.net> writes:
>> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of
>> any size, constrained only by the ability to transmit the resulting
>> UDP packets.  4.3 introduced vxlan netdevs corresponding to vxlan
>> vports.  These netdevs have an MTU, which limits the size of a packet
>> that can be successfully vxlan-encapsulated.  The default value for
>> this MTU is 1500, which is awkwardly small, and leads to a conspicuous
>> change in behaviour for userspace.
>> 
>> These two patches set the MTU on openvswitch-crated vxlan devices to
>> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6
>> overhead), effectively restoring the behaviour prior to 4.3.  In order
>> to accomplish this, the first patch removes the MTU constraint of 1500
>> for vxlan netdevs without an underlying device.
>
> Is this really the right thing to do?

I'm certainly open to suggestions of better ways to solve the problem.

To be clear, the problem from our perspective is that a use of the
kernel openvswitch that worked fine in 4.2 and earlier is hobbled in
4.3.  Previously the MTU of an openvswitch-based vxlan overlay network
was constrained only by the MTU of the physical network.  In 4.3, we
can't take advantage of physical networks that support jumbo frames,
causing a huge hit to throughput across the overlay network.

The specific limit of 1500 seems very arbitrary.  For a vxlan overlay
network on top of a traditional ethernet network, the "correct" MTU for
the vxlan netdevs is 1450 rather than 1500.  And in general with
openvswitch, the destination for vxlan packets is determined on a
packet-by-packet basis, possibly involving different path MTUs of the
underlying network.  There is no single "correct" value.

> Won't we get a lot of fragmentation
> by using such a large MTU, especially since you're making it the default
> for OVS setups?

In the context of the openvswitch vxlan vport transmit path, I can't
find a place where the dev->mtu is used (and it would be surprising, on
the basis that the relevant parts of vxlan.c have not changed that much
since 4.2, when no netdev was involved in that path).

Considering non-openvswitch scenarios, when using vxlan netdevs
directly, a vxlan netdev locked to an underlying device supporting jumbo
frames can use a larger MTU.  It's only vxlan netdevs without an
underlying device that have the limit of 1500 imposed.  But why
shouldn't there be the same flexibility to select an MTU for best
performance in both cases?  Aren't the fragmentation concerns the same?

> Things like path MTU discovery hinge strongly upon accurate MTU settings.
> Otherwise they won't function properly.

True.  But in what sense is 1500 accurate?  Uses/users of the kernel
openvswitch code have always had to get this right, making sure that the
MTU set on a vxlan overlay network conforms to the underlying network
paths involved.

David
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-06 23:25       ` David Wragg
@ 2016-01-06 23:57         ` Jesse Gross
  2016-01-07  0:14           ` Hannes Frederic Sowa
       [not found]           ` <CAEh+42iWSZOyikNydU2Bs8meqYfrKfUJLDGFJ8HzQ06k64LP0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-01-07 21:47         ` David Miller
  1 sibling, 2 replies; 28+ messages in thread
From: Jesse Gross @ 2016-01-06 23:57 UTC (permalink / raw)
  To: David Wragg; +Cc: David Miller, dev, Linux Kernel Network Developers

On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote:
> David Miller <davem@davemloft.net> writes:
>>> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of
>>> any size, constrained only by the ability to transmit the resulting
>>> UDP packets.  4.3 introduced vxlan netdevs corresponding to vxlan
>>> vports.  These netdevs have an MTU, which limits the size of a packet
>>> that can be successfully vxlan-encapsulated.  The default value for
>>> this MTU is 1500, which is awkwardly small, and leads to a conspicuous
>>> change in behaviour for userspace.
>>>
>>> These two patches set the MTU on openvswitch-crated vxlan devices to
>>> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6
>>> overhead), effectively restoring the behaviour prior to 4.3.  In order
>>> to accomplish this, the first patch removes the MTU constraint of 1500
>>> for vxlan netdevs without an underlying device.
>>
>> Is this really the right thing to do?
>
> I'm certainly open to suggestions of better ways to solve the problem.

One option is to simply set the MTU on the device from userspace.

The reality is that the code you're modifying is compatibility code.
Maybe we should make this change to preserve the old behavior for old
callers (although, again, it should not be just for VXLAN). But no new
features or tunnel types will be supported in this manner.

New or updated userspace programs should work by simply creating and
adding tunnel devices to OVS. That won't go through this path at all
so you're going to need to find another approach in the near future in
any case.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-06 23:57         ` [ovs-dev] " Jesse Gross
@ 2016-01-07  0:14           ` Hannes Frederic Sowa
  2016-01-07  0:46             ` Jesse Gross
       [not found]           ` <CAEh+42iWSZOyikNydU2Bs8meqYfrKfUJLDGFJ8HzQ06k64LP0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: Hannes Frederic Sowa @ 2016-01-07  0:14 UTC (permalink / raw)
  To: Jesse Gross, David Wragg
  Cc: David Miller, dev, Linux Kernel Network Developers

Hi,

On 07.01.2016 00:57, Jesse Gross wrote:
> On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote:
>> David Miller <davem@davemloft.net> writes:
>>>> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of
>>>> any size, constrained only by the ability to transmit the resulting
>>>> UDP packets.  4.3 introduced vxlan netdevs corresponding to vxlan
>>>> vports.  These netdevs have an MTU, which limits the size of a packet
>>>> that can be successfully vxlan-encapsulated.  The default value for
>>>> this MTU is 1500, which is awkwardly small, and leads to a conspicuous
>>>> change in behaviour for userspace.
>>>>
>>>> These two patches set the MTU on openvswitch-crated vxlan devices to
>>>> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6
>>>> overhead), effectively restoring the behaviour prior to 4.3.  In order
>>>> to accomplish this, the first patch removes the MTU constraint of 1500
>>>> for vxlan netdevs without an underlying device.
>>>
>>> Is this really the right thing to do?
>>
>> I'm certainly open to suggestions of better ways to solve the problem.
>
> One option is to simply set the MTU on the device from userspace.
>
> The reality is that the code you're modifying is compatibility code.
> Maybe we should make this change to preserve the old behavior for old
> callers (although, again, it should not be just for VXLAN). But no new
> features or tunnel types will be supported in this manner.
>
> New or updated userspace programs should work by simply creating and
> adding tunnel devices to OVS. That won't go through this path at all
> so you're going to need to find another approach in the near future in
> any case.

I don't see any other way as to make MTUs part of the flow if we want to 
have correct ip_local_error notifications. And those must also work 
across VMs, so openvswitch in quasi brouting mode would need to emit 
ICMP PtBs (hopefully with a correct source address, otherwise uRPF kills 
them before reaching the applications) or do error signaling via virtio_net.

Either the openvswitch user space can feed those information to the 
datapath or the ovs dataplane can do a lookup on the outer ip address 
while filling out the metadata_dst and caching it in the flow or we just 
keep the dst in the flow anyway. So a net_device used by ovs has no real 
mtu anymore.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
       [not found]           ` <CAEh+42iWSZOyikNydU2Bs8meqYfrKfUJLDGFJ8HzQ06k64LP0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-01-07  0:29             ` David Wragg
       [not found]               ` <86wprmp6z6.fsf-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: David Wragg @ 2016-01-07  0:29 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Linux Kernel Network Developers,
	David Miller

Jesse Gross <jesse@kernel.org> writes:
> On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote:
>> I'm certainly open to suggestions of better ways to solve the problem.
>
> One option is to simply set the MTU on the device from userspace.

If that worked I wouldn't be submitting a patch.

The MTU value of 1500 is not merely the default.  It is also the maximum
allowed for a vxlan netdev not associated with an underlying netdev.  If
you do e.g. "ip link set dev vxlan-6784 mtu 8950", where vxlan-6784
was created by an ovs vport, it fails with EINVAL.

The first patch of the two submitted removes that limit.

> The reality is that the code you're modifying is compatibility code.
> Maybe we should make this change to preserve the old behavior or old
> callers (although, again, it should not be just for VXLAN). But no new
> features or tunnel types will be supported in this manner.

That's fine.  Naturally, the ideal from our point of view is if the
compatibility code is fully compatible, so we don't have to make changes
on our side that involve different code paths for different kernel
versions.  That's what my patches are intended to achieve.

But we can live with such changes on our side, as long as there is some
reasonable way to do so.  In the case of this vxlan MTU issue, there
doesn't seem to be one.

> New or updated userspace programs should work by simply creating and
> adding tunnel devices to OVS. That won't go through this path at all
> so you're going to need to find another approach in the near future in
> any case.

Ok.  But please try to be gentle on the poor souls who have to come up
with a single codebase that works on a range of kernel versions going
back a few years.

David
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-07  0:14           ` Hannes Frederic Sowa
@ 2016-01-07  0:46             ` Jesse Gross
  2016-01-07 11:49               ` Thomas Graf
  0 siblings, 1 reply; 28+ messages in thread
From: Jesse Gross @ 2016-01-07  0:46 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Wragg, David Miller, dev, Linux Kernel Network Developers

On Wed, Jan 6, 2016 at 4:14 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi,
>
>
> On 07.01.2016 00:57, Jesse Gross wrote:
>>
>> On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote:
>>>
>>> David Miller <davem@davemloft.net> writes:
>>>>>
>>>>> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of
>>>>> any size, constrained only by the ability to transmit the resulting
>>>>> UDP packets.  4.3 introduced vxlan netdevs corresponding to vxlan
>>>>> vports.  These netdevs have an MTU, which limits the size of a packet
>>>>> that can be successfully vxlan-encapsulated.  The default value for
>>>>> this MTU is 1500, which is awkwardly small, and leads to a conspicuous
>>>>> change in behaviour for userspace.
>>>>>
>>>>> These two patches set the MTU on openvswitch-crated vxlan devices to
>>>>> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6
>>>>> overhead), effectively restoring the behaviour prior to 4.3.  In order
>>>>> to accomplish this, the first patch removes the MTU constraint of 1500
>>>>> for vxlan netdevs without an underlying device.
>>>>
>>>>
>>>> Is this really the right thing to do?
>>>
>>>
>>> I'm certainly open to suggestions of better ways to solve the problem.
>>
>>
>> One option is to simply set the MTU on the device from userspace.
>>
>> The reality is that the code you're modifying is compatibility code.
>> Maybe we should make this change to preserve the old behavior for old
>> callers (although, again, it should not be just for VXLAN). But no new
>> features or tunnel types will be supported in this manner.
>>
>> New or updated userspace programs should work by simply creating and
>> adding tunnel devices to OVS. That won't go through this path at all
>> so you're going to need to find another approach in the near future in
>> any case.
>
>
> I don't see any other way as to make MTUs part of the flow if we want to
> have correct ip_local_error notifications. And those must also work across
> VMs, so openvswitch in quasi brouting mode would need to emit ICMP PtBs
> (hopefully with a correct source address, otherwise uRPF kills them before
> reaching the applications) or do error signaling via virtio_net.

I actually implemented this a long time ago and then there was some
additional discussion on this about a year ago. I agree it's the right
solution overall but it's not entirely clearly to me how to get the
details correct.

> Either the openvswitch user space can feed those information to the datapath
> or the ovs dataplane can do a lookup on the outer ip address while filling
> out the metadata_dst and caching it in the flow or we just keep the dst in
> the flow anyway. So a net_device used by ovs has no real mtu anymore.

I agree that the concept of MTU is much more complicated than a single
number on a device, we just have to find the right way to model it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
       [not found]               ` <86wprmp6z6.fsf-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
@ 2016-01-07  1:10                 ` Jesse Gross
  0 siblings, 0 replies; 28+ messages in thread
From: Jesse Gross @ 2016-01-07  1:10 UTC (permalink / raw)
  To: David Wragg
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Linux Kernel Network Developers,
	David Miller

On Wed, Jan 6, 2016 at 4:29 PM, David Wragg <david@weave.works> wrote:
> Jesse Gross <jesse@kernel.org> writes:
>> On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote:
>>> I'm certainly open to suggestions of better ways to solve the problem.
>>
>> One option is to simply set the MTU on the device from userspace.
>
> If that worked I wouldn't be submitting a patch.
>
> The MTU value of 1500 is not merely the default.  It is also the maximum
> allowed for a vxlan netdev not associated with an underlying netdev.  If
> you do e.g. "ip link set dev vxlan-6784 mtu 8950", where vxlan-6784
> was created by an ovs vport, it fails with EINVAL.
>
> The first patch of the two submitted removes that limit.

I saw your first patch and I agree that it fixes a problem. I was
referring to the second patch.

>> The reality is that the code you're modifying is compatibility code.
>> Maybe we should make this change to preserve the old behavior or old
>> callers (although, again, it should not be just for VXLAN). But no new
>> features or tunnel types will be supported in this manner.
>
> That's fine.  Naturally, the ideal from our point of view is if the
> compatibility code is fully compatible, so we don't have to make changes
> on our side that involve different code paths for different kernel
> versions.  That's what my patches are intended to achieve.

The intention is to be fully backwards compatible with existing
software. If you want to take advantage of future functionality then,
yes, you will need to change to the new model.

I agree that behavior changed with existing compatibility code. I'm
fine with your series as long as you generalize it to all tunnel types
and not just VXLAN. Just be aware that you're going to have to find
another solution long term.

> Ok.  But please try to be gentle on the poor souls who have to come up
> with a single codebase that works on a range of kernel versions going
> back a few years.

I maintain a large program that needs to do this myself, so I am aware
that it can be challenging.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices
       [not found]     ` <1452087186-12926-2-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
@ 2016-01-07 11:24       ` Thomas Graf
  2016-01-07 11:31         ` David Wragg
  2016-01-09 18:39       ` roopa
  1 sibling, 1 reply; 28+ messages in thread
From: Thomas Graf @ 2016-01-07 11:24 UTC (permalink / raw)
  To: David Wragg; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA

On 01/06/16 at 01:33pm, David Wragg wrote:
> Allow the MTU of vxlan devices without an underlying device to be set to
> larger values (up to a maximum based on IP packet limits and vxlan
> overhead).
> 
> Previously, their MTUs could not be set to higher than the conventional
> ethernet value of 1500.  This is a very arbitrary value in the context
> of vxlan, and prevented such vxlan devices from being able to take
> advantage of jumbo frames etc.
> 
> The default MTU remains 1500, for compatibility.
> 
> Signed-off-by: David Wragg <david@weave.works>

A remain of eth_change_mtu

> +	int max_mtu = 65535;

This should probably be represented as a new const DEV_MAX_MTU which
can be used by veth, tun, and virtio as well instead of hardcoding
this separately in each driver.

> +static int vxlan_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +	struct vxlan_dev *vxlan = netdev_priv(dev);
> +	struct vxlan_rdst *dst = &vxlan->default_dst;
> +	struct net_device *lowerdev = __dev_get_by_index(vxlan->net,
> +							 dst->remote_ifindex);
> +	return __vxlan_change_mtu(dev, lowerdev, dst, new_mtu);

Any particular reason for the indirection?
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices
  2016-01-07 11:24       ` Thomas Graf
@ 2016-01-07 11:31         ` David Wragg
  2016-01-07 11:50           ` Thomas Graf
  0 siblings, 1 reply; 28+ messages in thread
From: David Wragg @ 2016-01-07 11:31 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, dev

Thomas Graf <tgraf@suug.ch> writes:
>> +	int max_mtu = 65535;
>
> This should probably be represented as a new const DEV_MAX_MTU which
> can be used by veth, tun, and virtio as well instead of hardcoding
> this separately in each driver.

I discovered IP_MAX_MTU in net/route.h after putting the patch together.
Seems appropriate to use that?

>> +static int vxlan_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +	struct vxlan_dev *vxlan = netdev_priv(dev);
>> +	struct vxlan_rdst *dst = &vxlan->default_dst;
>> +	struct net_device *lowerdev = __dev_get_by_index(vxlan->net,
>> +							 dst->remote_ifindex);
>> +	return __vxlan_change_mtu(dev, lowerdev, dst, new_mtu);
>
> Any particular reason for the indirection?

To make patch 2/2 simpler.  I can rearrange to eliminate the indirection
here if that is preferred.

David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 2/2] vxlan: Set a large MTU on ovs-created vxlan devices
       [not found]   ` <1452087186-12926-3-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
@ 2016-01-07 11:36     ` Thomas Graf
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Graf @ 2016-01-07 11:36 UTC (permalink / raw)
  To: David Wragg; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA

On 01/06/16 at 01:33pm, David Wragg wrote:
> Prior to 4.3, vxlan vports could transmit vxlan packets of any size,
> constrained only by the ability to transmit the resulting UDP packets.
> 4.3 introduced vxlan netdevs corresponding to vxlan vports.  These
> netdevs have an MTU, which limits the size of a packet that can be
> successfully vxlan-encapsulated.  The default value for this MTU is
> 1500, which is awkwardly small, and leads to a conspicuous change in
> behaviour for userspace.
> 
> This sets the MTU on openvswitch-created vxlan devices to be 65465
> (the maximum IP packet size minus the vxlan-on-IPv6 overhead),
> effectively restoring the behaviour prior to 4.3.  Although the
> vxlan_config struct already had a mtu field for this,
> vxlan_dev_configure mostly ignored it; that is also addressed here.

I agree with Jesse that this is fine. The tunnel net_device is a
logical device anyway and the real MTU is not known at this point
until the underlay route has been resolved.

It is really a model we should move away from though. Unless the
endpoints negotiate a proper MTU, the underlay can't possible
cope with this correctly without major performance problems due
to fragmentation.

> @@ -2810,6 +2809,12 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
>  		needed_headroom = lowerdev->hard_header_len;
>  	}
>  
> +	if (conf->mtu) {
> +		err = __vxlan_change_mtu(dev, lowerdev, dst, conf->mtu);

OK, I see the reason for the indirection in the first patch. Never
mind.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-07  0:46             ` Jesse Gross
@ 2016-01-07 11:49               ` Thomas Graf
       [not found]                 ` <20160107114935.GJ32456-4EA/1caXOu0mYvmMESoHnA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Graf @ 2016-01-07 11:49 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Hannes Frederic Sowa, David Wragg, David Miller, dev,
	Linux Kernel Network Developers

On 01/06/16 at 04:46pm, Jesse Gross wrote:
> On Wed, Jan 6, 2016 at 4:14 PM, Hannes Frederic Sowa
> > I don't see any other way as to make MTUs part of the flow if we want to
> > have correct ip_local_error notifications. And those must also work across
> > VMs, so openvswitch in quasi brouting mode would need to emit ICMP PtBs
> > (hopefully with a correct source address, otherwise uRPF kills them before
> > reaching the applications) or do error signaling via virtio_net.
> 
> I actually implemented this a long time ago and then there was some
> additional discussion on this about a year ago. I agree it's the right
> solution overall but it's not entirely clearly to me how to get the
> details correct.

When I looked into this last, the wildcard flow model of OVS  made this
difficult to get 100% right. That said, I don't think we have to do
the actual dropping in OVS itself but the signaling has to back to OVS
and ultimately the source. We don't want to replicate the entire flow
cache model in OVS.

A simple start could be to add a new return code for > MTU drops in
the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
ovs_vport_send() and emit proper ICMPs.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices
  2016-01-07 11:31         ` David Wragg
@ 2016-01-07 11:50           ` Thomas Graf
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Graf @ 2016-01-07 11:50 UTC (permalink / raw)
  To: David Wragg; +Cc: netdev, dev

On 01/07/16 at 11:31am, David Wragg wrote:
> Thomas Graf <tgraf@suug.ch> writes:
> >> +	int max_mtu = 65535;
> >
> > This should probably be represented as a new const DEV_MAX_MTU which
> > can be used by veth, tun, and virtio as well instead of hardcoding
> > this separately in each driver.
> 
> I discovered IP_MAX_MTU in net/route.h after putting the patch together.
> Seems appropriate to use that?

That seems fine for both patch 1 and 2.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
       [not found]                 ` <20160107114935.GJ32456-4EA/1caXOu0mYvmMESoHnA@public.gmane.org>
@ 2016-01-07 16:35                   ` Jesse Gross
  2016-01-07 17:21                     ` [ovs-dev] " Thomas Graf
  0 siblings, 1 reply; 28+ messages in thread
From: Jesse Gross @ 2016-01-07 16:35 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, David Wragg, David Miller,
	Hannes Frederic Sowa, Linux Kernel Network Developers

On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 01/06/16 at 04:46pm, Jesse Gross wrote:
>> On Wed, Jan 6, 2016 at 4:14 PM, Hannes Frederic Sowa
>> > I don't see any other way as to make MTUs part of the flow if we want to
>> > have correct ip_local_error notifications. And those must also work across
>> > VMs, so openvswitch in quasi brouting mode would need to emit ICMP PtBs
>> > (hopefully with a correct source address, otherwise uRPF kills them before
>> > reaching the applications) or do error signaling via virtio_net.
>>
>> I actually implemented this a long time ago and then there was some
>> additional discussion on this about a year ago. I agree it's the right
>> solution overall but it's not entirely clearly to me how to get the
>> details correct.
>
> When I looked into this last, the wildcard flow model of OVS  made this
> difficult to get 100% right. That said, I don't think we have to do
> the actual dropping in OVS itself but the signaling has to back to OVS
> and ultimately the source. We don't want to replicate the entire flow
> cache model in OVS.
>
> A simple start could be to add a new return code for > MTU drops in
> the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
> ovs_vport_send() and emit proper ICMPs.

That could be interesting. The problem in the past was making sure
that ICMPs that are generated fit in the virtual network appropriately
- right addresses, etc. This requires either spoofing addresses or
some additional knowledge about the topology that we don't currently
have in the kernel.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-07 16:35                   ` Jesse Gross
@ 2016-01-07 17:21                     ` Thomas Graf
  2016-01-07 17:50                       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Graf @ 2016-01-07 17:21 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Hannes Frederic Sowa, David Wragg, David Miller, dev,
	Linux Kernel Network Developers

On 01/07/16 at 08:35am, Jesse Gross wrote:
> On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > A simple start could be to add a new return code for > MTU drops in
> > the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
> > ovs_vport_send() and emit proper ICMPs.
> 
> That could be interesting. The problem in the past was making sure
> that ICMPs that are generated fit in the virtual network appropriately
> - right addresses, etc. This requires either spoofing addresses or
> some additional knowledge about the topology that we don't currently
> have in the kernel.

Are you worried about emitting an ICMP with a source which is not
a local host address?

Can't we just use icmp_send() in the context of the inner header and
feed it to the flow table to send it back? It should be the same as
for ip_forward().

skb->dev or skb->dst should lead us to the real MTU which can be
included in the ICMP frag needed. It's a bit tricky because we would
have to know whether it was encapsulated or not and adjust
accordingly.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-07 17:21                     ` [ovs-dev] " Thomas Graf
@ 2016-01-07 17:50                       ` Hannes Frederic Sowa
       [not found]                         ` <568EA55A.7070305-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Frederic Sowa @ 2016-01-07 17:50 UTC (permalink / raw)
  To: Thomas Graf, Jesse Gross
  Cc: David Wragg, David Miller, dev, Linux Kernel Network Developers

On 07.01.2016 18:21, Thomas Graf wrote:
> On 01/07/16 at 08:35am, Jesse Gross wrote:
>> On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote:
>>> A simple start could be to add a new return code for > MTU drops in
>>> the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
>>> ovs_vport_send() and emit proper ICMPs.
>>
>> That could be interesting. The problem in the past was making sure
>> that ICMPs that are generated fit in the virtual network appropriately
>> - right addresses, etc. This requires either spoofing addresses or
>> some additional knowledge about the topology that we don't currently
>> have in the kernel.
>
> Are you worried about emitting an ICMP with a source which is not
> a local host address?

We have uRPF enabled for IPv4 by default on all kernels. Thus if we 
generate an IPv4 ICMP packet back with an error message it must have a 
source address which the receiving kernel considers valid. Valid means 
that sending to the source address would have used the same outgoing 
interface the ICMP error came in from.

> Can't we just use icmp_send() in the context of the inner header and
> feed it to the flow table to send it back? It should be the same as
> for ip_forward().

The bridge's ip address often has no valid path as seen from the end 
host system receiving the icmp error, because the openvswitch is not 
really part of the L3 forwarding chain.

Faking the address from the packet (e.g. using the destination address 
of the original packet) will make traceroute go nuts.

> skb->dev or skb->dst should lead us to the real MTU which can be
> included in the ICMP frag needed. It's a bit tricky because we would
> have to know whether it was encapsulated or not and adjust
> accordingly.

Exactly, but this would be the way to go regarding figuring out the 
correct mtu.

Normally ethernet devices don't return icmp error messages. E.g. broken 
jumbo frame configuration just leads to silent packet loss because the 
packet is discarded before a router can handle it. Thus it would be best 
in case of local ovs installation if the error is already transported 
back to the client application via the network call stack. This might be 
very difficult in case we enqueue the packet to a backlog queue and 
reschedule softirqs. Probably we need some way of faking source 
addresses from bridges now.... :/

Bye,
Hannes

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
       [not found]                         ` <568EA55A.7070305-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r@public.gmane.org>
@ 2016-01-07 18:40                           ` Thomas Graf
       [not found]                             ` <20160107184042.GB24672-4EA/1caXOu0mYvmMESoHnA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Graf @ 2016-01-07 18:40 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Linux Kernel Network Developers,
	David Miller, David Wragg

On 01/07/16 at 06:50pm, Hannes Frederic Sowa wrote:
> On 07.01.2016 18:21, Thomas Graf wrote:
> >On 01/07/16 at 08:35am, Jesse Gross wrote:
> >>On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote:
> >>>A simple start could be to add a new return code for > MTU drops in
> >>>the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
> >>>ovs_vport_send() and emit proper ICMPs.
> >>
> >>That could be interesting. The problem in the past was making sure
> >>that ICMPs that are generated fit in the virtual network appropriately
> >>- right addresses, etc. This requires either spoofing addresses or
> >>some additional knowledge about the topology that we don't currently
> >>have in the kernel.
> >
> >Are you worried about emitting an ICMP with a source which is not
> >a local host address?
> 
> We have uRPF enabled for IPv4 by default on all kernels. Thus if we generate
> an IPv4 ICMP packet back with an error message it must have a source address
> which the receiving kernel considers valid. Valid means that sending to the
> source address would have used the same outgoing interface the ICMP error
> came in from.

Agreed. I think this is given though as we would reverse the addresses
as icmp_send() already does:

        saddr = iph->daddr;

> >Can't we just use icmp_send() in the context of the inner header and
> >feed it to the flow table to send it back? It should be the same as
> >for ip_forward().
> 
> The bridge's ip address often has no valid path as seen from the end host
> system receiving the icmp error, because the openvswitch is not really part
> of the L3 forwarding chain.

I don't think the IP of the bridge ever comes into play. It shouldn't.
I'm not even sure what could be considered the address of the bridge
;-)

> Faking the address from the packet (e.g. using the destination address of
> the original packet) will make traceroute go nuts.

I think you are worried about an ICMP error from a hop which does not
decrement TTL. I think that's a good point and I think we should only
send an ICMP error if the TTL is decremented in the action list of
the flow for which we have seen a MTU based drop (or TTL=0).

I don't really see a difference between ip_forward(), some
sophisticated tc action or OVS. As soon as they decremented TTL and
perform L3 forwarding, then they should send out ICMP errors to allow
for proper PMTU.

> Normally ethernet devices don't return icmp error messages. E.g. broken
> jumbo frame configuration just leads to silent packet loss because the
> packet is discarded before a router can handle it. Thus it would be best in
> case of local ovs installation if the error is already transported back to
> the client application via the network call stack. This might be very
> difficult in case we enqueue the packet to a backlog queue and reschedule
> softirqs. Probably we need some way of faking source addresses from bridges
> now.... :/

I think the major complications comes from the assumption that OVS is
a bridge. This is not necessarily the case as stated above. If a flow
is doing L3 forwarding, we should send ICMPs as expected from a
router.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-06 23:25       ` David Wragg
  2016-01-06 23:57         ` [ovs-dev] " Jesse Gross
@ 2016-01-07 21:47         ` David Miller
  2016-01-07 23:42           ` David Wragg
  1 sibling, 1 reply; 28+ messages in thread
From: David Miller @ 2016-01-07 21:47 UTC (permalink / raw)
  To: david; +Cc: netdev, dev

From: David Wragg <david@weave.works>
Date: Wed, 06 Jan 2016 23:25:56 +0000

> Considering non-openvswitch scenarios, when using vxlan netdevs
> directly, a vxlan netdev locked to an underlying device supporting jumbo
> frames can use a larger MTU.  It's only vxlan netdevs without an
> underlying device that have the limit of 1500 imposed.  But why
> shouldn't there be the same flexibility to select an MTU for best
> performance in both cases?  Aren't the fragmentation concerns the same?

When there is an underlying device, there are no fragmentation concerns
and PMTU will function properly because the MTU will be set accurately.

> True.  But in what sense is 1500 accurate?  Uses/users of the kernel
> openvswitch code have always had to get this right, making sure that
> the MTU set on a vxlan overlay network conforms to the underlying
> network paths involved.

"right" is a subjective thing.

Here, once again, the poorly thought out amount of flexibility
openvswitch provides is going to have a serious negative consequence
for our implementation.

I'm really tired of seeing this happen over and over again.
Openvswitch's growth and the expansion of it's feature set was
extremely poorly managed.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-07 21:47         ` David Miller
@ 2016-01-07 23:42           ` David Wragg
  2016-01-08  2:48             ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: David Wragg @ 2016-01-07 23:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, dev

David Miller <davem@davemloft.net> writes:
>> Considering non-openvswitch scenarios, when using vxlan netdevs
>> directly, a vxlan netdev locked to an underlying device supporting jumbo
>> frames can use a larger MTU.  It's only vxlan netdevs without an
>> underlying device that have the limit of 1500 imposed.  But why
>> shouldn't there be the same flexibility to select an MTU for best
>> performance in both cases?  Aren't the fragmentation concerns the same?
>
> When there is an underlying device, there are no fragmentation concerns
> and PMTU will function properly because the MTU will be set
> accurately.

I can see how that works in the case where the vxlan endpoints are on
the same subnet of the underlying network.  But in the general case, the
"accurate" vxlan MTU to avoid fragmentation concerns should correspond
to the path MTU between the endpoints.  How does knowing an underlying
device help with that?  It's reasonable to calculate a default MTU for
the vxlan device based on an underlying device, if one is specified.
But unless the endpoints are on the same subnet, that's just a guess.

(In the case of our application, we have openvswitch configured to set
DF on the vxlan packets.  So we know we are not getting inadvertent
fragmentation.)

>> True.  But in what sense is 1500 accurate?  Uses/users of the kernel
>> openvswitch code have always had to get this right, making sure that
>> the MTU set on a vxlan overlay network conforms to the underlying
>> network paths involved.
>
> "right" is a subjective thing.
>
> Here, once again, the poorly thought out amount of flexibility
> openvswitch provides is going to have a serious negative consequence
> for our implementation.
>
> I'm really tired of seeing this happen over and over again.
> Openvswitch's growth and the expansion of it's feature set was
> extremely poorly managed.

I'm sorry that there is some angst around the openvswitch kernel
subsystem.  I wasn't involved in its development, so I have no strong
opinion on those issues.  But we use it in our application, and we'd
like to find a way to restore the ability to perform vxlan encapsulation
of packets greater than 1500 bytes from openvswitch, as we could prior
to 4.3.

Is there another path to that same goal that is less contentious than my
proposal?

I'm willing to follow up on Jesse's request to look into the other
tunnel types too, but at the moment I'm wondering what the chances are
that the resulting submission would get accepted.

David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-07 23:42           ` David Wragg
@ 2016-01-08  2:48             ` David Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2016-01-08  2:48 UTC (permalink / raw)
  To: david; +Cc: netdev, dev

From: David Wragg <david@weave.works>
Date: Thu, 07 Jan 2016 23:42:52 +0000

> I'm willing to follow up on Jesse's request to look into the other
> tunnel types too, but at the moment I'm wondering what the chances are
> that the resulting submission would get accepted.

I'm ok with these fixes if you look into Jesse's feedback.

My basic gripe is that openvswitch can act as an L3 forwarding agent,
but does not contain the necessary state (f.e. topology information in
the form of a full FIB table) in order to behave like it should.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
       [not found]                             ` <20160107184042.GB24672-4EA/1caXOu0mYvmMESoHnA@public.gmane.org>
@ 2016-01-08 21:29                               ` Hannes Frederic Sowa
  2016-01-10 10:49                                 ` [ovs-dev] " Thomas Graf
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Frederic Sowa @ 2016-01-08 21:29 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Linux Kernel Network Developers,
	David Miller, David Wragg

On 07.01.2016 19:40, Thomas Graf wrote:
> On 01/07/16 at 06:50pm, Hannes Frederic Sowa wrote:
>> On 07.01.2016 18:21, Thomas Graf wrote:
>>> On 01/07/16 at 08:35am, Jesse Gross wrote:
>>>> On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote:
>>>>> A simple start could be to add a new return code for > MTU drops in
>>>>> the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
>>>>> ovs_vport_send() and emit proper ICMPs.
>>>>
>>>> That could be interesting. The problem in the past was making sure
>>>> that ICMPs that are generated fit in the virtual network appropriately
>>>> - right addresses, etc. This requires either spoofing addresses or
>>>> some additional knowledge about the topology that we don't currently
>>>> have in the kernel.
>>>
>>> Are you worried about emitting an ICMP with a source which is not
>>> a local host address?
>>
>> We have uRPF enabled for IPv4 by default on all kernels. Thus if we generate
>> an IPv4 ICMP packet back with an error message it must have a source address
>> which the receiving kernel considers valid. Valid means that sending to the
>> source address would have used the same outgoing interface the ICMP error
>> came in from.
>
> Agreed. I think this is given though as we would reverse the addresses
> as icmp_send() already does:
>
>          saddr = iph->daddr;
>
>>> Can't we just use icmp_send() in the context of the inner header and
>>> feed it to the flow table to send it back? It should be the same as
>>> for ip_forward().
>>
>> The bridge's ip address often has no valid path as seen from the end host
>> system receiving the icmp error, because the openvswitch is not really part
>> of the L3 forwarding chain.
>
> I don't think the IP of the bridge ever comes into play. It shouldn't.
> I'm not even sure what could be considered the address of the bridge
> ;-)

Yes, exactly. :)

>
>> Faking the address from the packet (e.g. using the destination address of
>> the original packet) will make traceroute go nuts.
>
> I think you are worried about an ICMP error from a hop which does not
> decrement TTL. I think that's a good point and I think we should only
> send an ICMP error if the TTL is decremented in the action list of
> the flow for which we have seen a MTU based drop (or TTL=0).

Also agreed, ovs must act in routing mode but at the same time must have 
an IP address on the path. I think this is actually the problem.

Currently we have no way to feedback an error in current configurations 
with ovs sitting in another namespace for e.g. docker containers:

We traverse a net namespace so we drop skb->sk, we don't hold any socket 
reference to enqueue an PtB error to the original socket.

We mostly use netif_rx_internal queues the socket on the backlog, so we 
can't signal an error over the callstack either.

And ovs does not necessarily have an ip address as the first hop of the 
namespace or the virtual machine, so it cannot know a valid ip address 
with which to reply, no?

> I don't really see a difference between ip_forward(), some
> sophisticated tc action or OVS. As soon as they decremented TTL and
> perform L3 forwarding, then they should send out ICMP errors to allow
> for proper PMTU.

Yes, but depending on the ip configuration, those icmps will then be 
dropped in the reverse path filter.

>> Normally ethernet devices don't return icmp error messages. E.g. broken
>> jumbo frame configuration just leads to silent packet loss because the
>> packet is discarded before a router can handle it. Thus it would be best in
>> case of local ovs installation if the error is already transported back to
>> the client application via the network call stack. This might be very
>> difficult in case we enqueue the packet to a backlog queue and reschedule
>> softirqs. Probably we need some way of faking source addresses from bridges
>> now.... :/
>
> I think the major complications comes from the assumption that OVS is
> a bridge. This is not necessarily the case as stated above. If a flow
> is doing L3 forwarding, we should send ICMPs as expected from a
> router.

If we are doing L3 forwarding into a tunnel, this is absolutely correct 
and can be easily done.

Bye,
Hannes


_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices
       [not found]     ` <1452087186-12926-2-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
  2016-01-07 11:24       ` Thomas Graf
@ 2016-01-09 18:39       ` roopa
  2016-01-10 10:28         ` [ovs-dev] " Thomas Graf
  1 sibling, 1 reply; 28+ messages in thread
From: roopa @ 2016-01-09 18:39 UTC (permalink / raw)
  To: David Wragg; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA

On 1/6/16, 5:33 AM, David Wragg wrote:
> Allow the MTU of vxlan devices without an underlying device to be set to
> larger values (up to a maximum based on IP packet limits and vxlan
> overhead).
>
> Previously, their MTUs could not be set to higher than the conventional
> ethernet value of 1500.  This is a very arbitrary value in the context
> of vxlan, and prevented such vxlan devices from being able to take
> advantage of jumbo frames etc.
>
> The default MTU remains 1500, for compatibility.
>
> Signed-off-by: David Wragg <david@weave.works>
>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>

I have an internal patch which does the same thing and
was hoping to post it soon.
I am not using ovs. so, I am not closely following the thread on the other
patch in the series. But, this patch certainly stands on its own and is required.

Thanks!




_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ovs-dev] [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices
  2016-01-09 18:39       ` roopa
@ 2016-01-10 10:28         ` Thomas Graf
  2016-01-27 16:39           ` roopa
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Graf @ 2016-01-10 10:28 UTC (permalink / raw)
  To: roopa; +Cc: David Wragg, dev, netdev

On 01/09/16 at 10:39am, roopa wrote:
> On 1/6/16, 5:33 AM, David Wragg wrote:
> > Allow the MTU of vxlan devices without an underlying device to be set to
> > larger values (up to a maximum based on IP packet limits and vxlan
> > overhead).
> >
> > Previously, their MTUs could not be set to higher than the conventional
> > ethernet value of 1500.  This is a very arbitrary value in the context
> > of vxlan, and prevented such vxlan devices from being able to take
> > advantage of jumbo frames etc.
> >
> > The default MTU remains 1500, for compatibility.
> >
> > Signed-off-by: David Wragg <david@weave.works>
> >
> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> I have an internal patch which does the same thing and
> was hoping to post it soon.
> I am not using ovs. so, I am not closely following the thread on the other
> patch in the series. But, this patch certainly stands on its own and is required.

Agreed. In fact the issue described is not OVS specific, anyone using a
tunnel device in metadata mode benefits form this but is also exposed
to the MTU issue.

We either create a tunnel device for each underlay device and thus
expose the baremetal MTU into the virtual network thus allowing for
the L3 in the virtual network to check the MTU or we will not notice
until we hit the underlay in which the context for the ICMP is much
less useful.

I'll think about how to solve this as discussed in the other portion
of this thread as I assume you will be interested in a fix for this as
well.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
  2016-01-08 21:29                               ` Hannes Frederic Sowa
@ 2016-01-10 10:49                                 ` Thomas Graf
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Graf @ 2016-01-10 10:49 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jesse Gross, David Wragg, David Miller, dev,
	Linux Kernel Network Developers

On 01/08/16 at 10:29pm, Hannes Frederic Sowa wrote:
> On 07.01.2016 19:40, Thomas Graf wrote:
> >I think you are worried about an ICMP error from a hop which does not
> >decrement TTL. I think that's a good point and I think we should only
> >send an ICMP error if the TTL is decremented in the action list of
> >the flow for which we have seen a MTU based drop (or TTL=0).
> 
> Also agreed, ovs must act in routing mode but at the same time must have an
> IP address on the path. I think this is actually the problem.
> 
> Currently we have no way to feedback an error in current configurations with
> ovs sitting in another namespace for e.g. docker containers:
> 
> We traverse a net namespace so we drop skb->sk, we don't hold any socket
> reference to enqueue an PtB error to the original socket.
> 
> We mostly use netif_rx_internal queues the socket on the backlog, so we
> can't signal an error over the callstack either.
> 
> And ovs does not necessarily have an ip address as the first hop of the
> namespace or the virtual machine, so it cannot know a valid ip address with
> which to reply, no?

[your last statement moved up here:]
> If we are doing L3 forwarding into a tunnel, this is absolutely correct and
> can be easily done.

OK, I can see where you are going with this. I was assuming pure
virtual networks due to the contexts of these patches.

So an ICMP is always UDP encapsulated or directly delivered to a veth or
tap which runs in its own netns or is a VM of which the IP stack
operates exclusively in the context of the virtual network. The stack of
the OVS host never gets to see the actual ICMPs and rp_filter never gets
into play.

In such a context, the virtual router IPs are typically programmed
into the flow table because they are only valid in the virtual network
context, assigning them to the OVS bridge would be wrong as it
represents the underlay context.

The virtual router address is known in the flow context of the virtual
network though and can be given to the icmp_send variant.

Can you elaborate a bit on your container scenario, is it ovs running
in host netns with veth pairs bridging into container netns?

Shouldn't that be solved with the above as the ICMPs sent back in
return by the local OVS are perfectly valid in the IP stack context of
the container netns?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ovs-dev] [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices
  2016-01-10 10:28         ` [ovs-dev] " Thomas Graf
@ 2016-01-27 16:39           ` roopa
  0 siblings, 0 replies; 28+ messages in thread
From: roopa @ 2016-01-27 16:39 UTC (permalink / raw)
  To: Thomas Graf; +Cc: David Wragg, dev, netdev

On 1/10/16, 2:28 AM, Thomas Graf wrote:
> On 01/09/16 at 10:39am, roopa wrote:
>> On 1/6/16, 5:33 AM, David Wragg wrote:
>>> Allow the MTU of vxlan devices without an underlying device to be set to
>>> larger values (up to a maximum based on IP packet limits and vxlan
>>> overhead).
>>>
>>> Previously, their MTUs could not be set to higher than the conventional
>>> ethernet value of 1500.  This is a very arbitrary value in the context
>>> of vxlan, and prevented such vxlan devices from being able to take
>>> advantage of jumbo frames etc.
>>>
>>> The default MTU remains 1500, for compatibility.
>>>
>>> Signed-off-by: David Wragg <david@weave.works>
>>>
>> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> I have an internal patch which does the same thing and
>> was hoping to post it soon.
>> I am not using ovs. so, I am not closely following the thread on the other
>> patch in the series. But, this patch certainly stands on its own and is required.
> Agreed. In fact the issue described is not OVS specific, anyone using a
> tunnel device in metadata mode benefits form this but is also exposed
> to the MTU issue.
>
> We either create a tunnel device for each underlay device and thus
> expose the baremetal MTU into the virtual network thus allowing for
> the L3 in the virtual network to check the MTU or we will not notice
> until we hit the underlay in which the context for the ICMP is much
> less useful.
>
> I'll think about how to solve this as discussed in the other portion
> of this thread as I assume you will be interested in a fix for this as
> well.
thanks thomas, will watch the thread. for now I need this for the vxlan netdevice on
my vxlan gateway. I don't really configure a default dst.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-01-27 16:39 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-06 13:33 [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices David Wragg
     [not found] ` <1452087186-12926-1-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
2016-01-06 13:33   ` [PATCH net 1/2] vxlan: Relax the MTU constraint on " David Wragg
     [not found]     ` <1452087186-12926-2-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
2016-01-07 11:24       ` Thomas Graf
2016-01-07 11:31         ` David Wragg
2016-01-07 11:50           ` Thomas Graf
2016-01-09 18:39       ` roopa
2016-01-10 10:28         ` [ovs-dev] " Thomas Graf
2016-01-27 16:39           ` roopa
2016-01-06 20:59   ` [PATCH net 0/2] vxlan: Set a large MTU on ovs-created " David Miller
     [not found]     ` <20160106.155950.1007160228570301281.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2016-01-06 22:53       ` Jesse Gross
2016-01-06 23:25       ` David Wragg
2016-01-06 23:57         ` [ovs-dev] " Jesse Gross
2016-01-07  0:14           ` Hannes Frederic Sowa
2016-01-07  0:46             ` Jesse Gross
2016-01-07 11:49               ` Thomas Graf
     [not found]                 ` <20160107114935.GJ32456-4EA/1caXOu0mYvmMESoHnA@public.gmane.org>
2016-01-07 16:35                   ` Jesse Gross
2016-01-07 17:21                     ` [ovs-dev] " Thomas Graf
2016-01-07 17:50                       ` Hannes Frederic Sowa
     [not found]                         ` <568EA55A.7070305-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r@public.gmane.org>
2016-01-07 18:40                           ` Thomas Graf
     [not found]                             ` <20160107184042.GB24672-4EA/1caXOu0mYvmMESoHnA@public.gmane.org>
2016-01-08 21:29                               ` Hannes Frederic Sowa
2016-01-10 10:49                                 ` [ovs-dev] " Thomas Graf
     [not found]           ` <CAEh+42iWSZOyikNydU2Bs8meqYfrKfUJLDGFJ8HzQ06k64LP0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-01-07  0:29             ` David Wragg
     [not found]               ` <86wprmp6z6.fsf-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
2016-01-07  1:10                 ` Jesse Gross
2016-01-07 21:47         ` David Miller
2016-01-07 23:42           ` David Wragg
2016-01-08  2:48             ` David Miller
2016-01-06 13:33 ` [PATCH net 2/2] " David Wragg
     [not found]   ` <1452087186-12926-3-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
2016-01-07 11:36     ` Thomas Graf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.