* [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
@ 2016-01-06 13:33 David Wragg
       [not found] ` <1452087186-12926-1-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org>
  2016-01-06 13:33 ` [PATCH net 2/2] " David Wragg
  0 siblings, 2 replies; 28+ messages in thread
From: David Wragg @ 2016-01-06 13:33 UTC
To: netdev, dev; +Cc: David Wragg

Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of
any size, constrained only by the ability to transmit the resulting
UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan
vports. These netdevs have an MTU, which limits the size of a packet
that can be successfully vxlan-encapsulated. The default value for
this MTU is 1500, which is awkwardly small, and leads to a conspicuous
change in behaviour for userspace.

These two patches set the MTU on openvswitch-created vxlan devices to
be 65465 (the maximum IP packet size minus the vxlan-on-IPv6
overhead), effectively restoring the behaviour prior to 4.3. In order
to accomplish this, the first patch removes the MTU constraint of 1500
for vxlan netdevs without an underlying device.

David Wragg (2):
  vxlan: Relax the MTU constraint on vxlan devices
  vxlan: Set a large MTU on ovs-created vxlan devices

 drivers/net/vxlan.c           | 38 +++++++++++++++++++++++++-------------
 net/openvswitch/vport-vxlan.c |  2 ++
 2 files changed, 27 insertions(+), 13 deletions(-)

-- 
2.5.0
* [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices [not found] ` <1452087186-12926-1-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org> @ 2016-01-06 13:33 ` David Wragg [not found] ` <1452087186-12926-2-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org> 2016-01-06 20:59 ` [PATCH net 0/2] vxlan: Set a large MTU on ovs-created " David Miller 1 sibling, 1 reply; 28+ messages in thread From: David Wragg @ 2016-01-06 13:33 UTC (permalink / raw) To: netdev-u79uwXL29TY76Z2rM5mHXA, dev-yBygre7rU0TnMu66kgdUjQ; +Cc: David Wragg Allow the MTU of vxlan devices without an underlying device to be set to larger values (up to a maximum based on IP packet limits and vxlan overhead). Previously, their MTUs could not be set to higher than the conventional ethernet value of 1500. This is a very arbitrary value in the context of vxlan, and prevented such vxlan devices from being able to take advantage of jumbo frames etc. The default MTU remains 1500, for compatibility. 
Signed-off-by: David Wragg <david@weave.works>
---
 drivers/net/vxlan.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ba363ce..96d1c55 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2359,21 +2359,19 @@ static void vxlan_set_multicast_list(struct net_device *dev)
 {
 }
 
-static int vxlan_change_mtu(struct net_device *dev, int new_mtu)
+static int __vxlan_change_mtu(struct net_device *dev,
+			      struct net_device *lowerdev,
+			      struct vxlan_rdst *dst, int new_mtu)
 {
-	struct vxlan_dev *vxlan = netdev_priv(dev);
-	struct vxlan_rdst *dst = &vxlan->default_dst;
-	struct net_device *lowerdev;
-	int max_mtu;
+	int max_mtu = 65535;
 
-	lowerdev = __dev_get_by_index(vxlan->net, dst->remote_ifindex);
-	if (lowerdev == NULL)
-		return eth_change_mtu(dev, new_mtu);
+	if (lowerdev)
+		max_mtu = lowerdev->mtu;
 
 	if (dst->remote_ip.sa.sa_family == AF_INET6)
-		max_mtu = lowerdev->mtu - VXLAN6_HEADROOM;
+		max_mtu -= VXLAN6_HEADROOM;
 	else
-		max_mtu = lowerdev->mtu - VXLAN_HEADROOM;
+		max_mtu -= VXLAN_HEADROOM;
 
 	if (new_mtu < 68 || new_mtu > max_mtu)
 		return -EINVAL;
@@ -2382,6 +2380,15 @@ static int vxlan_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+static int vxlan_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_rdst *dst = &vxlan->default_dst;
+	struct net_device *lowerdev = __dev_get_by_index(vxlan->net,
+							 dst->remote_ifindex);
+	return __vxlan_change_mtu(dev, lowerdev, dst, new_mtu);
+}
+
 static int egress_ipv4_tun_info(struct net_device *dev, struct sk_buff *skb,
 				struct ip_tunnel_info *info,
 				__be16 sport, __be16 dport)
-- 
2.5.0

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
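[Editor's illustration, not part of the posted patch.] The bound that the new __vxlan_change_mtu() enforces can be modelled in plain userspace C. This is only a sketch: the MODEL_* names and model_vxlan_check_mtu() are invented here, a non-positive lower_mtu stands in for "lowerdev == NULL", and the headroom values are assumed to be outer IP header + UDP + VXLAN + inner Ethernet, which is consistent with the 65465 figure quoted in the cover letter (65535 - 70).

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed encapsulation overhead: outer IP + UDP + VXLAN + inner Ethernet. */
#define MODEL_VXLAN_HEADROOM   (20 + 8 + 8 + 14)   /* IPv4 underlay: 50 bytes */
#define MODEL_VXLAN6_HEADROOM  (40 + 8 + 8 + 14)   /* IPv6 underlay: 70 bytes */
#define MODEL_MAX_IP_PACKET    65535               /* literal used in the patch */
#define MODEL_MIN_MTU          68                  /* lower bound kept by the patch */

/*
 * Userspace model of the check in the patched function.
 * lower_mtu <= 0 means "no underlying device" (hypothetical convention).
 * Returns 0 if new_mtu would be accepted, -EINVAL otherwise.
 */
int model_vxlan_check_mtu(int lower_mtu, bool remote_is_ipv6, int new_mtu)
{
	int max_mtu = MODEL_MAX_IP_PACKET;

	/* With an underlying device, its MTU is the starting point. */
	if (lower_mtu > 0)
		max_mtu = lower_mtu;

	/* Subtract the per-family encapsulation overhead. */
	max_mtu -= remote_is_ipv6 ? MODEL_VXLAN6_HEADROOM : MODEL_VXLAN_HEADROOM;

	if (new_mtu < MODEL_MIN_MTU || new_mtu > max_mtu)
		return -EINVAL;
	return 0;
}
```

Under these assumptions, a device with no lowerdev and an IPv6 default destination accepts anything up to 65465, while one bound to a 1500-byte lowerdev over IPv4 accepts at most 1450.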
* Re: [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices [not found] ` <1452087186-12926-2-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org> @ 2016-01-07 11:24 ` Thomas Graf 2016-01-07 11:31 ` David Wragg 2016-01-09 18:39 ` roopa 1 sibling, 1 reply; 28+ messages in thread From: Thomas Graf @ 2016-01-07 11:24 UTC (permalink / raw) To: David Wragg; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA On 01/06/16 at 01:33pm, David Wragg wrote: > Allow the MTU of vxlan devices without an underlying device to be set to > larger values (up to a maximum based on IP packet limits and vxlan > overhead). > > Previously, their MTUs could not be set to higher than the conventional > ethernet value of 1500. This is a very arbitrary value in the context > of vxlan, and prevented such vxlan devices from being able to take > advantage of jumbo frames etc. > > The default MTU remains 1500, for compatibility. > > Signed-off-by: David Wragg <david@weave.works> A remain of eth_change_mtu > + int max_mtu = 65535; This should probably be represented as a new const DEV_MAX_MTU which can be used by veth, tun, and virtio as well instead of hardcoding this separately in each driver. > +static int vxlan_change_mtu(struct net_device *dev, int new_mtu) > +{ > + struct vxlan_dev *vxlan = netdev_priv(dev); > + struct vxlan_rdst *dst = &vxlan->default_dst; > + struct net_device *lowerdev = __dev_get_by_index(vxlan->net, > + dst->remote_ifindex); > + return __vxlan_change_mtu(dev, lowerdev, dst, new_mtu); Any particular reason for the indirection? _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices 2016-01-07 11:24 ` Thomas Graf @ 2016-01-07 11:31 ` David Wragg 2016-01-07 11:50 ` Thomas Graf 0 siblings, 1 reply; 28+ messages in thread From: David Wragg @ 2016-01-07 11:31 UTC (permalink / raw) To: Thomas Graf; +Cc: netdev, dev Thomas Graf <tgraf@suug.ch> writes: >> + int max_mtu = 65535; > > This should probably be represented as a new const DEV_MAX_MTU which > can be used by veth, tun, and virtio as well instead of hardcoding > this separately in each driver. I discovered IP_MAX_MTU in net/route.h after putting the patch together. Seems appropriate to use that? >> +static int vxlan_change_mtu(struct net_device *dev, int new_mtu) >> +{ >> + struct vxlan_dev *vxlan = netdev_priv(dev); >> + struct vxlan_rdst *dst = &vxlan->default_dst; >> + struct net_device *lowerdev = __dev_get_by_index(vxlan->net, >> + dst->remote_ifindex); >> + return __vxlan_change_mtu(dev, lowerdev, dst, new_mtu); > > Any particular reason for the indirection? To make patch 2/2 simpler. I can rearrange to eliminate the indirection here if that is preferred. David ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices
  2016-01-07 11:31       ` David Wragg
@ 2016-01-07 11:50         ` Thomas Graf
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Graf @ 2016-01-07 11:50 UTC
To: David Wragg; +Cc: netdev, dev

On 01/07/16 at 11:31am, David Wragg wrote:
> Thomas Graf <tgraf@suug.ch> writes:
> >> + int max_mtu = 65535;
> >
> > This should probably be represented as a new const DEV_MAX_MTU which
> > can be used by veth, tun, and virtio as well instead of hardcoding
> > this separately in each driver.
>
> I discovered IP_MAX_MTU in net/route.h after putting the patch together.
> Seems appropriate to use that?

That seems fine for both patch 1 and 2.
* Re: [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices [not found] ` <1452087186-12926-2-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org> 2016-01-07 11:24 ` Thomas Graf @ 2016-01-09 18:39 ` roopa 2016-01-10 10:28 ` [ovs-dev] " Thomas Graf 1 sibling, 1 reply; 28+ messages in thread From: roopa @ 2016-01-09 18:39 UTC (permalink / raw) To: David Wragg; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA On 1/6/16, 5:33 AM, David Wragg wrote: > Allow the MTU of vxlan devices without an underlying device to be set to > larger values (up to a maximum based on IP packet limits and vxlan > overhead). > > Previously, their MTUs could not be set to higher than the conventional > ethernet value of 1500. This is a very arbitrary value in the context > of vxlan, and prevented such vxlan devices from being able to take > advantage of jumbo frames etc. > > The default MTU remains 1500, for compatibility. > > Signed-off-by: David Wragg <david@weave.works> > Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> I have an internal patch which does the same thing and was hoping to post it soon. I am not using ovs. so, I am not closely following the thread on the other patch in the series. But, this patch certainly stands on its own and is required. Thanks! _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [ovs-dev] [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices 2016-01-09 18:39 ` roopa @ 2016-01-10 10:28 ` Thomas Graf 2016-01-27 16:39 ` roopa 0 siblings, 1 reply; 28+ messages in thread From: Thomas Graf @ 2016-01-10 10:28 UTC (permalink / raw) To: roopa; +Cc: David Wragg, dev, netdev On 01/09/16 at 10:39am, roopa wrote: > On 1/6/16, 5:33 AM, David Wragg wrote: > > Allow the MTU of vxlan devices without an underlying device to be set to > > larger values (up to a maximum based on IP packet limits and vxlan > > overhead). > > > > Previously, their MTUs could not be set to higher than the conventional > > ethernet value of 1500. This is a very arbitrary value in the context > > of vxlan, and prevented such vxlan devices from being able to take > > advantage of jumbo frames etc. > > > > The default MTU remains 1500, for compatibility. > > > > Signed-off-by: David Wragg <david@weave.works> > > > Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> > > I have an internal patch which does the same thing and > was hoping to post it soon. > I am not using ovs. so, I am not closely following the thread on the other > patch in the series. But, this patch certainly stands on its own and is required. Agreed. In fact the issue described is not OVS specific, anyone using a tunnel device in metadata mode benefits form this but is also exposed to the MTU issue. We either create a tunnel device for each underlay device and thus expose the baremetal MTU into the virtual network thus allowing for the L3 in the virtual network to check the MTU or we will not notice until we hit the underlay in which the context for the ICMP is much less useful. I'll think about how to solve this as discussed in the other portion of this thread as I assume you will be interested in a fix for this as well. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [ovs-dev] [PATCH net 1/2] vxlan: Relax the MTU constraint on vxlan devices 2016-01-10 10:28 ` [ovs-dev] " Thomas Graf @ 2016-01-27 16:39 ` roopa 0 siblings, 0 replies; 28+ messages in thread From: roopa @ 2016-01-27 16:39 UTC (permalink / raw) To: Thomas Graf; +Cc: David Wragg, dev, netdev On 1/10/16, 2:28 AM, Thomas Graf wrote: > On 01/09/16 at 10:39am, roopa wrote: >> On 1/6/16, 5:33 AM, David Wragg wrote: >>> Allow the MTU of vxlan devices without an underlying device to be set to >>> larger values (up to a maximum based on IP packet limits and vxlan >>> overhead). >>> >>> Previously, their MTUs could not be set to higher than the conventional >>> ethernet value of 1500. This is a very arbitrary value in the context >>> of vxlan, and prevented such vxlan devices from being able to take >>> advantage of jumbo frames etc. >>> >>> The default MTU remains 1500, for compatibility. >>> >>> Signed-off-by: David Wragg <david@weave.works> >>> >> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> >> >> I have an internal patch which does the same thing and >> was hoping to post it soon. >> I am not using ovs. so, I am not closely following the thread on the other >> patch in the series. But, this patch certainly stands on its own and is required. > Agreed. In fact the issue described is not OVS specific, anyone using a > tunnel device in metadata mode benefits form this but is also exposed > to the MTU issue. > > We either create a tunnel device for each underlay device and thus > expose the baremetal MTU into the virtual network thus allowing for > the L3 in the virtual network to check the MTU or we will not notice > until we hit the underlay in which the context for the ICMP is much > less useful. > > I'll think about how to solve this as discussed in the other portion > of this thread as I assume you will be interested in a fix for this as > well. thanks thomas, will watch the thread. for now I need this for the vxlan netdevice on my vxlan gateway. 
I don't really configure a default dst.
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices [not found] ` <1452087186-12926-1-git-send-email-david-1SEAoVOfG6VEzL6FDj/jAg@public.gmane.org> 2016-01-06 13:33 ` [PATCH net 1/2] vxlan: Relax the MTU constraint on " David Wragg @ 2016-01-06 20:59 ` David Miller [not found] ` <20160106.155950.1007160228570301281.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> 1 sibling, 1 reply; 28+ messages in thread From: David Miller @ 2016-01-06 20:59 UTC (permalink / raw) To: david-1SEAoVOfG6VEzL6FDj/jAg Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA From: David Wragg <david@weave.works> Date: Wed, 6 Jan 2016 13:33:04 +0000 > Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of > any size, constrained only by the ability to transmit the resulting > UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan > vports. These netdevs have an MTU, which limits the size of a packet > that can be successfully vxlan-encapsulated. The default value for > this MTU is 1500, which is awkwardly small, and leads to a conspicuous > change in behaviour for userspace. > > These two patches set the MTU on openvswitch-crated vxlan devices to > be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 > overhead), effectively restoring the behaviour prior to 4.3. In order > to accomplish this, the first patch removes the MTU constraint of 1500 > for vxlan netdevs without an underlying device. Is this really the right thing to do? Won't we get a lot of fragmentation by using such a large MTU, especially since you're making it the default for OVS setups? Things like path MTU discovery hinge strongly upon accurate MTU settings. Otherwise they won't function properly. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices [not found] ` <20160106.155950.1007160228570301281.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> @ 2016-01-06 22:53 ` Jesse Gross 2016-01-06 23:25 ` David Wragg 1 sibling, 0 replies; 28+ messages in thread From: Jesse Gross @ 2016-01-06 22:53 UTC (permalink / raw) To: David Miller Cc: dev-yBygre7rU0TnMu66kgdUjQ, david-1SEAoVOfG6VEzL6FDj/jAg, Linux Kernel Network Developers On Wed, Jan 6, 2016 at 12:59 PM, David Miller <davem@davemloft.net> wrote: > From: David Wragg <david@weave.works> > Date: Wed, 6 Jan 2016 13:33:04 +0000 > >> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of >> any size, constrained only by the ability to transmit the resulting >> UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan >> vports. These netdevs have an MTU, which limits the size of a packet >> that can be successfully vxlan-encapsulated. The default value for >> this MTU is 1500, which is awkwardly small, and leads to a conspicuous >> change in behaviour for userspace. >> >> These two patches set the MTU on openvswitch-crated vxlan devices to >> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 >> overhead), effectively restoring the behaviour prior to 4.3. In order >> to accomplish this, the first patch removes the MTU constraint of 1500 >> for vxlan netdevs without an underlying device. > > Is this really the right thing to do? Won't we get a lot of fragmentation > by using such a large MTU, especially since you're making it the default > for OVS setups? > > Things like path MTU discovery hinge strongly upon accurate MTU settings. > Otherwise they won't function properly. At a minimum, I don't think this should be VXLAN specific. But I agree that I'm not sure this is the right thing to do. 
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices [not found] ` <20160106.155950.1007160228570301281.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> 2016-01-06 22:53 ` Jesse Gross @ 2016-01-06 23:25 ` David Wragg 2016-01-06 23:57 ` [ovs-dev] " Jesse Gross 2016-01-07 21:47 ` David Miller 1 sibling, 2 replies; 28+ messages in thread From: David Wragg @ 2016-01-06 23:25 UTC (permalink / raw) To: David Miller; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA David Miller <davem@davemloft.net> writes: >> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of >> any size, constrained only by the ability to transmit the resulting >> UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan >> vports. These netdevs have an MTU, which limits the size of a packet >> that can be successfully vxlan-encapsulated. The default value for >> this MTU is 1500, which is awkwardly small, and leads to a conspicuous >> change in behaviour for userspace. >> >> These two patches set the MTU on openvswitch-crated vxlan devices to >> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 >> overhead), effectively restoring the behaviour prior to 4.3. In order >> to accomplish this, the first patch removes the MTU constraint of 1500 >> for vxlan netdevs without an underlying device. > > Is this really the right thing to do? I'm certainly open to suggestions of better ways to solve the problem. To be clear, the problem from our perspective is that a use of the kernel openvswitch that worked fine in 4.2 and earlier is hobbled in 4.3. Previously the MTU of an openvswitch-based vxlan overlay network was constrained only by the MTU of the physical network. In 4.3, we can't take advantage of physical networks that support jumbo frames, causing a huge hit to throughput across the overlay network. The specific limit of 1500 seems very arbitrary. 
For a vxlan overlay network on top of a traditional ethernet network, the "correct" MTU for the vxlan netdevs is 1450 rather than 1500. And in general with openvswitch, the destination for vxlan packets is determined on a packet-by-packet basis, possibly involving different path MTUs of the underlying network. There is no single "correct" value. > Won't we get a lot of fragmentation > by using such a large MTU, especially since you're making it the default > for OVS setups? In the context of the openvswitch vxlan vport transmit path, I can't find a place where the dev->mtu is used (and it would be surprising, on the basis that the relevant parts of vxlan.c have not changed that much since 4.2, when no netdev was involved in that path). Considering non-openvswitch scenarios, when using vxlan netdevs directly, a vxlan netdev locked to an underlying device supporting jumbo frames can use a larger MTU. It's only vxlan netdevs without an underlying device that have the limit of 1500 imposed. But why shouldn't there be the same flexibility to select an MTU for best performance in both cases? Aren't the fragmentation concerns the same? > Things like path MTU discovery hinge strongly upon accurate MTU settings. > Otherwise they won't function properly. True. But in what sense is 1500 accurate? Uses/users of the kernel openvswitch code have always had to get this right, making sure that the MTU set on a vxlan overlay network conforms to the underlying network paths involved. David _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev ^ permalink raw reply [flat|nested] 28+ messages in thread
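[Editor's illustration.] The 1450 figure above, and the point that no single value is "correct" when destinations sit behind different path MTUs, can be made concrete with a few lines of C. This is a sketch under assumptions: an IPv4 underlay with 50 bytes of encapsulation overhead (outer IPv4 + UDP + VXLAN + inner Ethernet), and function names that are invented for this example.

```c
#include <stddef.h>

/* Assumed VXLAN-over-IPv4 overhead: 20 (IP) + 8 (UDP) + 8 (VXLAN) + 14 (Ethernet). */
#define SKETCH_VXLAN_OVERHEAD_V4 (20 + 8 + 8 + 14)

/* Largest inner MTU that fits one underlay path without fragmentation. */
int overlay_mtu_for_path(int path_mtu)
{
	return path_mtu - SKETCH_VXLAN_OVERHEAD_V4;
}

/*
 * When packets may take any of several underlay paths, only the smallest
 * per-path result is safe on all of them. Returns -1 for an empty list.
 */
int safe_overlay_mtu(const int *path_mtus, size_t n)
{
	int best;
	size_t i;

	if (n == 0)
		return -1;

	best = overlay_mtu_for_path(path_mtus[0]);
	for (i = 1; i < n; i++) {
		int m = overlay_mtu_for_path(path_mtus[i]);

		if (m < best)
			best = m;
	}
	return best;
}
```

A 1500-byte path gives 1450 and a 9000-byte jumbo path gives 8950; a device-wide MTU has to settle on the minimum over every path it might use, which is why a fixed default suits some deployments and penalises others.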
* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices 2016-01-06 23:25 ` David Wragg @ 2016-01-06 23:57 ` Jesse Gross 2016-01-07 0:14 ` Hannes Frederic Sowa [not found] ` <CAEh+42iWSZOyikNydU2Bs8meqYfrKfUJLDGFJ8HzQ06k64LP0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-01-07 21:47 ` David Miller 1 sibling, 2 replies; 28+ messages in thread From: Jesse Gross @ 2016-01-06 23:57 UTC (permalink / raw) To: David Wragg; +Cc: David Miller, dev, Linux Kernel Network Developers On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote: > David Miller <davem@davemloft.net> writes: >>> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of >>> any size, constrained only by the ability to transmit the resulting >>> UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan >>> vports. These netdevs have an MTU, which limits the size of a packet >>> that can be successfully vxlan-encapsulated. The default value for >>> this MTU is 1500, which is awkwardly small, and leads to a conspicuous >>> change in behaviour for userspace. >>> >>> These two patches set the MTU on openvswitch-crated vxlan devices to >>> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 >>> overhead), effectively restoring the behaviour prior to 4.3. In order >>> to accomplish this, the first patch removes the MTU constraint of 1500 >>> for vxlan netdevs without an underlying device. >> >> Is this really the right thing to do? > > I'm certainly open to suggestions of better ways to solve the problem. One option is to simply set the MTU on the device from userspace. The reality is that the code you're modifying is compatibility code. Maybe we should make this change to preserve the old behavior for old callers (although, again, it should not be just for VXLAN). But no new features or tunnel types will be supported in this manner. New or updated userspace programs should work by simply creating and adding tunnel devices to OVS. 
That won't go through this path at all so you're going to need to find another approach in the near future in any case.
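[Editor's illustration.] Setting the MTU on a device from userspace, as suggested above, is commonly done with the SIOCSIFMTU ioctl on Linux. The sketch below assumes a glibc system; the helper names and the device name "vxlan0" mentioned afterwards are hypothetical, and the call needs CAP_NET_ADMIN to succeed. Whether the kernel accepts a given value still depends on the device's own MTU check.

```c
#define _DEFAULT_SOURCE
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill an ifreq for an MTU change; returns -1 if the name does not fit. */
int prepare_mtu_request(struct ifreq *ifr, const char *ifname, int mtu)
{
	if (strlen(ifname) >= IFNAMSIZ)
		return -1;

	memset(ifr, 0, sizeof(*ifr));
	strcpy(ifr->ifr_name, ifname);	/* length checked above */
	ifr->ifr_mtu = mtu;
	return 0;
}

/* Ask the kernel to apply the MTU. Returns 0 on success, -1 on failure. */
int set_device_mtu(const char *ifname, int mtu)
{
	struct ifreq ifr;
	int fd, ret;

	if (prepare_mtu_request(&ifr, ifname, mtu) < 0)
		return -1;

	/* Any socket works as a handle for interface ioctls. */
	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, SIOCSIFMTU, &ifr);
	close(fd);
	return ret;
}
```

A caller would use something like set_device_mtu("vxlan0", 8950) after creating the tunnel device and before attaching it to OVS.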
* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices 2016-01-06 23:57 ` [ovs-dev] " Jesse Gross @ 2016-01-07 0:14 ` Hannes Frederic Sowa 2016-01-07 0:46 ` Jesse Gross [not found] ` <CAEh+42iWSZOyikNydU2Bs8meqYfrKfUJLDGFJ8HzQ06k64LP0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 28+ messages in thread From: Hannes Frederic Sowa @ 2016-01-07 0:14 UTC (permalink / raw) To: Jesse Gross, David Wragg Cc: David Miller, dev, Linux Kernel Network Developers Hi, On 07.01.2016 00:57, Jesse Gross wrote: > On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote: >> David Miller <davem@davemloft.net> writes: >>>> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of >>>> any size, constrained only by the ability to transmit the resulting >>>> UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan >>>> vports. These netdevs have an MTU, which limits the size of a packet >>>> that can be successfully vxlan-encapsulated. The default value for >>>> this MTU is 1500, which is awkwardly small, and leads to a conspicuous >>>> change in behaviour for userspace. >>>> >>>> These two patches set the MTU on openvswitch-crated vxlan devices to >>>> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 >>>> overhead), effectively restoring the behaviour prior to 4.3. In order >>>> to accomplish this, the first patch removes the MTU constraint of 1500 >>>> for vxlan netdevs without an underlying device. >>> >>> Is this really the right thing to do? >> >> I'm certainly open to suggestions of better ways to solve the problem. > > One option is to simply set the MTU on the device from userspace. > > The reality is that the code you're modifying is compatibility code. > Maybe we should make this change to preserve the old behavior for old > callers (although, again, it should not be just for VXLAN). But no new > features or tunnel types will be supported in this manner. 
> > New or updated userspace programs should work by simply creating and > adding tunnel devices to OVS. That won't go through this path at all > so you're going to need to find another approach in the near future in > any case. I don't see any other way as to make MTUs part of the flow if we want to have correct ip_local_error notifications. And those must also work across VMs, so openvswitch in quasi brouting mode would need to emit ICMP PtBs (hopefully with a correct source address, otherwise uRPF kills them before reaching the applications) or do error signaling via virtio_net. Either the openvswitch user space can feed those information to the datapath or the ovs dataplane can do a lookup on the outer ip address while filling out the metadata_dst and caching it in the flow or we just keep the dst in the flow anyway. So a net_device used by ovs has no real mtu anymore. Bye, Hannes ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices 2016-01-07 0:14 ` Hannes Frederic Sowa @ 2016-01-07 0:46 ` Jesse Gross 2016-01-07 11:49 ` Thomas Graf 0 siblings, 1 reply; 28+ messages in thread From: Jesse Gross @ 2016-01-07 0:46 UTC (permalink / raw) To: Hannes Frederic Sowa Cc: David Wragg, David Miller, dev, Linux Kernel Network Developers On Wed, Jan 6, 2016 at 4:14 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > Hi, > > > On 07.01.2016 00:57, Jesse Gross wrote: >> >> On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote: >>> >>> David Miller <davem@davemloft.net> writes: >>>>> >>>>> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of >>>>> any size, constrained only by the ability to transmit the resulting >>>>> UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan >>>>> vports. These netdevs have an MTU, which limits the size of a packet >>>>> that can be successfully vxlan-encapsulated. The default value for >>>>> this MTU is 1500, which is awkwardly small, and leads to a conspicuous >>>>> change in behaviour for userspace. >>>>> >>>>> These two patches set the MTU on openvswitch-crated vxlan devices to >>>>> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 >>>>> overhead), effectively restoring the behaviour prior to 4.3. In order >>>>> to accomplish this, the first patch removes the MTU constraint of 1500 >>>>> for vxlan netdevs without an underlying device. >>>> >>>> >>>> Is this really the right thing to do? >>> >>> >>> I'm certainly open to suggestions of better ways to solve the problem. >> >> >> One option is to simply set the MTU on the device from userspace. >> >> The reality is that the code you're modifying is compatibility code. >> Maybe we should make this change to preserve the old behavior for old >> callers (although, again, it should not be just for VXLAN). But no new >> features or tunnel types will be supported in this manner. 
>> >> New or updated userspace programs should work by simply creating and >> adding tunnel devices to OVS. That won't go through this path at all >> so you're going to need to find another approach in the near future in >> any case. > > > I don't see any other way as to make MTUs part of the flow if we want to > have correct ip_local_error notifications. And those must also work across > VMs, so openvswitch in quasi brouting mode would need to emit ICMP PtBs > (hopefully with a correct source address, otherwise uRPF kills them before > reaching the applications) or do error signaling via virtio_net. I actually implemented this a long time ago and then there was some additional discussion on this about a year ago. I agree it's the right solution overall but it's not entirely clearly to me how to get the details correct. > Either the openvswitch user space can feed those information to the datapath > or the ovs dataplane can do a lookup on the outer ip address while filling > out the metadata_dst and caching it in the flow or we just keep the dst in > the flow anyway. So a net_device used by ovs has no real mtu anymore. I agree that the concept of MTU is much more complicated than a single number on a device, we just have to find the right way to model it. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices 2016-01-07 0:46 ` Jesse Gross @ 2016-01-07 11:49 ` Thomas Graf [not found] ` <20160107114935.GJ32456-4EA/1caXOu0mYvmMESoHnA@public.gmane.org> 0 siblings, 1 reply; 28+ messages in thread From: Thomas Graf @ 2016-01-07 11:49 UTC (permalink / raw) To: Jesse Gross Cc: Hannes Frederic Sowa, David Wragg, David Miller, dev, Linux Kernel Network Developers On 01/06/16 at 04:46pm, Jesse Gross wrote: > On Wed, Jan 6, 2016 at 4:14 PM, Hannes Frederic Sowa > > I don't see any other way as to make MTUs part of the flow if we want to > > have correct ip_local_error notifications. And those must also work across > > VMs, so openvswitch in quasi brouting mode would need to emit ICMP PtBs > > (hopefully with a correct source address, otherwise uRPF kills them before > > reaching the applications) or do error signaling via virtio_net. > > I actually implemented this a long time ago and then there was some > additional discussion on this about a year ago. I agree it's the right > solution overall but it's not entirely clearly to me how to get the > details correct. When I looked into this last, the wildcard flow model of OVS made this difficult to get 100% right. That said, I don't think we have to do the actual dropping in OVS itself but the signaling has to back to OVS and ultimately the source. We don't want to replicate the entire flow cache model in OVS. A simple start could be to add a new return code for > MTU drops in the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in ovs_vport_send() and emit proper ICMPs. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices [not found] ` <20160107114935.GJ32456-4EA/1caXOu0mYvmMESoHnA@public.gmane.org> @ 2016-01-07 16:35 ` Jesse Gross 2016-01-07 17:21 ` [ovs-dev] " Thomas Graf 0 siblings, 1 reply; 28+ messages in thread From: Jesse Gross @ 2016-01-07 16:35 UTC (permalink / raw) To: Thomas Graf Cc: dev-yBygre7rU0TnMu66kgdUjQ, David Wragg, David Miller, Hannes Frederic Sowa, Linux Kernel Network Developers On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote: > On 01/06/16 at 04:46pm, Jesse Gross wrote: >> On Wed, Jan 6, 2016 at 4:14 PM, Hannes Frederic Sowa >> > I don't see any other way as to make MTUs part of the flow if we want to >> > have correct ip_local_error notifications. And those must also work across >> > VMs, so openvswitch in quasi brouting mode would need to emit ICMP PtBs >> > (hopefully with a correct source address, otherwise uRPF kills them before >> > reaching the applications) or do error signaling via virtio_net. >> >> I actually implemented this a long time ago and then there was some >> additional discussion on this about a year ago. I agree it's the right >> solution overall but it's not entirely clearly to me how to get the >> details correct. > > When I looked into this last, the wildcard flow model of OVS made this > difficult to get 100% right. That said, I don't think we have to do > the actual dropping in OVS itself but the signaling has to back to OVS > and ultimately the source. We don't want to replicate the entire flow > cache model in OVS. > > A simple start could be to add a new return code for > MTU drops in > the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in > ovs_vport_send() and emit proper ICMPs. That could be interesting. The problem in the past was making sure that ICMPs that are generated fit in the virtual network appropriately - right addresses, etc. 
This requires either spoofing addresses or some additional knowledge about the topology that we don't currently have in the kernel. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: Thomas Graf @ 2016-01-07 17:21 UTC
To: Jesse Gross
Cc: Hannes Frederic Sowa, David Wragg, David Miller, dev, Linux Kernel Network Developers

On 01/07/16 at 08:35am, Jesse Gross wrote:
> On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > A simple start could be to add a new return code for > MTU drops in
> > the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
> > ovs_vport_send() and emit proper ICMPs.
>
> That could be interesting. The problem in the past was making sure
> that ICMPs that are generated fit in the virtual network appropriately
> - right addresses, etc. This requires either spoofing addresses or
> some additional knowledge about the topology that we don't currently
> have in the kernel.

Are you worried about emitting an ICMP with a source which is not
a local host address?

Can't we just use icmp_send() in the context of the inner header and
feed it to the flow table to send it back? It should be the same as
for ip_forward().

skb->dev or skb->dst should lead us to the real MTU which can be
included in the ICMP frag needed. It's a bit tricky because we would
have to know whether it was encapsulated or not and adjust
accordingly.
* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: Hannes Frederic Sowa @ 2016-01-07 17:50 UTC
To: Thomas Graf, Jesse Gross
Cc: David Wragg, David Miller, dev, Linux Kernel Network Developers

On 07.01.2016 18:21, Thomas Graf wrote:
> On 01/07/16 at 08:35am, Jesse Gross wrote:
>> On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote:
>>> A simple start could be to add a new return code for > MTU drops in
>>> the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
>>> ovs_vport_send() and emit proper ICMPs.
>>
>> That could be interesting. The problem in the past was making sure
>> that ICMPs that are generated fit in the virtual network appropriately
>> - right addresses, etc. This requires either spoofing addresses or
>> some additional knowledge about the topology that we don't currently
>> have in the kernel.
>
> Are you worried about emitting an ICMP with a source which is not
> a local host address?

We have uRPF enabled for IPv4 by default on all kernels. Thus if we
generate an IPv4 ICMP packet back with an error message it must have a
source address which the receiving kernel considers valid. Valid means
that sending to the source address would have used the same outgoing
interface the ICMP error came in from.

> Can't we just use icmp_send() in the context of the inner header and
> feed it to the flow table to send it back? It should be the same as
> for ip_forward().

The bridge's ip address often has no valid path as seen from the end
host system receiving the icmp error, because the openvswitch is not
really part of the L3 forwarding chain.

Faking the address from the packet (e.g. using the destination address
of the original packet) will make traceroute go nuts.

> skb->dev or skb->dst should lead us to the real MTU which can be
> included in the ICMP frag needed. It's a bit tricky because we would
> have to know whether it was encapsulated or not and adjust
> accordingly.

Exactly, but this would be the way to go regarding figuring out the
correct mtu.

Normally ethernet devices don't return icmp error messages. E.g. broken
jumbo frame configuration just leads to silent packet loss because the
packet is discarded before a router can handle it. Thus it would be
best in case of local ovs installation if the error is already
transported back to the client application via the network call stack.
This might be very difficult in case we enqueue the packet to a backlog
queue and reschedule softirqs. Probably we need some way of faking
source addresses from bridges now.... :/

Bye,
Hannes
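The validity rule described in the message above (a source address is valid if sending to it would use the interface the packet arrived on) can be sketched as a strict reverse-path check over a toy routing table. This is a userspace model under assumptions: the table layout, `model_route_lookup` and `model_rpf_strict_ok` are hypothetical names for illustration, and the real kernel check consults the FIB rather than a flat array.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a toy IPv4 routing table (hypothetical layout). */
struct model_route {
	uint32_t prefix;  /* network address, host byte order */
	uint8_t plen;     /* prefix length, 0..32 */
	int ifindex;      /* outgoing interface for this prefix */
};

/* Netmask for a prefix length; plen is assumed to be at most 32. */
static uint32_t model_mask(uint8_t plen)
{
	return plen == 0 ? 0 : 0xffffffffu << (32 - plen);
}

/* Longest-prefix match. Returns the outgoing ifindex, or -1 when no
 * route covers addr. */
int model_route_lookup(const struct model_route *tbl, size_t n, uint32_t addr)
{
	int best_if = -1;
	int best_len = -1;

	for (size_t i = 0; i < n; i++) {
		if ((addr & model_mask(tbl[i].plen)) != tbl[i].prefix)
			continue;
		if ((int)tbl[i].plen > best_len) {
			best_len = tbl[i].plen;
			best_if = tbl[i].ifindex;
		}
	}
	return best_if;
}

/* Strict reverse-path check: accept a packet only if a reply to its
 * source address would leave through the interface it arrived on. */
int model_rpf_strict_ok(const struct model_route *tbl, size_t n,
			uint32_t saddr, int in_ifindex)
{
	int out_if = model_route_lookup(tbl, n, saddr);

	return out_if >= 0 && out_if == in_ifindex;
}
```

In this model an ICMP error whose source address routes via a different interface than the one it arrived on is rejected, which is the failure mode the message is concerned about.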
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: Thomas Graf @ 2016-01-07 18:40 UTC
To: Hannes Frederic Sowa
Cc: dev, Linux Kernel Network Developers, David Miller, David Wragg

On 01/07/16 at 06:50pm, Hannes Frederic Sowa wrote:
> On 07.01.2016 18:21, Thomas Graf wrote:
> >On 01/07/16 at 08:35am, Jesse Gross wrote:
> >>On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote:
> >>>A simple start could be to add a new return code for > MTU drops in
> >>>the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
> >>>ovs_vport_send() and emit proper ICMPs.
> >>
> >>That could be interesting. The problem in the past was making sure
> >>that ICMPs that are generated fit in the virtual network appropriately
> >>- right addresses, etc. This requires either spoofing addresses or
> >>some additional knowledge about the topology that we don't currently
> >>have in the kernel.
> >
> >Are you worried about emitting an ICMP with a source which is not
> >a local host address?
>
> We have uRPF enabled for IPv4 by default on all kernels. Thus if we generate
> an IPv4 ICMP packet back with an error message it must have a source address
> which the receiving kernel considers valid. Valid means that sending to the
> source address would have used the same outgoing interface the ICMP error
> came in from.

Agreed. I think this is given though as we would reverse the addresses
as icmp_send() already does:

        saddr = iph->daddr;

> >Can't we just use icmp_send() in the context of the inner header and
> >feed it to the flow table to send it back? It should be the same as
> >for ip_forward().
>
> The bridge's ip address often has no valid path as seen from the end host
> system receiving the icmp error, because the openvswitch is not really part
> of the L3 forwarding chain.

I don't think the IP of the bridge ever comes into play. It shouldn't.
I'm not even sure what could be considered the address of the bridge
;-)

> Faking the address from the packet (e.g. using the destination address of
> the original packet) will make traceroute go nuts.

I think you are worried about an ICMP error from a hop which does not
decrement TTL. I think that's a good point and I think we should only
send an ICMP error if the TTL is decremented in the action list of
the flow for which we have seen a MTU based drop (or TTL=0).

I don't really see a difference between ip_forward(), some
sophisticated tc action or OVS. As soon as they decrement TTL and
perform L3 forwarding, then they should send out ICMP errors to allow
for proper PMTU.

> Normally ethernet devices don't return icmp error messages. E.g. broken
> jumbo frame configuration just leads to silent packet loss because the
> packet is discarded before a router can handle it. Thus it would be best in
> case of local ovs installation if the error is already transported back to
> the client application via the network call stack. This might be very
> difficult in case we enqueue the packet to a backlog queue and reschedule
> softirqs. Probably we need some way of faking source addresses from bridges
> now.... :/

I think the major complication comes from the assumption that OVS is
a bridge. This is not necessarily the case as stated above. If a flow
is doing L3 forwarding, we should send ICMPs as expected from a
router.
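The rule proposed in the message above (answer an MTU drop with an ICMP error only when the flow decrements TTL, with the addresses of the offending packet reversed) could look roughly like the following userspace model. All names here are hypothetical; in particular `struct model_flow` only records whether the flow's action list decrements TTL, and whether the original destination is an acceptable source address is exactly what the thread goes on to debate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inner IPv4 header fields that matter for this sketch. */
struct model_ip_hdr {
	uint32_t saddr;
	uint32_t daddr;
	uint8_t ttl;
};

/* Hypothetical summary of a flow's action list. */
struct model_flow {
	bool dec_ttl; /* true when the actions decrement TTL (L3 forwarding) */
};

/* The fields of the ICMP "fragmentation needed" reply we would build. */
struct model_icmp_reply {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t next_hop_mtu;
};

/* Decide whether an MTU drop on this flow should be answered with an
 * ICMP error. A flow that only bridges (no TTL decrement) stays silent,
 * like an Ethernet device would. Returns true and fills *out when a
 * reply should be generated. */
bool model_build_frag_needed(const struct model_flow *flow,
			     const struct model_ip_hdr *inner,
			     uint16_t mtu,
			     struct model_icmp_reply *out)
{
	if (!flow->dec_ttl)
		return false;

	/* Reverse the addresses of the offending packet, as in the
	 * quoted "saddr = iph->daddr". */
	out->saddr = inner->daddr;
	out->daddr = inner->saddr;
	out->next_hop_mtu = mtu;
	return true;
}
```

A flow that acts as a router gets a reply addressed back to the sender; a pure L2 flow produces none.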
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: Hannes Frederic Sowa @ 2016-01-08 21:29 UTC
To: Thomas Graf
Cc: dev, Linux Kernel Network Developers, David Miller, David Wragg

On 07.01.2016 19:40, Thomas Graf wrote:
> On 01/07/16 at 06:50pm, Hannes Frederic Sowa wrote:
>> On 07.01.2016 18:21, Thomas Graf wrote:
>>> On 01/07/16 at 08:35am, Jesse Gross wrote:
>>>> On Thu, Jan 7, 2016 at 3:49 AM, Thomas Graf <tgraf@suug.ch> wrote:
>>>>> A simple start could be to add a new return code for > MTU drops in
>>>>> the dev_queue_xmit() path and check for NET_XMIT_DROP_MTU in
>>>>> ovs_vport_send() and emit proper ICMPs.
>>>>
>>>> That could be interesting. The problem in the past was making sure
>>>> that ICMPs that are generated fit in the virtual network appropriately
>>>> - right addresses, etc. This requires either spoofing addresses or
>>>> some additional knowledge about the topology that we don't currently
>>>> have in the kernel.
>>>
>>> Are you worried about emitting an ICMP with a source which is not
>>> a local host address?
>>
>> We have uRPF enabled for IPv4 by default on all kernels. Thus if we generate
>> an IPv4 ICMP packet back with an error message it must have a source address
>> which the receiving kernel considers valid. Valid means that sending to the
>> source address would have used the same outgoing interface the ICMP error
>> came in from.
>
> Agreed. I think this is given though as we would reverse the addresses
> as icmp_send() already does:
>
>         saddr = iph->daddr;
>
>>> Can't we just use icmp_send() in the context of the inner header and
>>> feed it to the flow table to send it back? It should be the same as
>>> for ip_forward().
>>
>> The bridge's ip address often has no valid path as seen from the end host
>> system receiving the icmp error, because the openvswitch is not really part
>> of the L3 forwarding chain.
>
> I don't think the IP of the bridge ever comes into play. It shouldn't.
> I'm not even sure what could be considered the address of the bridge
> ;-)

Yes, exactly. :)

>> Faking the address from the packet (e.g. using the destination address of
>> the original packet) will make traceroute go nuts.
>
> I think you are worried about an ICMP error from a hop which does not
> decrement TTL. I think that's a good point and I think we should only
> send an ICMP error if the TTL is decremented in the action list of
> the flow for which we have seen a MTU based drop (or TTL=0).

Also agreed, ovs must act in routing mode but at the same time must
have an IP address on the path. I think this is actually the problem.

Currently we have no way to feed back an error in current
configurations with ovs sitting in another namespace for e.g. docker
containers:

We traverse a net namespace so we drop skb->sk, we don't hold any
socket reference to enqueue a PtB error to the original socket.

We mostly use netif_rx_internal, which queues the packet on the
backlog, so we can't signal an error over the callstack either.

And ovs does not necessarily have an ip address as the first hop of the
namespace or the virtual machine, so it cannot know a valid ip address
with which to reply, no?

> I don't really see a difference between ip_forward(), some
> sophisticated tc action or OVS. As soon as they decrement TTL and
> perform L3 forwarding, then they should send out ICMP errors to allow
> for proper PMTU.

Yes, but depending on the ip configuration, those icmps will then be
dropped in the reverse path filter.

>> Normally ethernet devices don't return icmp error messages. E.g. broken
>> jumbo frame configuration just leads to silent packet loss because the
>> packet is discarded before a router can handle it. Thus it would be best in
>> case of local ovs installation if the error is already transported back to
>> the client application via the network call stack. This might be very
>> difficult in case we enqueue the packet to a backlog queue and reschedule
>> softirqs. Probably we need some way of faking source addresses from bridges
>> now.... :/
>
> I think the major complication comes from the assumption that OVS is
> a bridge. This is not necessarily the case as stated above. If a flow
> is doing L3 forwarding, we should send ICMPs as expected from a
> router.

If we are doing L3 forwarding into a tunnel, this is absolutely correct
and can be easily done.

Bye,
Hannes
* Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: Thomas Graf @ 2016-01-10 10:49 UTC
To: Hannes Frederic Sowa
Cc: Jesse Gross, David Wragg, David Miller, dev, Linux Kernel Network Developers

On 01/08/16 at 10:29pm, Hannes Frederic Sowa wrote:
> On 07.01.2016 19:40, Thomas Graf wrote:
> >I think you are worried about an ICMP error from a hop which does not
> >decrement TTL. I think that's a good point and I think we should only
> >send an ICMP error if the TTL is decremented in the action list of
> >the flow for which we have seen a MTU based drop (or TTL=0).
>
> Also agreed, ovs must act in routing mode but at the same time must have an
> IP address on the path. I think this is actually the problem.
>
> Currently we have no way to feed back an error in current configurations with
> ovs sitting in another namespace for e.g. docker containers:
>
> We traverse a net namespace so we drop skb->sk, we don't hold any socket
> reference to enqueue a PtB error to the original socket.
>
> We mostly use netif_rx_internal, which queues the packet on the backlog, so
> we can't signal an error over the callstack either.
>
> And ovs does not necessarily have an ip address as the first hop of the
> namespace or the virtual machine, so it cannot know a valid ip address with
> which to reply, no?

[your last statement moved up here:]

> If we are doing L3 forwarding into a tunnel, this is absolutely correct and
> can be easily done.

OK, I can see where you are going with this. I was assuming pure
virtual networks due to the context of these patches. So an ICMP is
always UDP encapsulated or directly delivered to a veth or tap which
runs in its own netns or is a VM of which the IP stack operates
exclusively in the context of the virtual network. The stack of the
OVS host never gets to see the actual ICMPs and rp_filter never comes
into play.

In such a context, the virtual router IPs are typically programmed
into the flow table because they are only valid in the virtual network
context, assigning them to the OVS bridge would be wrong as it
represents the underlay context. The virtual router address is known
in the flow context of the virtual network though and can be given to
the icmp_send variant.

Can you elaborate a bit on your container scenario, is it ovs running
in host netns with veth pairs bridging into container netns? Shouldn't
that be solved with the above as the ICMPs sent back in return by the
local OVS are perfectly valid in the IP stack context of the container
netns?
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: David Wragg @ 2016-01-07 0:29 UTC
To: Jesse Gross
Cc: dev, Linux Kernel Network Developers, David Miller

Jesse Gross <jesse@kernel.org> writes:
> On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote:
>> I'm certainly open to suggestions of better ways to solve the problem.
>
> One option is to simply set the MTU on the device from userspace.

If that worked I wouldn't be submitting a patch.

The MTU value of 1500 is not merely the default. It is also the maximum
allowed for a vxlan netdev not associated with an underlying netdev. If
you do e.g. "ip link set dev vxlan-6784 mtu 8950", where vxlan-6784
was created by an ovs vport, it fails with EINVAL.

The first patch of the two submitted removes that limit.

> The reality is that the code you're modifying is compatibility code.
> Maybe we should make this change to preserve the old behavior or old
> callers (although, again, it should not be just for VXLAN). But no new
> features or tunnel types will be supported in this manner.

That's fine. Naturally, the ideal from our point of view is if the
compatibility code is fully compatible, so we don't have to make changes
on our side that involve different code paths for different kernel
versions. That's what my patches are intended to achieve.

But we can live with such changes on our side, as long as there is some
reasonable way to do so. In the case of this vxlan MTU issue, there
doesn't seem to be one.

> New or updated userspace programs should work by simply creating and
> adding tunnel devices to OVS. That won't go through this path at all
> so you're going to need to find another approach in the near future in
> any case.

Ok. But please try to be gentle on the poor souls who have to come up
with a single codebase that works on a range of kernel versions going
back a few years.

David
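The EINVAL behaviour described in the message above, and the effect of lifting the limit, can be modelled in a few lines of userspace C. This is a sketch under assumptions, not the driver code: `model_set_mtu` and the `relaxed` switch are hypothetical, the model covers only a vxlan device without an underlying device, the lower bound of 68 is an assumed IPv4 minimum, and the relaxed ceiling uses the 65465 figure quoted in the cover letter.

```c
#include <errno.h>

#define MODEL_ETH_DATA_LEN   1500  /* conventional Ethernet payload limit */
#define MODEL_MIN_MTU        68    /* assumed lower bound (IPv4 minimum) */
#define MODEL_RELAXED_MAX    65465 /* figure from the cover letter */

/* Validate and apply a new MTU for a vxlan device that has no
 * underlying device.
 *   relaxed == 0: behaviour described in the message (ceiling of 1500)
 *   relaxed != 0: behaviour the first patch aims for (ceiling derived
 *                 from IP packet limits and vxlan overhead)
 * Returns 0 on success or -EINVAL, leaving *mtu untouched on failure. */
int model_set_mtu(int relaxed, unsigned int new_mtu, unsigned int *mtu)
{
	unsigned int max_mtu = relaxed ? MODEL_RELAXED_MAX : MODEL_ETH_DATA_LEN;

	if (new_mtu < MODEL_MIN_MTU || new_mtu > max_mtu)
		return -EINVAL;

	*mtu = new_mtu;
	return 0;
}
```

With `relaxed` set to 0, a request for 8950 is rejected just as the `ip link set ... mtu 8950` example is; with `relaxed` set to 1 it is accepted.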
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: Jesse Gross @ 2016-01-07 1:10 UTC
To: David Wragg
Cc: dev, Linux Kernel Network Developers, David Miller

On Wed, Jan 6, 2016 at 4:29 PM, David Wragg <david@weave.works> wrote:
> Jesse Gross <jesse@kernel.org> writes:
>> On Wed, Jan 6, 2016 at 3:25 PM, David Wragg <david@weave.works> wrote:
>>> I'm certainly open to suggestions of better ways to solve the problem.
>>
>> One option is to simply set the MTU on the device from userspace.
>
> If that worked I wouldn't be submitting a patch.
>
> The MTU value of 1500 is not merely the default. It is also the maximum
> allowed for a vxlan netdev not associated with an underlying netdev. If
> you do e.g. "ip link set dev vxlan-6784 mtu 8950", where vxlan-6784
> was created by an ovs vport, it fails with EINVAL.
>
> The first patch of the two submitted removes that limit.

I saw your first patch and I agree that it fixes a problem. I was
referring to the second patch.

>> The reality is that the code you're modifying is compatibility code.
>> Maybe we should make this change to preserve the old behavior or old
>> callers (although, again, it should not be just for VXLAN). But no new
>> features or tunnel types will be supported in this manner.
>
> That's fine. Naturally, the ideal from our point of view is if the
> compatibility code is fully compatible, so we don't have to make changes
> on our side that involve different code paths for different kernel
> versions. That's what my patches are intended to achieve.

The intention is to be fully backwards compatible with existing
software. If you want to take advantage of future functionality then,
yes, you will need to change to the new model.

I agree that behavior changed with existing compatibility code. I'm
fine with your series as long as you generalize it to all tunnel types
and not just VXLAN. Just be aware that you're going to have to find
another solution long term.

> Ok. But please try to be gentle on the poor souls who have to come up
> with a single codebase that works on a range of kernel versions going
> back a few years.

I maintain a large program that needs to do this myself, so I am aware
that it can be challenging.
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: David Miller @ 2016-01-07 21:47 UTC
To: david
Cc: netdev, dev

From: David Wragg <david@weave.works>
Date: Wed, 06 Jan 2016 23:25:56 +0000

> Considering non-openvswitch scenarios, when using vxlan netdevs
> directly, a vxlan netdev locked to an underlying device supporting jumbo
> frames can use a larger MTU. It's only vxlan netdevs without an
> underlying device that have the limit of 1500 imposed. But why
> shouldn't there be the same flexibility to select an MTU for best
> performance in both cases? Aren't the fragmentation concerns the same?

When there is an underlying device, there are no fragmentation concerns
and PMTU will function properly because the MTU will be set
accurately.

> True. But in what sense is 1500 accurate? Uses/users of the kernel
> openvswitch code have always had to get this right, making sure that
> the MTU set on a vxlan overlay network conforms to the underlying
> network paths involved.

"right" is a subjective thing.

Here, once again, the poorly thought out amount of flexibility
openvswitch provides is going to have a serious negative consequence
for our implementation.

I'm really tired of seeing this happen over and over again.
Openvswitch's growth and the expansion of its feature set was
extremely poorly managed.
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: David Wragg @ 2016-01-07 23:42 UTC
To: David Miller
Cc: netdev, dev

David Miller <davem@davemloft.net> writes:
>> Considering non-openvswitch scenarios, when using vxlan netdevs
>> directly, a vxlan netdev locked to an underlying device supporting jumbo
>> frames can use a larger MTU. It's only vxlan netdevs without an
>> underlying device that have the limit of 1500 imposed. But why
>> shouldn't there be the same flexibility to select an MTU for best
>> performance in both cases? Aren't the fragmentation concerns the same?
>
> When there is an underlying device, there are no fragmentation concerns
> and PMTU will function properly because the MTU will be set
> accurately.

I can see how that works in the case where the vxlan endpoints are on
the same subnet of the underlying network. But in the general case, the
"accurate" vxlan MTU to avoid fragmentation concerns should correspond
to the path MTU between the endpoints. How does knowing an underlying
device help with that?

It's reasonable to calculate a default MTU for the vxlan device based on
an underlying device, if one is specified. But unless the endpoints are
on the same subnet, that's just a guess.

(In the case of our application, we have openvswitch configured to set
DF on the vxlan packets. So we know we are not getting inadvertent
fragmentation.)

>> True. But in what sense is 1500 accurate? Uses/users of the kernel
>> openvswitch code have always had to get this right, making sure that
>> the MTU set on a vxlan overlay network conforms to the underlying
>> network paths involved.
>
> "right" is a subjective thing.
>
> Here, once again, the poorly thought out amount of flexibility
> openvswitch provides is going to have a serious negative consequence
> for our implementation.
>
> I'm really tired of seeing this happen over and over again.
> Openvswitch's growth and the expansion of its feature set was
> extremely poorly managed.

I'm sorry that there is some angst around the openvswitch kernel
subsystem. I wasn't involved in its development, so I have no strong
opinion on those issues. But we use it in our application, and we'd
like to find a way to restore the ability to perform vxlan encapsulation
of packets greater than 1500 bytes from openvswitch, as we could prior
to 4.3. Is there another path to that same goal that is less
contentious than my proposal?

I'm willing to follow up on Jesse's request to look into the other
tunnel types too, but at the moment I'm wondering what the chances are
that the resulting submission would get accepted.

David
* Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: David Miller @ 2016-01-08 2:48 UTC
To: david
Cc: netdev, dev

From: David Wragg <david@weave.works>
Date: Thu, 07 Jan 2016 23:42:52 +0000

> I'm willing to follow up on Jesse's request to look into the other
> tunnel types too, but at the moment I'm wondering what the chances are
> that the resulting submission would get accepted.

I'm ok with these fixes if you look into Jesse's feedback.

My basic gripe is that openvswitch can act as an L3 forwarding agent,
but does not contain the necessary state (f.e. topology information in
the form of a full FIB table) in order to behave like it should.
* [PATCH net 2/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: David Wragg @ 2016-01-06 13:33 UTC
To: netdev, dev
Cc: David Wragg

Prior to 4.3, vxlan vports could transmit vxlan packets of any size,
constrained only by the ability to transmit the resulting UDP packets.
4.3 introduced vxlan netdevs corresponding to vxlan vports. These
netdevs have an MTU, which limits the size of a packet that can be
successfully vxlan-encapsulated. The default value for this MTU is
1500, which is awkwardly small, and leads to a conspicuous change in
behaviour for userspace.

This sets the MTU on openvswitch-created vxlan devices to be 65465
(the maximum IP packet size minus the vxlan-on-IPv6 overhead),
effectively restoring the behaviour prior to 4.3. Although the
vxlan_config struct already had a mtu field for this,
vxlan_dev_configure mostly ignored it; that is also addressed here.

Signed-off-by: David Wragg <david@weave.works>
---
 drivers/net/vxlan.c           | 11 ++++++++---
 net/openvswitch/vport-vxlan.c |  2 ++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 96d1c55..a15d300 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2764,6 +2764,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
 	int err;
 	bool use_ipv6 = false;
 	__be16 default_port = vxlan->cfg.dst_port;
+	struct net_device *lowerdev = NULL;
 
 	vxlan->net = src_net;
 
@@ -2784,9 +2785,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
 	}
 
 	if (conf->remote_ifindex) {
-		struct net_device *lowerdev
-			 = __dev_get_by_index(src_net, conf->remote_ifindex);
-
+		lowerdev = __dev_get_by_index(src_net, conf->remote_ifindex);
 		dst->remote_ifindex = conf->remote_ifindex;
 
 		if (!lowerdev) {
@@ -2810,6 +2809,12 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
 		needed_headroom = lowerdev->hard_header_len;
 	}
 
+	if (conf->mtu) {
+		err = __vxlan_change_mtu(dev, lowerdev, dst, conf->mtu);
+		if (err)
+			return err;
+	}
+
 	if (use_ipv6 || conf->flags & VXLAN_F_COLLECT_METADATA)
 		needed_headroom += VXLAN6_HEADROOM;
 	else
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index 1605691..a97279f 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -91,6 +91,8 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
 	struct vxlan_config conf = {
 		.no_share = true,
 		.flags = VXLAN_F_COLLECT_METADATA,
+		/* The maximum VXLAN payload to fit in an IPv6 packet */
+		.mtu = 65535 - VXLAN6_HEADROOM,
 	};
 
 	if (!options) {
-- 
2.5.0
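The 65465 figure in the patch above can be reproduced with a short worked calculation. The sketch below assumes the encapsulation overhead is the outer IP header plus the UDP header, the VXLAN header and the inner Ethernet header (70 bytes over IPv6, 50 over IPv4), which is consistent with 65535 - VXLAN6_HEADROOM = 65465; the `model_*` names are hypothetical and not taken from the driver.

```c
/* Header sizes in bytes used by this model. */
enum {
	MODEL_IPV4_HLEN  = 20,
	MODEL_IPV6_HLEN  = 40,
	MODEL_UDP_HLEN   = 8,
	MODEL_VXLAN_HLEN = 8,
	MODEL_ETH_HLEN   = 14, /* inner Ethernet header carried in the tunnel */
	MODEL_IP_MAX     = 65535,
};

/* Bytes of overhead that vxlan encapsulation adds to an inner packet. */
unsigned int model_vxlan_headroom(int ipv6)
{
	unsigned int ip_hlen = ipv6 ? MODEL_IPV6_HLEN : MODEL_IPV4_HLEN;

	return ip_hlen + MODEL_UDP_HLEN + MODEL_VXLAN_HLEN + MODEL_ETH_HLEN;
}

/* Largest inner MTU that still fits once the overhead is added, given
 * a limit on the outer packet size (IP maximum or an underlay MTU).
 * Returns 0 when the limit is too small to carry any payload. */
unsigned int model_vxlan_max_mtu(unsigned int outer_limit, int ipv6)
{
	unsigned int headroom = model_vxlan_headroom(ipv6);

	return outer_limit > headroom ? outer_limit - headroom : 0;
}
```

Calling `model_vxlan_max_mtu(MODEL_IP_MAX, 1)` gives the value used in the patch, and the same function applied to a 9000-byte IPv4 underlay gives 8950.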
* Re: [PATCH net 2/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: Thomas Graf @ 2016-01-07 11:36 UTC
To: David Wragg
Cc: dev, netdev

On 01/06/16 at 01:33pm, David Wragg wrote:
> Prior to 4.3, vxlan vports could transmit vxlan packets of any size,
> constrained only by the ability to transmit the resulting UDP packets.
> 4.3 introduced vxlan netdevs corresponding to vxlan vports. These
> netdevs have an MTU, which limits the size of a packet that can be
> successfully vxlan-encapsulated. The default value for this MTU is
> 1500, which is awkwardly small, and leads to a conspicuous change in
> behaviour for userspace.
>
> This sets the MTU on openvswitch-created vxlan devices to be 65465
> (the maximum IP packet size minus the vxlan-on-IPv6 overhead),
> effectively restoring the behaviour prior to 4.3. Although the
> vxlan_config struct already had a mtu field for this,
> vxlan_dev_configure mostly ignored it; that is also addressed here.

I agree with Jesse that this is fine. The tunnel net_device is a logical
device anyway and the real MTU is not known at this point until the
underlay route has been resolved.

It is really a model we should move away from though. Unless the
endpoints negotiate a proper MTU, the underlay can't possibly cope
with this correctly without major performance problems due to
fragmentation.

> @@ -2810,6 +2809,12 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
>  		needed_headroom = lowerdev->hard_header_len;
>  	}
>
> +	if (conf->mtu) {
> +		err = __vxlan_change_mtu(dev, lowerdev, dst, conf->mtu);

OK, I see the reason for the indirection in the first patch. Never mind.