All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v3 0/3] Add VRF support for VXLAN underlay
@ 2018-11-20 14:23 Alexis Bauvin
  2018-11-20 14:23 ` [RFC v3 1/3] udp_tunnel: add config option to bind to a device Alexis Bauvin
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-20 14:23 UTC (permalink / raw)
  To: dsa, roopa; +Cc: netdev, abauvin, akherbouche

v2 -> v3:
- fix build when CONFIG_NET_IPV6 is off
- fix build "unused l3mdev_master_upper_ifindex_by_index" build error with some
  configs

v1 -> v2:
- move vxlan_get_l3mdev from vxlan driver to l3mdev driver as
  l3mdev_master_upper_ifindex_by_index
- vxlan: rename variables named l3mdev_ifindex to ifindex

v0 -> v1:
- fix typos

We are trying to isolate the VXLAN traffic from different VMs with VRF as shown
in the schemas below:

+-------------------------+   +----------------------------+
| +----------+            |   |     +------------+         |
| |          |            |   |     |            |         |
| | tap-red  |            |   |     |  tap-blue  |         |
| |          |            |   |     |            |         |
| +----+-----+            |   |     +-----+------+         |
|      |                  |   |           |                |
|      |                  |   |           |                |
| +----+---+              |   |      +----+----+           |
| |        |              |   |      |         |           |
| | br-red |              |   |      | br-blue |           |
| |        |              |   |      |         |           |
| +----+---+              |   |      +----+----+           |
|      |                  |   |           |                |
|      |                  |   |           |                |
|      |                  |   |           |                |
| +----+--------+         |   |     +--------------+       |
| |             |         |   |     |              |       |
| |  vxlan-red  |         |   |     |  vxlan-blue  |       |
| |             |         |   |     |              |       |
| +------+------+         |   |     +-------+------+       |
|        |                |   |             |              |
|        |           VRF  |   |             |          VRF |
|        |           red  |   |             |         blue |
+-------------------------+   +----------------------------+
         |                                  |
         |                                  |
 +---------------------------------------------------------+
 |       |                                  |              |
 |       |                                  |              |
 |       |         +--------------+         |              |
 |       |         |              |         |              |
 |       +---------+  eth0.2030   +---------+              |
 |                 |  10.0.0.1/24 |                        |
 |                 +-----+--------+                    VRF |
 |                       |                            green|
 +---------------------------------------------------------+
                         |
                         |
                    +----+---+
                    |        |
                    |  eth0  |
                    |        |
                    +--------+


iproute2 commands to reproduce the setup:

ip link add green type vrf table 1
ip link set green up
ip link add eth0.2030 link eth0 type vlan id 2030
ip link set eth0.2030 master green
ip addr add 10.0.0.1/24 dev eth0.2030
ip link set eth0.2030 up

ip link add blue type vrf table 2
ip link set blue up
ip link add br-blue type bridge
ip link set br-blue master blue
ip link set br-blue up
ip link add vxlan-blue type vxlan id 2 local 10.0.0.1 dev eth0.2030 \
 port 4789
ip link set vxlan-blue master br-blue
ip link set vxlan-blue up
ip link set tap-blue master br-blue
ip link set tap-blue up

ip link add red type vrf table 3
ip link set red up
ip link add br-red type bridge
ip link set br-red master red
ip link set br-red up
ip link add vxlan-red type vxlan id 3 local 10.0.0.1 dev eth0.2030 \
 port 4789
ip link set vxlan-red master br-red
ip link set vxlan-red up
ip link set tap-red master br-red
ip link set tap-red up

We faced some issue in the datapath, here are the details:

* Egress traffic:
The vxlan packets are sent directly to the default VRF because it's where the
socket is bound, therefore the traffic has a default route via eth0. the
workarount is to force this traffic to VRF green with ip rules.

* Ingress traffic:
When receiving the traffic on eth0.2030 the vxlan socket is unreachable from
VRF green. The workaround is to enable *udp_l3mdev_accept* sysctl, but
this breaks isolation between overlay and underlay: packets sent from
blue or red by e.g. a guest VM will be accepted by the socket, allowing
injection of VXLAN packets from the overlay.

This patch serie fixes the issues describe above by allowing VXLAN socket to be
bound to a specific VRF device therefore looking up in the correct table.

Alexis Bauvin (3):
  udp_tunnel: add config option to bind to a device
  vxlan: add support for underlay in non-default VRF
  vxlan: handle underlay VRF changes

 drivers/net/vxlan.c       | 126 +++++++++++++++++++++++++++++++++++---
 include/net/l3mdev.h      |  22 +++++++
 include/net/udp_tunnel.h  |   1 +
 net/ipv4/udp_tunnel.c     |  10 +++
 net/ipv6/ip6_udp_tunnel.c |   9 +++
 net/l3mdev/l3mdev.c       |  18 ++++++
 6 files changed, 178 insertions(+), 8 deletions(-)

-- 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC v3 1/3] udp_tunnel: add config option to bind to a device
  2018-11-20 14:23 [RFC v3 0/3] Add VRF support for VXLAN underlay Alexis Bauvin
@ 2018-11-20 14:23 ` Alexis Bauvin
  2018-11-20 14:23 ` [RFC v3 2/3] vxlan: add support for underlay in non-default VRF Alexis Bauvin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-20 14:23 UTC (permalink / raw)
  To: dsa, roopa; +Cc: netdev, abauvin, akherbouche

UDP tunnel sockets are always opened unbound to a specific device. This
patch allow the socket to be bound on a custom device, which
incidentally makes UDP tunnels VRF-aware if binding to an l3mdev.

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
---
 include/net/udp_tunnel.h  |  1 +
 net/ipv4/udp_tunnel.c     | 10 ++++++++++
 net/ipv6/ip6_udp_tunnel.c |  9 +++++++++
 3 files changed, 20 insertions(+)

diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index fe680ab6b15a..9f7970d010f9 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -30,6 +30,7 @@ struct udp_port_cfg {
 
 	__be16			local_udp_port;
 	__be16			peer_udp_port;
+	int			bind_ifindex;
 	unsigned int		use_udp_checksums:1,
 				use_udp6_tx_checksums:1,
 				use_udp6_rx_checksums:1,
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 6539ff15e9a3..dc68e15a4f72 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -20,6 +20,16 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg *cfg,
 	if (err < 0)
 		goto error;
 
+	if (cfg->bind_ifindex) {
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(net, cfg->bind_ifindex);
+		err = kernel_setsockopt(sock, SOL_SOCKET, SO_BINDTODEVICE,
+					dev->name, strlen(dev->name) + 1);
+		if (err < 0)
+			goto error;
+	}
+
 	udp_addr.sin_family = AF_INET;
 	udp_addr.sin_addr = cfg->local_ip;
 	udp_addr.sin_port = cfg->local_udp_port;
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index b283f293ee4a..fc3811ef8787 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -31,6 +31,15 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 		if (err < 0)
 			goto error;
 	}
+	if (cfg->bind_ifindex) {
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(net, cfg->bind_ifindex);
+		err = kernel_setsockopt(sock, SOL_SOCKET, SO_BINDTODEVICE,
+					dev->name, strlen(dev->name) + 1);
+		if (err < 0)
+			goto error;
+	}
 
 	udp6_addr.sin6_family = AF_INET6;
 	memcpy(&udp6_addr.sin6_addr, &cfg->local_ip6,
-- 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v3 2/3] vxlan: add support for underlay in non-default VRF
  2018-11-20 14:23 [RFC v3 0/3] Add VRF support for VXLAN underlay Alexis Bauvin
  2018-11-20 14:23 ` [RFC v3 1/3] udp_tunnel: add config option to bind to a device Alexis Bauvin
@ 2018-11-20 14:23 ` Alexis Bauvin
  2018-11-20 14:57   ` David Ahern
  2018-11-20 15:25   ` Roopa Prabhu
  2018-11-20 14:23 ` [RFC v3 3/3] vxlan: handle underlay VRF changes Alexis Bauvin
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-20 14:23 UTC (permalink / raw)
  To: dsa, roopa; +Cc: netdev, abauvin, akherbouche

Creating a VXLAN device with is underlay in the non-default VRF makes
egress route lookup fail or incorrect since it will resolve in the
default VRF, and ingress fail because the socket listens in the default
VRF.

This patch binds the underlying UDP tunnel socket to the l3mdev of the
lower device of the VXLAN device. This will listen in the proper VRF and
output traffic from said l3mdev, matching l3mdev routing rules and
looking up the correct routing table.

When the VXLAN device does not have a lower device, or the lower device
is in the default VRF, the socket will not be bound to any interface,
keeping the previous behaviour.

The underlay l3mdev is deduced from the VXLAN lower device
(IFLA_VXLAN_LINK).

The l3mdev_master_upper_ifindex_by_index function has been added to
l3mdev. Its goal is to fetch the effective l3mdev of an interface which
is not a direct slave of said l3mdev. It handles the following example,
properly resolving the l3mdev of eth0 to vrf-blue:

+----------+                         +---------+
|          |                         |         |
| vrf-blue |                         | vrf-red |
|          |                         |         |
+----+-----+                         +----+----+
     |                                    |
     |                                    |
+----+-----+                         +----+----+
|          |                         |         |
| br-blue  |                         | br-red  |
|          |                         |         |
+----+-----+                         +---+-+---+
     |                                   | |
     |                             +-----+ +-----+
     |                             |             |
+----+-----+                +------+----+   +----+----+
|          |  lower device  |           |   |         |
|   eth0   | <- - - - - - - | vxlan-red |   | tap-red | (... more taps)
|          |                |           |   |         |
+----------+                +-----------+   +---------+

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
---
 drivers/net/vxlan.c  | 32 ++++++++++++++++++++++++--------
 include/net/l3mdev.h | 22 ++++++++++++++++++++++
 net/l3mdev/l3mdev.c  | 18 ++++++++++++++++++
 3 files changed, 64 insertions(+), 8 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 27bd586b94b0..a3de08122269 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -212,7 +212,7 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
  * and enabled unshareable flags.
  */
 static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
-					  __be16 port, u32 flags)
+					  __be16 port, u32 flags, int ifindex)
 {
 	struct vxlan_sock *vs;
 
@@ -221,7 +221,8 @@ static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
 	hlist_for_each_entry_rcu(vs, vs_head(net, port), hlist) {
 		if (inet_sk(vs->sock->sk)->inet_sport == port &&
 		    vxlan_get_sk_family(vs) == family &&
-		    vs->flags == flags)
+		    vs->flags == flags &&
+		    vs->sock->sk->sk_bound_dev_if == ifindex)
 			return vs;
 	}
 	return NULL;
@@ -261,7 +262,7 @@ static struct vxlan_dev *vxlan_find_vni(struct net *net, int ifindex,
 {
 	struct vxlan_sock *vs;
 
-	vs = vxlan_find_sock(net, family, port, flags);
+	vs = vxlan_find_sock(net, family, port, flags, ifindex);
 	if (!vs)
 		return NULL;
 
@@ -2172,6 +2173,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		struct rtable *rt;
 		__be16 df = 0;
 
+		if (!ifindex)
+			ifindex = sock4->sock->sk->sk_bound_dev_if;
+
 		rt = vxlan_get_route(vxlan, dev, sock4, skb, ifindex, tos,
 				     dst->sin.sin_addr.s_addr,
 				     &local_ip.sin.sin_addr.s_addr,
@@ -2210,6 +2214,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 	} else {
 		struct vxlan_sock *sock6 = rcu_dereference(vxlan->vn6_sock);
 
+		if (!ifindex)
+			ifindex = sock6->sock->sk->sk_bound_dev_if;
+
 		ndst = vxlan6_get_route(vxlan, dev, sock6, skb, ifindex, tos,
 					label, &dst->sin6.sin6_addr,
 					&local_ip.sin6.sin6_addr,
@@ -2813,7 +2820,7 @@ static const struct ethtool_ops vxlan_ethtool_ops = {
 };
 
 static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
-					__be16 port, u32 flags)
+					__be16 port, u32 flags, int ifindex)
 {
 	struct socket *sock;
 	struct udp_port_cfg udp_conf;
@@ -2831,6 +2838,7 @@ static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
 	}
 
 	udp_conf.local_udp_port = port;
+	udp_conf.bind_ifindex = ifindex;
 
 	/* Open UDP socket */
 	err = udp_sock_create(net, &udp_conf, &sock);
@@ -2842,7 +2850,8 @@ static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
 
 /* Create new listen socket if needed */
 static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,
-					      __be16 port, u32 flags)
+					      __be16 port, u32 flags,
+					      int ifindex)
 {
 	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
 	struct vxlan_sock *vs;
@@ -2857,7 +2866,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,
 	for (h = 0; h < VNI_HASH_SIZE; ++h)
 		INIT_HLIST_HEAD(&vs->vni_list[h]);
 
-	sock = vxlan_create_sock(net, ipv6, port, flags);
+	sock = vxlan_create_sock(net, ipv6, port, flags, ifindex);
 	if (IS_ERR(sock)) {
 		kfree(vs);
 		return ERR_CAST(sock);
@@ -2894,11 +2903,17 @@ static int __vxlan_sock_add(struct vxlan_dev *vxlan, bool ipv6)
 	struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
 	struct vxlan_sock *vs = NULL;
 	struct vxlan_dev_node *node;
+	int l3mdev_index;
+
+	l3mdev_index =
+		l3mdev_master_upper_ifindex_by_index(vxlan->net,
+						     vxlan->cfg.remote_ifindex);
 
 	if (!vxlan->cfg.no_share) {
 		spin_lock(&vn->sock_lock);
 		vs = vxlan_find_sock(vxlan->net, ipv6 ? AF_INET6 : AF_INET,
-				     vxlan->cfg.dst_port, vxlan->cfg.flags);
+				     vxlan->cfg.dst_port, vxlan->cfg.flags,
+				     l3mdev_index);
 		if (vs && !refcount_inc_not_zero(&vs->refcnt)) {
 			spin_unlock(&vn->sock_lock);
 			return -EBUSY;
@@ -2907,7 +2922,8 @@ static int __vxlan_sock_add(struct vxlan_dev *vxlan, bool ipv6)
 	}
 	if (!vs)
 		vs = vxlan_socket_create(vxlan->net, ipv6,
-					 vxlan->cfg.dst_port, vxlan->cfg.flags);
+					 vxlan->cfg.dst_port, vxlan->cfg.flags,
+					 l3mdev_index);
 	if (IS_ERR(vs))
 		return PTR_ERR(vs);
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
index 3832099289c5..78fa0ac4613c 100644
--- a/include/net/l3mdev.h
+++ b/include/net/l3mdev.h
@@ -101,6 +101,17 @@ struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
 	return master;
 }
 
+int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex);
+static inline
+int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
+{
+	rcu_read_lock();
+	ifindex = l3mdev_master_upper_ifindex_by_index_rcu(net, ifindex);
+	rcu_read_unlock();
+
+	return ifindex;
+}
+
 u32 l3mdev_fib_table_rcu(const struct net_device *dev);
 u32 l3mdev_fib_table_by_index(struct net *net, int ifindex);
 static inline u32 l3mdev_fib_table(const struct net_device *dev)
@@ -207,6 +218,17 @@ static inline int l3mdev_master_ifindex_by_index(struct net *net, int ifindex)
 	return 0;
 }
 
+static inline
+int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
+{
+	return 0;
+}
+static inline
+int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
+{
+	return 0;
+}
+
 static inline
 struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
 {
diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
index 8da86ceca33d..309dee76724e 100644
--- a/net/l3mdev/l3mdev.c
+++ b/net/l3mdev/l3mdev.c
@@ -46,6 +46,24 @@ int l3mdev_master_ifindex_rcu(const struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(l3mdev_master_ifindex_rcu);
 
+/**
+ *	l3mdev_master_upper_ifindex_by_index - get index of upper l3 master
+ *					       device
+ *	@net: network namespace for device index lookup
+ *	@ifindex: targeted interface
+ */
+int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
+{
+	struct net_device *dev;
+
+	dev = dev_get_by_index_rcu(net, ifindex);
+	while (dev && !netif_is_l3_master(dev))
+		dev = netdev_master_upper_dev_get(dev);
+
+	return dev ? dev->ifindex : 0;
+}
+EXPORT_SYMBOL_GPL(l3mdev_master_upper_ifindex_by_index_rcu);
+
 /**
  *	l3mdev_fib_table - get FIB table id associated with an L3
  *                             master interface
-- 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 14:23 [RFC v3 0/3] Add VRF support for VXLAN underlay Alexis Bauvin
  2018-11-20 14:23 ` [RFC v3 1/3] udp_tunnel: add config option to bind to a device Alexis Bauvin
  2018-11-20 14:23 ` [RFC v3 2/3] vxlan: add support for underlay in non-default VRF Alexis Bauvin
@ 2018-11-20 14:23 ` Alexis Bauvin
  2018-11-20 15:04   ` David Ahern
  2018-11-20 14:48 ` [RFC v3 0/3] Add VRF support for VXLAN underlay David Ahern
  2018-11-20 21:45 ` David Ahern
  4 siblings, 1 reply; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-20 14:23 UTC (permalink / raw)
  To: dsa, roopa; +Cc: netdev, abauvin, akherbouche

When underlay VRF changes, either because the lower device itself changed,
or its VRF changed, this patch releases the current socket of the VXLAN
device and recreates another one in the right VRF. This allows for
on-the-fly change of the underlay VRF of a VXLAN device.

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
---
 drivers/net/vxlan.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index a3de08122269..1e6ccad6df6a 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -208,6 +208,18 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
 	return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
 }
 
+static int vxlan_is_in_l3mdev_chain(struct net_device *chain,
+				    struct net_device *dev)
+{
+	if (!chain)
+		return 0;
+
+	if (chain->ifindex == dev->ifindex)
+		return 1;
+	return vxlan_is_in_l3mdev_chain(netdev_master_upper_dev_get(chain),
+					dev);
+}
+
 /* Find VXLAN socket based on network namespace, address family and UDP port
  * and enabled unshareable flags.
  */
@@ -3720,6 +3732,33 @@ struct net_device *vxlan_dev_create(struct net *net, const char *name,
 }
 EXPORT_SYMBOL_GPL(vxlan_dev_create);
 
+static int vxlan_reopen(struct vxlan_net *vn, struct vxlan_dev *vxlan)
+{
+	int ret = 0;
+
+	if (vxlan_addr_multicast(&vxlan->default_dst.remote_ip) &&
+	    !vxlan_group_used(vn, vxlan))
+		ret = vxlan_igmp_leave(vxlan);
+	vxlan_sock_release(vxlan);
+
+	if (ret < 0)
+		return ret;
+
+	ret = vxlan_sock_add(vxlan);
+	if (ret < 0)
+		return ret;
+
+	if (vxlan_addr_multicast(&vxlan->default_dst.remote_ip)) {
+		ret = vxlan_igmp_join(vxlan);
+		if (ret == -EADDRINUSE)
+			ret = 0;
+		if (ret)
+			vxlan_sock_release(vxlan);
+	}
+
+	return ret;
+}
+
 static void vxlan_handle_lowerdev_unregister(struct vxlan_net *vn,
 					     struct net_device *dev)
 {
@@ -3742,6 +3781,55 @@ static void vxlan_handle_lowerdev_unregister(struct vxlan_net *vn,
 	unregister_netdevice_many(&list_kill);
 }
 
+static void vxlan_handle_change_upper(struct vxlan_net *vn,
+				      struct net_device *dev)
+{
+	struct vxlan_dev *vxlan, *next;
+
+	list_for_each_entry_safe(vxlan, next, &vn->vxlan_list, next) {
+		struct net_device *lower;
+		int err;
+
+		lower = __dev_get_by_index(vxlan->net,
+					   vxlan->cfg.remote_ifindex);
+		if (!vxlan_is_in_l3mdev_chain(lower, dev))
+			continue;
+
+		err = vxlan_reopen(vn, vxlan);
+		if (err < 0)
+			netdev_err(vxlan->dev, "Failed to reopen socket: %d\n",
+				   err);
+	}
+}
+
+static void vxlan_handle_change(struct vxlan_net *vn, struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_sock *sock;
+	int l3mdev_index;
+
+#if IS_ENABLED(CONFIG_IPV6)
+	bool metadata = vxlan->cfg.flags & VXLAN_F_COLLECT_METADATA;
+	bool ipv6 = vxlan->cfg.flags & VXLAN_F_IPV6 || metadata;
+
+	sock = ipv6 ? rcu_dereference(vxlan->vn6_sock)
+		    : rcu_dereference(vxlan->vn4_sock);
+#else
+	sock = rcu_dereference(vxlan->vn4_sock);
+#endif
+
+	l3mdev_index =
+		l3mdev_master_upper_ifindex_by_index(vxlan->net,
+						     vxlan->cfg.remote_ifindex);
+	if (sock->sock->sk->sk_bound_dev_if != l3mdev_index) {
+		int ret = vxlan_reopen(vn, vxlan);
+
+		if (ret < 0)
+			netdev_err(vxlan->dev, "Failed to reopen socket: %d\n",
+				   ret);
+	}
+}
+
 static int vxlan_netdevice_event(struct notifier_block *unused,
 				 unsigned long event, void *ptr)
 {
@@ -3756,6 +3844,12 @@ static int vxlan_netdevice_event(struct notifier_block *unused,
 	} else if (event == NETDEV_UDP_TUNNEL_PUSH_INFO ||
 		   event == NETDEV_UDP_TUNNEL_DROP_INFO) {
 		vxlan_offload_rx_ports(dev, event == NETDEV_UDP_TUNNEL_PUSH_INFO);
+	} else if (event == NETDEV_CHANGEUPPER) {
+		vxlan_handle_change_upper(vn, dev);
+	} else if (event == NETDEV_CHANGE) {
+		if (dev->rtnl_link_ops &&
+		    !strcmp(dev->rtnl_link_ops->kind, vxlan_link_ops.kind))
+			vxlan_handle_change(vn, dev);
 	}
 
 	return NOTIFY_DONE;
-- 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC v3 0/3] Add VRF support for VXLAN underlay
  2018-11-20 14:23 [RFC v3 0/3] Add VRF support for VXLAN underlay Alexis Bauvin
                   ` (2 preceding siblings ...)
  2018-11-20 14:23 ` [RFC v3 3/3] vxlan: handle underlay VRF changes Alexis Bauvin
@ 2018-11-20 14:48 ` David Ahern
  2018-11-20 21:45 ` David Ahern
  4 siblings, 0 replies; 26+ messages in thread
From: David Ahern @ 2018-11-20 14:48 UTC (permalink / raw)
  To: Alexis Bauvin, roopa; +Cc: netdev, akherbouche

On 11/20/18 7:23 AM, Alexis Bauvin wrote:
> We are trying to isolate the VXLAN traffic from different VMs with VRF as shown
> in the schemas below:
> 
> +-------------------------+   +----------------------------+
> | +----------+            |   |     +------------+         |
> | |          |            |   |     |            |         |
> | | tap-red  |            |   |     |  tap-blue  |         |
> | |          |            |   |     |            |         |
> | +----+-----+            |   |     +-----+------+         |
> |      |                  |   |           |                |
> |      |                  |   |           |                |
> | +----+---+              |   |      +----+----+           |
> | |        |              |   |      |         |           |
> | | br-red |              |   |      | br-blue |           |
> | |        |              |   |      |         |           |
> | +----+---+              |   |      +----+----+           |
> |      |                  |   |           |                |
> |      |                  |   |           |                |
> |      |                  |   |           |                |
> | +----+--------+         |   |     +--------------+       |
> | |             |         |   |     |              |       |
> | |  vxlan-red  |         |   |     |  vxlan-blue  |       |
> | |             |         |   |     |              |       |
> | +------+------+         |   |     +-------+------+       |
> |        |                |   |             |              |
> |        |           VRF  |   |             |          VRF |
> |        |           red  |   |             |         blue |
> +-------------------------+   +----------------------------+
>          |                                  |
>          |                                  |
>  +---------------------------------------------------------+
>  |       |                                  |              |
>  |       |                                  |              |
>  |       |         +--------------+         |              |
>  |       |         |              |         |              |
>  |       +---------+  eth0.2030   +---------+              |
>  |                 |  10.0.0.1/24 |                        |
>  |                 +-----+--------+                    VRF |
>  |                       |                            green|
>  +---------------------------------------------------------+
>                          |
>                          |
>                     +----+---+
>                     |        |
>                     |  eth0  |
>                     |        |
>                     +--------+
> 
> 
> iproute2 commands to reproduce the setup:

Thanks for putting the diagram and commands in the cover letter. Really
helps to understood (and test) what you are wanting to do.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 2/3] vxlan: add support for underlay in non-default VRF
  2018-11-20 14:23 ` [RFC v3 2/3] vxlan: add support for underlay in non-default VRF Alexis Bauvin
@ 2018-11-20 14:57   ` David Ahern
  2018-11-20 16:11     ` Alexis Bauvin
  2018-11-20 15:25   ` Roopa Prabhu
  1 sibling, 1 reply; 26+ messages in thread
From: David Ahern @ 2018-11-20 14:57 UTC (permalink / raw)
  To: Alexis Bauvin, roopa; +Cc: netdev, akherbouche

On 11/20/18 7:23 AM, Alexis Bauvin wrote:
> Creating a VXLAN device with is underlay in the non-default VRF makes
> egress route lookup fail or incorrect since it will resolve in the
> default VRF, and ingress fail because the socket listens in the default
> VRF.
> 
> This patch binds the underlying UDP tunnel socket to the l3mdev of the
> lower device of the VXLAN device. This will listen in the proper VRF and
> output traffic from said l3mdev, matching l3mdev routing rules and
> looking up the correct routing table.
> 
> When the VXLAN device does not have a lower device, or the lower device
> is in the default VRF, the socket will not be bound to any interface,
> keeping the previous behaviour.
> 
> The underlay l3mdev is deduced from the VXLAN lower device
> (IFLA_VXLAN_LINK).
> 
> The l3mdev_master_upper_ifindex_by_index function has been added to
> l3mdev. Its goal is to fetch the effective l3mdev of an interface which
> is not a direct slave of said l3mdev. It handles the following example,
> properly resolving the l3mdev of eth0 to vrf-blue:
> 
> +----------+                         +---------+
> |          |                         |         |
> | vrf-blue |                         | vrf-red |
> |          |                         |         |
> +----+-----+                         +----+----+
>      |                                    |
>      |                                    |
> +----+-----+                         +----+----+
> |          |                         |         |
> | br-blue  |                         | br-red  |
> |          |                         |         |
> +----+-----+                         +---+-+---+
>      |                                   | |
>      |                             +-----+ +-----+
>      |                             |             |
> +----+-----+                +------+----+   +----+----+
> |          |  lower device  |           |   |         |
> |   eth0   | <- - - - - - - | vxlan-red |   | tap-red | (... more taps)
> |          |                |           |   |         |
> +----------+                +-----------+   +---------+

same here. Very helpful diagram.


> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 27bd586b94b0..a3de08122269 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c

The vxlan changes look ok to me. It would be good for someone who know
that code better than I do to review it.

Move the following l3mdev changes to a separate patch - introduce infra
changes separate from their use:

> diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
> index 3832099289c5..78fa0ac4613c 100644
> --- a/include/net/l3mdev.h
> +++ b/include/net/l3mdev.h
> @@ -101,6 +101,17 @@ struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
>  	return master;
>  }
>  
> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex);
> +static inline
> +int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
> +{
> +	rcu_read_lock();
> +	ifindex = l3mdev_master_upper_ifindex_by_index_rcu(net, ifindex);
> +	rcu_read_unlock();
> +
> +	return ifindex;
> +}
> +
>  u32 l3mdev_fib_table_rcu(const struct net_device *dev);
>  u32 l3mdev_fib_table_by_index(struct net *net, int ifindex);
>  static inline u32 l3mdev_fib_table(const struct net_device *dev)
> @@ -207,6 +218,17 @@ static inline int l3mdev_master_ifindex_by_index(struct net *net, int ifindex)
>  	return 0;
>  }
>  
> +static inline
> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
> +{
> +	return 0;
> +}
> +static inline
> +int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
> +{
> +	return 0;
> +}
> +
>  static inline
>  struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
>  {
> diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
> index 8da86ceca33d..309dee76724e 100644
> --- a/net/l3mdev/l3mdev.c
> +++ b/net/l3mdev/l3mdev.c
> @@ -46,6 +46,24 @@ int l3mdev_master_ifindex_rcu(const struct net_device *dev)
>  }
>  EXPORT_SYMBOL_GPL(l3mdev_master_ifindex_rcu);
>  
> +/**
> + *	l3mdev_master_upper_ifindex_by_index - get index of upper l3 master
> + *					       device
> + *	@net: network namespace for device index lookup
> + *	@ifindex: targeted interface
> + */
> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
> +{
> +	struct net_device *dev;
> +
> +	dev = dev_get_by_index_rcu(net, ifindex);
> +	while (dev && !netif_is_l3_master(dev))
> +		dev = netdev_master_upper_dev_get(dev);
> +
> +	return dev ? dev->ifindex : 0;
> +}
> +EXPORT_SYMBOL_GPL(l3mdev_master_upper_ifindex_by_index_rcu);
> +
>  /**
>   *	l3mdev_fib_table - get FIB table id associated with an L3
>   *                             master interface
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 14:23 ` [RFC v3 3/3] vxlan: handle underlay VRF changes Alexis Bauvin
@ 2018-11-20 15:04   ` David Ahern
  2018-11-20 15:35     ` Roopa Prabhu
  2018-11-20 16:27     ` Alexis Bauvin
  0 siblings, 2 replies; 26+ messages in thread
From: David Ahern @ 2018-11-20 15:04 UTC (permalink / raw)
  To: Alexis Bauvin, roopa; +Cc: netdev, akherbouche

On 11/20/18 7:23 AM, Alexis Bauvin wrote:
> When underlay VRF changes, either because the lower device itself changed,
> or its VRF changed, this patch releases the current socket of the VXLAN
> device and recreates another one in the right VRF. This allows for
> on-the-fly change of the underlay VRF of a VXLAN device.
> 
> Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
> Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
> ---
>  drivers/net/vxlan.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 94 insertions(+)
> 
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index a3de08122269..1e6ccad6df6a 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -208,6 +208,18 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
>  	return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
>  }
>  
> +static int vxlan_is_in_l3mdev_chain(struct net_device *chain,
> +				    struct net_device *dev)
> +{
> +	if (!chain)
> +		return 0;
> +
> +	if (chain->ifindex == dev->ifindex)
> +		return 1;
> +	return vxlan_is_in_l3mdev_chain(netdev_master_upper_dev_get(chain),
> +					dev);
> +}

This should return bool and true/false.

Also, why l3mdev in the name? None of the checks look at whether it is
an l3mdev master.

And again here, someone more familiar with the vxlan code should review it.

> +
>  /* Find VXLAN socket based on network namespace, address family and UDP port
>   * and enabled unshareable flags.
>   */
> @@ -3720,6 +3732,33 @@ struct net_device *vxlan_dev_create(struct net *net, const char *name,
>  }
>  EXPORT_SYMBOL_GPL(vxlan_dev_create);
>  
> +static int vxlan_reopen(struct vxlan_net *vn, struct vxlan_dev *vxlan)
> +{
> +	int ret = 0;
> +
> +	if (vxlan_addr_multicast(&vxlan->default_dst.remote_ip) &&
> +	    !vxlan_group_used(vn, vxlan))
> +		ret = vxlan_igmp_leave(vxlan);
> +	vxlan_sock_release(vxlan);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = vxlan_sock_add(vxlan);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (vxlan_addr_multicast(&vxlan->default_dst.remote_ip)) {
> +		ret = vxlan_igmp_join(vxlan);
> +		if (ret == -EADDRINUSE)
> +			ret = 0;
> +		if (ret)
> +			vxlan_sock_release(vxlan);
> +	}
> +
> +	return ret;
> +}
> +
>  static void vxlan_handle_lowerdev_unregister(struct vxlan_net *vn,
>  					     struct net_device *dev)
>  {
> @@ -3742,6 +3781,55 @@ static void vxlan_handle_lowerdev_unregister(struct vxlan_net *vn,
>  	unregister_netdevice_many(&list_kill);
>  }
>  
> +static void vxlan_handle_change_upper(struct vxlan_net *vn,
> +				      struct net_device *dev)
> +{
> +	struct vxlan_dev *vxlan, *next;
> +
> +	list_for_each_entry_safe(vxlan, next, &vn->vxlan_list, next) {
> +		struct net_device *lower;
> +		int err;
> +
> +		lower = __dev_get_by_index(vxlan->net,
> +					   vxlan->cfg.remote_ifindex);
> +		if (!vxlan_is_in_l3mdev_chain(lower, dev))
> +			continue;
> +
> +		err = vxlan_reopen(vn, vxlan);
> +		if (err < 0)
> +			netdev_err(vxlan->dev, "Failed to reopen socket: %d\n",
> +				   err);
> +	}
> +}
> +
> +static void vxlan_handle_change(struct vxlan_net *vn, struct net_device *dev)
> +{
> +	struct vxlan_dev *vxlan = netdev_priv(dev);
> +	struct vxlan_sock *sock;
> +	int l3mdev_index;
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +	bool metadata = vxlan->cfg.flags & VXLAN_F_COLLECT_METADATA;
> +	bool ipv6 = vxlan->cfg.flags & VXLAN_F_IPV6 || metadata;
> +
> +	sock = ipv6 ? rcu_dereference(vxlan->vn6_sock)
> +		    : rcu_dereference(vxlan->vn4_sock);
> +#else
> +	sock = rcu_dereference(vxlan->vn4_sock);
> +#endif
> +
> +	l3mdev_index =
> +		l3mdev_master_upper_ifindex_by_index(vxlan->net,
> +						     vxlan->cfg.remote_ifindex);
> +	if (sock->sock->sk->sk_bound_dev_if != l3mdev_index) {
> +		int ret = vxlan_reopen(vn, vxlan);
> +
> +		if (ret < 0)
> +			netdev_err(vxlan->dev, "Failed to reopen socket: %d\n",
> +				   ret);
> +	}
> +}
> +
>  static int vxlan_netdevice_event(struct notifier_block *unused,
>  				 unsigned long event, void *ptr)
>  {
> @@ -3756,6 +3844,12 @@ static int vxlan_netdevice_event(struct notifier_block *unused,
>  	} else if (event == NETDEV_UDP_TUNNEL_PUSH_INFO ||
>  		   event == NETDEV_UDP_TUNNEL_DROP_INFO) {
>  		vxlan_offload_rx_ports(dev, event == NETDEV_UDP_TUNNEL_PUSH_INFO);
> +	} else if (event == NETDEV_CHANGEUPPER) {
> +		vxlan_handle_change_upper(vn, dev);
> +	} else if (event == NETDEV_CHANGE) {
> +		if (dev->rtnl_link_ops &&
> +		    !strcmp(dev->rtnl_link_ops->kind, vxlan_link_ops.kind))
> +			vxlan_handle_change(vn, dev);
>  	}
>  
>  	return NOTIFY_DONE;
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 2/3] vxlan: add support for underlay in non-default VRF
  2018-11-20 14:23 ` [RFC v3 2/3] vxlan: add support for underlay in non-default VRF Alexis Bauvin
  2018-11-20 14:57   ` David Ahern
@ 2018-11-20 15:25   ` Roopa Prabhu
  2018-11-20 16:14     ` Alexis Bauvin
  1 sibling, 1 reply; 26+ messages in thread
From: Roopa Prabhu @ 2018-11-20 15:25 UTC (permalink / raw)
  To: abauvin; +Cc: David Ahern, netdev, akherbouche

On Tue, Nov 20, 2018 at 6:23 AM Alexis Bauvin <abauvin@scaleway.com> wrote:
>
> Creating a VXLAN device with is underlay in the non-default VRF makes
> egress route lookup fail or incorrect since it will resolve in the
> default VRF, and ingress fail because the socket listens in the default
> VRF.
>
> This patch binds the underlying UDP tunnel socket to the l3mdev of the
> lower device of the VXLAN device. This will listen in the proper VRF and
> output traffic from said l3mdev, matching l3mdev routing rules and
> looking up the correct routing table.
>
> When the VXLAN device does not have a lower device, or the lower device
> is in the default VRF, the socket will not be bound to any interface,
> keeping the previous behaviour.
>
> The underlay l3mdev is deduced from the VXLAN lower device
> (IFLA_VXLAN_LINK).
>
> The l3mdev_master_upper_ifindex_by_index function has been added to
> l3mdev. Its goal is to fetch the effective l3mdev of an interface which
> is not a direct slave of said l3mdev. It handles the following example,
> properly resolving the l3mdev of eth0 to vrf-blue:
>
> +----------+                         +---------+
> |          |                         |         |
> | vrf-blue |                         | vrf-red |
> |          |                         |         |
> +----+-----+                         +----+----+
>      |                                    |
>      |                                    |
> +----+-----+                         +----+----+
> |          |                         |         |
> | br-blue  |                         | br-red  |
> |          |                         |         |
> +----+-----+                         +---+-+---+
>      |                                   | |
>      |                             +-----+ +-----+
>      |                             |             |
> +----+-----+                +------+----+   +----+----+
> |          |  lower device  |           |   |         |
> |   eth0   | <- - - - - - - | vxlan-red |   | tap-red | (... more taps)
> |          |                |           |   |         |
> +----------+                +-----------+   +---------+
>
> Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
> Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
> ---
>  drivers/net/vxlan.c  | 32 ++++++++++++++++++++++++--------
>  include/net/l3mdev.h | 22 ++++++++++++++++++++++
>  net/l3mdev/l3mdev.c  | 18 ++++++++++++++++++
>  3 files changed, 64 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 27bd586b94b0..a3de08122269 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -212,7 +212,7 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
>   * and enabled unshareable flags.
>   */
>  static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
> -                                         __be16 port, u32 flags)
> +                                         __be16 port, u32 flags, int ifindex)
>  {
>         struct vxlan_sock *vs;
>
> @@ -221,7 +221,8 @@ static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
>         hlist_for_each_entry_rcu(vs, vs_head(net, port), hlist) {
>                 if (inet_sk(vs->sock->sk)->inet_sport == port &&
>                     vxlan_get_sk_family(vs) == family &&
> -                   vs->flags == flags)
> +                   vs->flags == flags &&
> +                   vs->sock->sk->sk_bound_dev_if == ifindex)
>                         return vs;
>         }
>         return NULL;
> @@ -261,7 +262,7 @@ static struct vxlan_dev *vxlan_find_vni(struct net *net, int ifindex,
>  {
>         struct vxlan_sock *vs;
>
> -       vs = vxlan_find_sock(net, family, port, flags);
> +       vs = vxlan_find_sock(net, family, port, flags, ifindex);
>         if (!vs)
>                 return NULL;
>
> @@ -2172,6 +2173,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>                 struct rtable *rt;
>                 __be16 df = 0;
>
> +               if (!ifindex)
> +                       ifindex = sock4->sock->sk->sk_bound_dev_if;
> +
>                 rt = vxlan_get_route(vxlan, dev, sock4, skb, ifindex, tos,
>                                      dst->sin.sin_addr.s_addr,
>                                      &local_ip.sin.sin_addr.s_addr,
> @@ -2210,6 +2214,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>         } else {
>                 struct vxlan_sock *sock6 = rcu_dereference(vxlan->vn6_sock);
>
> +               if (!ifindex)
> +                       ifindex = sock6->sock->sk->sk_bound_dev_if;
> +
>                 ndst = vxlan6_get_route(vxlan, dev, sock6, skb, ifindex, tos,
>                                         label, &dst->sin6.sin6_addr,
>                                         &local_ip.sin6.sin6_addr,
> @@ -2813,7 +2820,7 @@ static const struct ethtool_ops vxlan_ethtool_ops = {
>  };
>
>  static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
> -                                       __be16 port, u32 flags)
> +                                       __be16 port, u32 flags, int ifindex)
>  {
>         struct socket *sock;
>         struct udp_port_cfg udp_conf;
> @@ -2831,6 +2838,7 @@ static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
>         }
>
>         udp_conf.local_udp_port = port;
> +       udp_conf.bind_ifindex = ifindex;
>
>         /* Open UDP socket */
>         err = udp_sock_create(net, &udp_conf, &sock);
> @@ -2842,7 +2850,8 @@ static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
>
>  /* Create new listen socket if needed */
>  static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,
> -                                             __be16 port, u32 flags)
> +                                             __be16 port, u32 flags,
> +                                             int ifindex)
>  {
>         struct vxlan_net *vn = net_generic(net, vxlan_net_id);
>         struct vxlan_sock *vs;
> @@ -2857,7 +2866,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,
>         for (h = 0; h < VNI_HASH_SIZE; ++h)
>                 INIT_HLIST_HEAD(&vs->vni_list[h]);
>
> -       sock = vxlan_create_sock(net, ipv6, port, flags);
> +       sock = vxlan_create_sock(net, ipv6, port, flags, ifindex);
>         if (IS_ERR(sock)) {
>                 kfree(vs);
>                 return ERR_CAST(sock);
> @@ -2894,11 +2903,17 @@ static int __vxlan_sock_add(struct vxlan_dev *vxlan, bool ipv6)
>         struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
>         struct vxlan_sock *vs = NULL;
>         struct vxlan_dev_node *node;
> +       int l3mdev_index;
> +
> +       l3mdev_index =
> +               l3mdev_master_upper_ifindex_by_index(vxlan->net,
> +                                                    vxlan->cfg.remote_ifindex);

vxlan->cfg.remote_ifindex is optional, so we can avoid trying to
derive the l3mdev_ifindex for cases where it is not present


>
>         if (!vxlan->cfg.no_share) {
>                 spin_lock(&vn->sock_lock);
>                 vs = vxlan_find_sock(vxlan->net, ipv6 ? AF_INET6 : AF_INET,
> -                                    vxlan->cfg.dst_port, vxlan->cfg.flags);
> +                                    vxlan->cfg.dst_port, vxlan->cfg.flags,
> +                                    l3mdev_index);
>                 if (vs && !refcount_inc_not_zero(&vs->refcnt)) {
>                         spin_unlock(&vn->sock_lock);
>                         return -EBUSY;
> @@ -2907,7 +2922,8 @@ static int __vxlan_sock_add(struct vxlan_dev *vxlan, bool ipv6)
>         }
>         if (!vs)
>                 vs = vxlan_socket_create(vxlan->net, ipv6,
> -                                        vxlan->cfg.dst_port, vxlan->cfg.flags);
> +                                        vxlan->cfg.dst_port, vxlan->cfg.flags,
> +                                        l3mdev_index);
>         if (IS_ERR(vs))
>                 return PTR_ERR(vs);
>  #if IS_ENABLED(CONFIG_IPV6)
> diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
> index 3832099289c5..78fa0ac4613c 100644
> --- a/include/net/l3mdev.h
> +++ b/include/net/l3mdev.h
> @@ -101,6 +101,17 @@ struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
>         return master;
>  }
>
> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex);
> +static inline
> +int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
> +{
> +       rcu_read_lock();
> +       ifindex = l3mdev_master_upper_ifindex_by_index_rcu(net, ifindex);
> +       rcu_read_unlock();
> +
> +       return ifindex;
> +}
> +
>  u32 l3mdev_fib_table_rcu(const struct net_device *dev);
>  u32 l3mdev_fib_table_by_index(struct net *net, int ifindex);
>  static inline u32 l3mdev_fib_table(const struct net_device *dev)
> @@ -207,6 +218,17 @@ static inline int l3mdev_master_ifindex_by_index(struct net *net, int ifindex)
>         return 0;
>  }
>
> +static inline
> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
> +{
> +       return 0;
> +}
> +static inline
> +int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
> +{
> +       return 0;
> +}
> +
>  static inline
>  struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
>  {
> diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
> index 8da86ceca33d..309dee76724e 100644
> --- a/net/l3mdev/l3mdev.c
> +++ b/net/l3mdev/l3mdev.c
> @@ -46,6 +46,24 @@ int l3mdev_master_ifindex_rcu(const struct net_device *dev)
>  }
>  EXPORT_SYMBOL_GPL(l3mdev_master_ifindex_rcu);
>
> +/**
> + *     l3mdev_master_upper_ifindex_by_index - get index of upper l3 master
> + *                                            device
> + *     @net: network namespace for device index lookup
> + *     @ifindex: targeted interface
> + */
> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
> +{
> +       struct net_device *dev;
> +
> +       dev = dev_get_by_index_rcu(net, ifindex);
> +       while (dev && !netif_is_l3_master(dev))
> +               dev = netdev_master_upper_dev_get(dev);
> +
> +       return dev ? dev->ifindex : 0;
> +}
> +EXPORT_SYMBOL_GPL(l3mdev_master_upper_ifindex_by_index_rcu);
> +
>  /**
>   *     l3mdev_fib_table - get FIB table id associated with an L3
>   *                             master interface
> --
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 15:04   ` David Ahern
@ 2018-11-20 15:35     ` Roopa Prabhu
  2018-11-20 15:48       ` David Ahern
  2018-11-20 16:58       ` Alexis Bauvin
  2018-11-20 16:27     ` Alexis Bauvin
  1 sibling, 2 replies; 26+ messages in thread
From: Roopa Prabhu @ 2018-11-20 15:35 UTC (permalink / raw)
  To: David Ahern; +Cc: abauvin, netdev, akherbouche

On Tue, Nov 20, 2018 at 7:04 AM David Ahern <dsa@cumulusnetworks.com> wrote:
>
> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
> > When underlay VRF changes, either because the lower device itself changed,
> > or its VRF changed, this patch releases the current socket of the VXLAN
> > device and recreates another one in the right VRF. This allows for
> > on-the-fly change of the underlay VRF of a VXLAN device.
> >
> > Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
> > Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
> > Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
> > ---
> >  drivers/net/vxlan.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 94 insertions(+)
> >
> > diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> > index a3de08122269..1e6ccad6df6a 100644
> > --- a/drivers/net/vxlan.c
> > +++ b/drivers/net/vxlan.c
> > @@ -208,6 +208,18 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
> >       return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
> >  }
> >
> > +static int vxlan_is_in_l3mdev_chain(struct net_device *chain,
> > +                                 struct net_device *dev)
> > +{
> > +     if (!chain)
> > +             return 0;
> > +
> > +     if (chain->ifindex == dev->ifindex)
> > +             return 1;
> > +     return vxlan_is_in_l3mdev_chain(netdev_master_upper_dev_get(chain),
> > +                                     dev);
> > +}
>
> This should return bool and true/false.
>
> Also, why l3mdev in the name? None of the checks look at whether it is
> an l3mdev master.
>
> And again here, someone more familiar with the vxlan code should review it.
>


I understand the need for patch 2. But I don't understand the need for
the complexity this patch introduces (especially implicit down and up
of the vxlan device).
Alexis, If your underlay routing changes,  you can down and up the
vxlan device from user-space correct ?. This should be true for any
tunnel device.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 15:35     ` Roopa Prabhu
@ 2018-11-20 15:48       ` David Ahern
  2018-11-20 16:13         ` David Ahern
  2018-11-20 16:58       ` Alexis Bauvin
  1 sibling, 1 reply; 26+ messages in thread
From: David Ahern @ 2018-11-20 15:48 UTC (permalink / raw)
  To: Roopa Prabhu; +Cc: abauvin, netdev, akherbouche

On 11/20/18 8:35 AM, Roopa Prabhu wrote:
> On Tue, Nov 20, 2018 at 7:04 AM David Ahern <dsa@cumulusnetworks.com> wrote:
>>
>> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
>>> When underlay VRF changes, either because the lower device itself changed,
>>> or its VRF changed, this patch releases the current socket of the VXLAN
>>> device and recreates another one in the right VRF. This allows for
>>> on-the-fly change of the underlay VRF of a VXLAN device.
>>>
>>> Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
>>> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
>>> Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
>>> ---
>>>  drivers/net/vxlan.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 94 insertions(+)
>>>
>>> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>>> index a3de08122269..1e6ccad6df6a 100644
>>> --- a/drivers/net/vxlan.c
>>> +++ b/drivers/net/vxlan.c
>>> @@ -208,6 +208,18 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
>>>       return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
>>>  }
>>>
>>> +static int vxlan_is_in_l3mdev_chain(struct net_device *chain,
>>> +                                 struct net_device *dev)
>>> +{
>>> +     if (!chain)
>>> +             return 0;
>>> +
>>> +     if (chain->ifindex == dev->ifindex)
>>> +             return 1;
>>> +     return vxlan_is_in_l3mdev_chain(netdev_master_upper_dev_get(chain),
>>> +                                     dev);
>>> +}
>>
>> This should return bool and true/false.
>>
>> Also, why l3mdev in the name? None of the checks look at whether it is
>> an l3mdev master.
>>
>> And again here, someone more familiar with the vxlan code should review it.
>>
> 
> 
> I understand the need for patch 2. But I don't understand the need for
> the complexity this patch introduces (especially implicit down and up
> of the vxlan device).
> Alexis, If your underlay routing changes,  you can down and up the
> vxlan device from user-space correct ?. This should be true for any
> tunnel device.
> 

I believe this patch handles changes to the VRF association of the bridge.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 2/3] vxlan: add support for underlay in non-default VRF
  2018-11-20 14:57   ` David Ahern
@ 2018-11-20 16:11     ` Alexis Bauvin
  0 siblings, 0 replies; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-20 16:11 UTC (permalink / raw)
  To: David Ahern, roopa; +Cc: netdev, akherbouche

Le 20 nov. 2018 à 15:57, David Ahern <dsa@cumulusnetworks.com> a écrit :
> 
> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
>> Creating a VXLAN device with is underlay in the non-default VRF makes
>> egress route lookup fail or incorrect since it will resolve in the
>> default VRF, and ingress fail because the socket listens in the default
>> VRF.
>> 
>> This patch binds the underlying UDP tunnel socket to the l3mdev of the
>> lower device of the VXLAN device. This will listen in the proper VRF and
>> output traffic from said l3mdev, matching l3mdev routing rules and
>> looking up the correct routing table.
>> 
>> When the VXLAN device does not have a lower device, or the lower device
>> is in the default VRF, the socket will not be bound to any interface,
>> keeping the previous behaviour.
>> 
>> The underlay l3mdev is deduced from the VXLAN lower device
>> (IFLA_VXLAN_LINK).
>> 
>> The l3mdev_master_upper_ifindex_by_index function has been added to
>> l3mdev. Its goal is to fetch the effective l3mdev of an interface which
>> is not a direct slave of said l3mdev. It handles the following example,
>> properly resolving the l3mdev of eth0 to vrf-blue:
>> 
>> +----------+                         +---------+
>> |          |                         |         |
>> | vrf-blue |                         | vrf-red |
>> |          |                         |         |
>> +----+-----+                         +----+----+
>>     |                                    |
>>     |                                    |
>> +----+-----+                         +----+----+
>> |          |                         |         |
>> | br-blue  |                         | br-red  |
>> |          |                         |         |
>> +----+-----+                         +---+-+---+
>>     |                                   | |
>>     |                             +-----+ +-----+
>>     |                             |             |
>> +----+-----+                +------+----+   +----+----+
>> |          |  lower device  |           |   |         |
>> |   eth0   | <- - - - - - - | vxlan-red |   | tap-red | (... more taps)
>> |          |                |           |   |         |
>> +----------+                +-----------+   +---------+
> 
> same here. Very helpful diagram.
> 
> 
>> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>> index 27bd586b94b0..a3de08122269 100644
>> --- a/drivers/net/vxlan.c
>> +++ b/drivers/net/vxlan.c
> 
> The vxlan changes look ok to me. It would be good for someone who know
> that code better than I do to review it.
> 
> Move the following l3mdev changes to a separate patch - introduce infra
> changes separate from their use:

Sure, will do in next version. Thanks!

>> diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
>> index 3832099289c5..78fa0ac4613c 100644
>> --- a/include/net/l3mdev.h
>> +++ b/include/net/l3mdev.h
>> @@ -101,6 +101,17 @@ struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
>> 	return master;
>> }
>> 
>> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex);
>> +static inline
>> +int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
>> +{
>> +	rcu_read_lock();
>> +	ifindex = l3mdev_master_upper_ifindex_by_index_rcu(net, ifindex);
>> +	rcu_read_unlock();
>> +
>> +	return ifindex;
>> +}
>> +
>> u32 l3mdev_fib_table_rcu(const struct net_device *dev);
>> u32 l3mdev_fib_table_by_index(struct net *net, int ifindex);
>> static inline u32 l3mdev_fib_table(const struct net_device *dev)
>> @@ -207,6 +218,17 @@ static inline int l3mdev_master_ifindex_by_index(struct net *net, int ifindex)
>> 	return 0;
>> }
>> 
>> +static inline
>> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
>> +{
>> +	return 0;
>> +}
>> +static inline
>> +int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
>> +{
>> +	return 0;
>> +}
>> +
>> static inline
>> struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
>> {
>> diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
>> index 8da86ceca33d..309dee76724e 100644
>> --- a/net/l3mdev/l3mdev.c
>> +++ b/net/l3mdev/l3mdev.c
>> @@ -46,6 +46,24 @@ int l3mdev_master_ifindex_rcu(const struct net_device *dev)
>> }
>> EXPORT_SYMBOL_GPL(l3mdev_master_ifindex_rcu);
>> 
>> +/**
>> + *	l3mdev_master_upper_ifindex_by_index - get index of upper l3 master
>> + *					       device
>> + *	@net: network namespace for device index lookup
>> + *	@ifindex: targeted interface
>> + */
>> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
>> +{
>> +	struct net_device *dev;
>> +
>> +	dev = dev_get_by_index_rcu(net, ifindex);
>> +	while (dev && !netif_is_l3_master(dev))
>> +		dev = netdev_master_upper_dev_get(dev);
>> +
>> +	return dev ? dev->ifindex : 0;
>> +}
>> +EXPORT_SYMBOL_GPL(l3mdev_master_upper_ifindex_by_index_rcu);
>> +
>> /**
>>  *	l3mdev_fib_table - get FIB table id associated with an L3
>>  *                             master interface

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 15:48       ` David Ahern
@ 2018-11-20 16:13         ` David Ahern
  2018-11-20 18:19           ` Alexis Bauvin
  0 siblings, 1 reply; 26+ messages in thread
From: David Ahern @ 2018-11-20 16:13 UTC (permalink / raw)
  To: Roopa Prabhu; +Cc: abauvin, netdev, akherbouche

On 11/20/18 8:48 AM, David Ahern wrote:
> On 11/20/18 8:35 AM, Roopa Prabhu wrote:
>> On Tue, Nov 20, 2018 at 7:04 AM David Ahern <dsa@cumulusnetworks.com> wrote:
>>>
>>> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
>>>> When underlay VRF changes, either because the lower device itself changed,
>>>> or its VRF changed, this patch releases the current socket of the VXLAN
>>>> device and recreates another one in the right VRF. This allows for
>>>> on-the-fly change of the underlay VRF of a VXLAN device.
>>>>
>>>> Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
>>>> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
>>>> Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
>>>> ---
>>>>  drivers/net/vxlan.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 94 insertions(+)
>>>>
>>>> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>>>> index a3de08122269..1e6ccad6df6a 100644
>>>> --- a/drivers/net/vxlan.c
>>>> +++ b/drivers/net/vxlan.c
>>>> @@ -208,6 +208,18 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
>>>>       return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
>>>>  }
>>>>
>>>> +static int vxlan_is_in_l3mdev_chain(struct net_device *chain,
>>>> +                                 struct net_device *dev)
>>>> +{
>>>> +     if (!chain)
>>>> +             return 0;
>>>> +
>>>> +     if (chain->ifindex == dev->ifindex)
>>>> +             return 1;
>>>> +     return vxlan_is_in_l3mdev_chain(netdev_master_upper_dev_get(chain),
>>>> +                                     dev);
>>>> +}
>>>
>>> This should return bool and true/false.
>>>
>>> Also, why l3mdev in the name? None of the checks look at whether it is
>>> an l3mdev master.
>>>
>>> And again here, someone more familiar with the vxlan code should review it.
>>>
>>
>>
>> I understand the need for patch 2. But I don't understand the need for
>> the complexity this patch introduces (especially implicit down and up
>> of the vxlan device).
>> Alexis, If your underlay routing changes,  you can down and up the
>> vxlan device from user-space correct ?. This should be true for any
>> tunnel device.
>>
> 
> I believe this patch handles changes to the VRF association of the bridge.
> 
Re-reading the commit message, this handles changes in VRF association
of the lower device.

If the vxlan socket in general (vrf or not) can be bound to the lower
device instead of the VRF then it simplifies things a lot.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 2/3] vxlan: add support for underlay in non-default VRF
  2018-11-20 15:25   ` Roopa Prabhu
@ 2018-11-20 16:14     ` Alexis Bauvin
  0 siblings, 0 replies; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-20 16:14 UTC (permalink / raw)
  To: Roopa Prabhu; +Cc: David Ahern, netdev, akherbouche

Le 20 nov. 2018 à 16:25, Roopa Prabhu <roopa@cumulusnetworks.com> a écrit :
> 
> On Tue, Nov 20, 2018 at 6:23 AM Alexis Bauvin <abauvin@scaleway.com> wrote:
>> 
>> Creating a VXLAN device with is underlay in the non-default VRF makes
>> egress route lookup fail or incorrect since it will resolve in the
>> default VRF, and ingress fail because the socket listens in the default
>> VRF.
>> 
>> This patch binds the underlying UDP tunnel socket to the l3mdev of the
>> lower device of the VXLAN device. This will listen in the proper VRF and
>> output traffic from said l3mdev, matching l3mdev routing rules and
>> looking up the correct routing table.
>> 
>> When the VXLAN device does not have a lower device, or the lower device
>> is in the default VRF, the socket will not be bound to any interface,
>> keeping the previous behaviour.
>> 
>> The underlay l3mdev is deduced from the VXLAN lower device
>> (IFLA_VXLAN_LINK).
>> 
>> The l3mdev_master_upper_ifindex_by_index function has been added to
>> l3mdev. Its goal is to fetch the effective l3mdev of an interface which
>> is not a direct slave of said l3mdev. It handles the following example,
>> properly resolving the l3mdev of eth0 to vrf-blue:
>> 
>> +----------+                         +---------+
>> |          |                         |         |
>> | vrf-blue |                         | vrf-red |
>> |          |                         |         |
>> +----+-----+                         +----+----+
>>     |                                    |
>>     |                                    |
>> +----+-----+                         +----+----+
>> |          |                         |         |
>> | br-blue  |                         | br-red  |
>> |          |                         |         |
>> +----+-----+                         +---+-+---+
>>     |                                   | |
>>     |                             +-----+ +-----+
>>     |                             |             |
>> +----+-----+                +------+----+   +----+----+
>> |          |  lower device  |           |   |         |
>> |   eth0   | <- - - - - - - | vxlan-red |   | tap-red | (... more taps)
>> |          |                |           |   |         |
>> +----------+                +-----------+   +---------+
>> 
>> Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
>> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
>> Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
>> ---
>> drivers/net/vxlan.c  | 32 ++++++++++++++++++++++++--------
>> include/net/l3mdev.h | 22 ++++++++++++++++++++++
>> net/l3mdev/l3mdev.c  | 18 ++++++++++++++++++
>> 3 files changed, 64 insertions(+), 8 deletions(-)
>> 
>> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>> index 27bd586b94b0..a3de08122269 100644
>> --- a/drivers/net/vxlan.c
>> +++ b/drivers/net/vxlan.c
>> @@ -212,7 +212,7 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
>>  * and enabled unshareable flags.
>>  */
>> static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
>> -                                         __be16 port, u32 flags)
>> +                                         __be16 port, u32 flags, int ifindex)
>> {
>>        struct vxlan_sock *vs;
>> 
>> @@ -221,7 +221,8 @@ static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
>>        hlist_for_each_entry_rcu(vs, vs_head(net, port), hlist) {
>>                if (inet_sk(vs->sock->sk)->inet_sport == port &&
>>                    vxlan_get_sk_family(vs) == family &&
>> -                   vs->flags == flags)
>> +                   vs->flags == flags &&
>> +                   vs->sock->sk->sk_bound_dev_if == ifindex)
>>                        return vs;
>>        }
>>        return NULL;
>> @@ -261,7 +262,7 @@ static struct vxlan_dev *vxlan_find_vni(struct net *net, int ifindex,
>> {
>>        struct vxlan_sock *vs;
>> 
>> -       vs = vxlan_find_sock(net, family, port, flags);
>> +       vs = vxlan_find_sock(net, family, port, flags, ifindex);
>>        if (!vs)
>>                return NULL;
>> 
>> @@ -2172,6 +2173,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>>                struct rtable *rt;
>>                __be16 df = 0;
>> 
>> +               if (!ifindex)
>> +                       ifindex = sock4->sock->sk->sk_bound_dev_if;
>> +
>>                rt = vxlan_get_route(vxlan, dev, sock4, skb, ifindex, tos,
>>                                     dst->sin.sin_addr.s_addr,
>>                                     &local_ip.sin.sin_addr.s_addr,
>> @@ -2210,6 +2214,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>>        } else {
>>                struct vxlan_sock *sock6 = rcu_dereference(vxlan->vn6_sock);
>> 
>> +               if (!ifindex)
>> +                       ifindex = sock6->sock->sk->sk_bound_dev_if;
>> +
>>                ndst = vxlan6_get_route(vxlan, dev, sock6, skb, ifindex, tos,
>>                                        label, &dst->sin6.sin6_addr,
>>                                        &local_ip.sin6.sin6_addr,
>> @@ -2813,7 +2820,7 @@ static const struct ethtool_ops vxlan_ethtool_ops = {
>> };
>> 
>> static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
>> -                                       __be16 port, u32 flags)
>> +                                       __be16 port, u32 flags, int ifindex)
>> {
>>        struct socket *sock;
>>        struct udp_port_cfg udp_conf;
>> @@ -2831,6 +2838,7 @@ static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
>>        }
>> 
>>        udp_conf.local_udp_port = port;
>> +       udp_conf.bind_ifindex = ifindex;
>> 
>>        /* Open UDP socket */
>>        err = udp_sock_create(net, &udp_conf, &sock);
>> @@ -2842,7 +2850,8 @@ static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
>> 
>> /* Create new listen socket if needed */
>> static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,
>> -                                             __be16 port, u32 flags)
>> +                                             __be16 port, u32 flags,
>> +                                             int ifindex)
>> {
>>        struct vxlan_net *vn = net_generic(net, vxlan_net_id);
>>        struct vxlan_sock *vs;
>> @@ -2857,7 +2866,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,
>>        for (h = 0; h < VNI_HASH_SIZE; ++h)
>>                INIT_HLIST_HEAD(&vs->vni_list[h]);
>> 
>> -       sock = vxlan_create_sock(net, ipv6, port, flags);
>> +       sock = vxlan_create_sock(net, ipv6, port, flags, ifindex);
>>        if (IS_ERR(sock)) {
>>                kfree(vs);
>>                return ERR_CAST(sock);
>> @@ -2894,11 +2903,17 @@ static int __vxlan_sock_add(struct vxlan_dev *vxlan, bool ipv6)
>>        struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
>>        struct vxlan_sock *vs = NULL;
>>        struct vxlan_dev_node *node;
>> +       int l3mdev_index;
>> +
>> +       l3mdev_index =
>> +               l3mdev_master_upper_ifindex_by_index(vxlan->net,
>> +                                                    vxlan->cfg.remote_ifindex);
> 
> vxlan->cfg.remote_ifindex is optional, so we can avoid trying to
> derive the l3mdev_ifindex for cases where it is not present

So you do suggest to check first remote_ifindex, and derive
the l3mdev_ifindex only if not zero? If so, I will add it
in next version.

>> 
>>        if (!vxlan->cfg.no_share) {
>>                spin_lock(&vn->sock_lock);
>>                vs = vxlan_find_sock(vxlan->net, ipv6 ? AF_INET6 : AF_INET,
>> -                                    vxlan->cfg.dst_port, vxlan->cfg.flags);
>> +                                    vxlan->cfg.dst_port, vxlan->cfg.flags,
>> +                                    l3mdev_index);
>>                if (vs && !refcount_inc_not_zero(&vs->refcnt)) {
>>                        spin_unlock(&vn->sock_lock);
>>                        return -EBUSY;
>> @@ -2907,7 +2922,8 @@ static int __vxlan_sock_add(struct vxlan_dev *vxlan, bool ipv6)
>>        }
>>        if (!vs)
>>                vs = vxlan_socket_create(vxlan->net, ipv6,
>> -                                        vxlan->cfg.dst_port, vxlan->cfg.flags);
>> +                                        vxlan->cfg.dst_port, vxlan->cfg.flags,
>> +                                        l3mdev_index);
>>        if (IS_ERR(vs))
>>                return PTR_ERR(vs);
>> #if IS_ENABLED(CONFIG_IPV6)
>> diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
>> index 3832099289c5..78fa0ac4613c 100644
>> --- a/include/net/l3mdev.h
>> +++ b/include/net/l3mdev.h
>> @@ -101,6 +101,17 @@ struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
>>        return master;
>> }
>> 
>> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex);
>> +static inline
>> +int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
>> +{
>> +       rcu_read_lock();
>> +       ifindex = l3mdev_master_upper_ifindex_by_index_rcu(net, ifindex);
>> +       rcu_read_unlock();
>> +
>> +       return ifindex;
>> +}
>> +
>> u32 l3mdev_fib_table_rcu(const struct net_device *dev);
>> u32 l3mdev_fib_table_by_index(struct net *net, int ifindex);
>> static inline u32 l3mdev_fib_table(const struct net_device *dev)
>> @@ -207,6 +218,17 @@ static inline int l3mdev_master_ifindex_by_index(struct net *net, int ifindex)
>>        return 0;
>> }
>> 
>> +static inline
>> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
>> +{
>> +       return 0;
>> +}
>> +static inline
>> +int l3mdev_master_upper_ifindex_by_index(struct net *net, int ifindex)
>> +{
>> +       return 0;
>> +}
>> +
>> static inline
>> struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
>> {
>> diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
>> index 8da86ceca33d..309dee76724e 100644
>> --- a/net/l3mdev/l3mdev.c
>> +++ b/net/l3mdev/l3mdev.c
>> @@ -46,6 +46,24 @@ int l3mdev_master_ifindex_rcu(const struct net_device *dev)
>> }
>> EXPORT_SYMBOL_GPL(l3mdev_master_ifindex_rcu);
>> 
>> +/**
>> + *     l3mdev_master_upper_ifindex_by_index - get index of upper l3 master
>> + *                                            device
>> + *     @net: network namespace for device index lookup
>> + *     @ifindex: targeted interface
>> + */
>> +int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net, int ifindex)
>> +{
>> +       struct net_device *dev;
>> +
>> +       dev = dev_get_by_index_rcu(net, ifindex);
>> +       while (dev && !netif_is_l3_master(dev))
>> +               dev = netdev_master_upper_dev_get(dev);
>> +
>> +       return dev ? dev->ifindex : 0;
>> +}
>> +EXPORT_SYMBOL_GPL(l3mdev_master_upper_ifindex_by_index_rcu);
>> +
>> /**
>>  *     l3mdev_fib_table - get FIB table id associated with an L3
>>  *                             master interface
>> --

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 15:04   ` David Ahern
  2018-11-20 15:35     ` Roopa Prabhu
@ 2018-11-20 16:27     ` Alexis Bauvin
  2018-11-20 16:54       ` David Ahern
  1 sibling, 1 reply; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-20 16:27 UTC (permalink / raw)
  To: David Ahern, roopa; +Cc: netdev, akherbouche

Le 20 nov. 2018 à 16:04, David Ahern <dsa@cumulusnetworks.com> a écrit :
> 
> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
>> When underlay VRF changes, either because the lower device itself changed,
>> or its VRF changed, this patch releases the current socket of the VXLAN
>> device and recreates another one in the right VRF. This allows for
>> on-the-fly change of the underlay VRF of a VXLAN device.
>> 
>> Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
>> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
>> Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
>> ---
>> drivers/net/vxlan.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 94 insertions(+)
>> 
>> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>> index a3de08122269..1e6ccad6df6a 100644
>> --- a/drivers/net/vxlan.c
>> +++ b/drivers/net/vxlan.c
>> @@ -208,6 +208,18 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
>> 	return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
>> }
>> 
>> +static int vxlan_is_in_l3mdev_chain(struct net_device *chain,
>> +				    struct net_device *dev)
>> +{
>> +	if (!chain)
>> +		return 0;
>> +
>> +	if (chain->ifindex == dev->ifindex)
>> +		return 1;
>> +	return vxlan_is_in_l3mdev_chain(netdev_master_upper_dev_get(chain),
>> +					dev);
>> +}
> 
> This should return bool and true/false.

Will change in v4.

> Also, why l3mdev in the name? None of the checks look at whether it is
> an l3mdev master.

The original intention was to detect if, through a master change, the
l3mdev of the lower device changed. However, it is right it is much more
generic than that. It will tell if dev is a master (direct or indirect)
of chain.

A rename to vxlan_is_upper_master would be preferred, maybe even move
the function to net/core/dev.c, with other master-related functions.
What do you think?

> And again here, someone more familiar with the vxlan code should review it.
> 
>> +
>> /* Find VXLAN socket based on network namespace, address family and UDP port
>>  * and enabled unshareable flags.
>>  */
>> @@ -3720,6 +3732,33 @@ struct net_device *vxlan_dev_create(struct net *net, const char *name,
>> }
>> EXPORT_SYMBOL_GPL(vxlan_dev_create);
>> 
>> +static int vxlan_reopen(struct vxlan_net *vn, struct vxlan_dev *vxlan)
>> +{
>> +	int ret = 0;
>> +
>> +	if (vxlan_addr_multicast(&vxlan->default_dst.remote_ip) &&
>> +	    !vxlan_group_used(vn, vxlan))
>> +		ret = vxlan_igmp_leave(vxlan);
>> +	vxlan_sock_release(vxlan);
>> +
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	ret = vxlan_sock_add(vxlan);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (vxlan_addr_multicast(&vxlan->default_dst.remote_ip)) {
>> +		ret = vxlan_igmp_join(vxlan);
>> +		if (ret == -EADDRINUSE)
>> +			ret = 0;
>> +		if (ret)
>> +			vxlan_sock_release(vxlan);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> static void vxlan_handle_lowerdev_unregister(struct vxlan_net *vn,
>> 					     struct net_device *dev)
>> {
>> @@ -3742,6 +3781,55 @@ static void vxlan_handle_lowerdev_unregister(struct vxlan_net *vn,
>> 	unregister_netdevice_many(&list_kill);
>> }
>> 
>> +static void vxlan_handle_change_upper(struct vxlan_net *vn,
>> +				      struct net_device *dev)
>> +{
>> +	struct vxlan_dev *vxlan, *next;
>> +
>> +	list_for_each_entry_safe(vxlan, next, &vn->vxlan_list, next) {
>> +		struct net_device *lower;
>> +		int err;
>> +
>> +		lower = __dev_get_by_index(vxlan->net,
>> +					   vxlan->cfg.remote_ifindex);
>> +		if (!vxlan_is_in_l3mdev_chain(lower, dev))
>> +			continue;
>> +
>> +		err = vxlan_reopen(vn, vxlan);
>> +		if (err < 0)
>> +			netdev_err(vxlan->dev, "Failed to reopen socket: %d\n",
>> +				   err);
>> +	}
>> +}
>> +
>> +static void vxlan_handle_change(struct vxlan_net *vn, struct net_device *dev)
>> +{
>> +	struct vxlan_dev *vxlan = netdev_priv(dev);
>> +	struct vxlan_sock *sock;
>> +	int l3mdev_index;
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +	bool metadata = vxlan->cfg.flags & VXLAN_F_COLLECT_METADATA;
>> +	bool ipv6 = vxlan->cfg.flags & VXLAN_F_IPV6 || metadata;
>> +
>> +	sock = ipv6 ? rcu_dereference(vxlan->vn6_sock)
>> +		    : rcu_dereference(vxlan->vn4_sock);
>> +#else
>> +	sock = rcu_dereference(vxlan->vn4_sock);
>> +#endif
>> +
>> +	l3mdev_index =
>> +		l3mdev_master_upper_ifindex_by_index(vxlan->net,
>> +						     vxlan->cfg.remote_ifindex);
>> +	if (sock->sock->sk->sk_bound_dev_if != l3mdev_index) {
>> +		int ret = vxlan_reopen(vn, vxlan);
>> +
>> +		if (ret < 0)
>> +			netdev_err(vxlan->dev, "Failed to reopen socket: %d\n",
>> +				   ret);
>> +	}
>> +}
>> +
>> static int vxlan_netdevice_event(struct notifier_block *unused,
>> 				 unsigned long event, void *ptr)
>> {
>> @@ -3756,6 +3844,12 @@ static int vxlan_netdevice_event(struct notifier_block *unused,
>> 	} else if (event == NETDEV_UDP_TUNNEL_PUSH_INFO ||
>> 		   event == NETDEV_UDP_TUNNEL_DROP_INFO) {
>> 		vxlan_offload_rx_ports(dev, event == NETDEV_UDP_TUNNEL_PUSH_INFO);
>> +	} else if (event == NETDEV_CHANGEUPPER) {
>> +		vxlan_handle_change_upper(vn, dev);
>> +	} else if (event == NETDEV_CHANGE) {
>> +		if (dev->rtnl_link_ops &&
>> +		    !strcmp(dev->rtnl_link_ops->kind, vxlan_link_ops.kind))
>> +			vxlan_handle_change(vn, dev);
>> 	}
>> 
>> 	return NOTIFY_DONE;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 16:27     ` Alexis Bauvin
@ 2018-11-20 16:54       ` David Ahern
  0 siblings, 0 replies; 26+ messages in thread
From: David Ahern @ 2018-11-20 16:54 UTC (permalink / raw)
  To: Alexis Bauvin, roopa; +Cc: netdev, akherbouche

On 11/20/18 9:27 AM, Alexis Bauvin wrote:
>  maybe even move
> the function to net/core/dev.c, with other master-related functions.
> What do you think?

yes, it is a generic function.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 15:35     ` Roopa Prabhu
  2018-11-20 15:48       ` David Ahern
@ 2018-11-20 16:58       ` Alexis Bauvin
  2018-11-20 17:09         ` David Ahern
  1 sibling, 1 reply; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-20 16:58 UTC (permalink / raw)
  To: Roopa Prabhu, David Ahern; +Cc: netdev, akherbouche

Le 20 nov. 2018 à 16:35, Roopa Prabhu <roopa@cumulusnetworks.com> a écrit :
> 
> On Tue, Nov 20, 2018 at 7:04 AM David Ahern <dsa@cumulusnetworks.com> wrote:
>> 
>> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
>>> When underlay VRF changes, either because the lower device itself changed,
>>> or its VRF changed, this patch releases the current socket of the VXLAN
>>> device and recreates another one in the right VRF. This allows for
>>> on-the-fly change of the underlay VRF of a VXLAN device.
>>> 
>>> Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
>>> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
>>> Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
>>> ---
>>> drivers/net/vxlan.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 94 insertions(+)
>>> 
>>> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>>> index a3de08122269..1e6ccad6df6a 100644
>>> --- a/drivers/net/vxlan.c
>>> +++ b/drivers/net/vxlan.c
>>> @@ -208,6 +208,18 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
>>>      return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
>>> }
>>> 
>>> +static int vxlan_is_in_l3mdev_chain(struct net_device *chain,
>>> +                                 struct net_device *dev)
>>> +{
>>> +     if (!chain)
>>> +             return 0;
>>> +
>>> +     if (chain->ifindex == dev->ifindex)
>>> +             return 1;
>>> +     return vxlan_is_in_l3mdev_chain(netdev_master_upper_dev_get(chain),
>>> +                                     dev);
>>> +}
>> 
>> This should return bool and true/false.
>> 
>> Also, why l3mdev in the name? None of the checks look at whether it is
>> an l3mdev master.
>> 
>> And again here, someone more familiar with the vxlan code should review it.
>> 
> 
> 
> I understand the need for patch 2. But I don't understand the need for
> the complexity this patch introduces (especially implicit down and up
> of the vxlan device).
> Alexis, If your underlay routing changes,  you can down and up the
> vxlan device from user-space correct ?. This should be true for any
> tunnel device.

Yes. Without this patch, down-up the vxlan interface is enough to handle
the underlay VRF change. This patch is more a convenience than a real
need.

Furthermore, an issue with automatic down/up of the interface is a silent
fail when mixing underlays in the default vrf and in specific ones. Say
you are in the following situation:

       +----------+
       |          |
       | vrf-blue |
       |          |
       +--+----+--+
          |    |
     +----+    +----+
     |              |
+----+----+    +----+----+
|         |    |         |
| dummy-a |    | dummy-b |
|         |    |         |
+---------+    +---------+
     .              .
     .(lower device).
     .              .
+---------+    +---------+
|         |    |         |
| vxlan-a |    | vxlan-b |
|         |    |         |
+---------+    +---------+

A socket bound to vrf-blue listens on *:4789, thus owning the port. If moving an
underlay to the default vrf (ip link set dummy-b nomaster), a new socket will be
created, unbound to any interface and listening on *:4789. However, because it
will be in the default vrf, it will try to take ownership of port 4789 on ALL
vrfs, and fail because this port is already owned in vrf-blue for vxlan-a.

At least, a manual down/up of the vxlan interface would fail when upping the
interface, and spit out "Address already in use" through netlink to the
responsible ip link set up.

Dropping this patch would be sensible given its implicit actions, or fixing it
to make the vxlan interface really down when such a situation happens.

It may not be the right place to ask, but I don’t know the reason behind this
default vrf behaviour.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 16:58       ` Alexis Bauvin
@ 2018-11-20 17:09         ` David Ahern
  2018-11-21 14:05           ` Alexis Bauvin
  0 siblings, 1 reply; 26+ messages in thread
From: David Ahern @ 2018-11-20 17:09 UTC (permalink / raw)
  To: Alexis Bauvin, Roopa Prabhu; +Cc: netdev, akherbouche

On 11/20/18 9:58 AM, Alexis Bauvin wrote:
> A socket bound to vrf-blue listens on *:4789, thus owning the port. If moving an
> underlay to the default vrf (ip link set dummy-b nomaster), a new socket will be
> created, unbound to any interface and listening on *:4789. However, because it
> will be in the default vrf, it will try to take ownership of port 4789 on ALL
> vrfs, and fail because this port is already owned in vrf-blue for vxlan-a.

SO_REUSEPORT will fix that and incoming traffic through a vrf and
default (non-)vrf will work. The recent changes by Vyatta provide even
better isolation of default vrf and overlapping ports.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 16:13         ` David Ahern
@ 2018-11-20 18:19           ` Alexis Bauvin
  0 siblings, 0 replies; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-20 18:19 UTC (permalink / raw)
  To: David Ahern, Roopa Prabhu; +Cc: netdev, akherbouche

Le 20 nov. 2018 à 17:13, David Ahern <dsa@cumulusnetworks.com> a écrit :
> On 11/20/18 8:48 AM, David Ahern wrote:
>> On 11/20/18 8:35 AM, Roopa Prabhu wrote:
>>> On Tue, Nov 20, 2018 at 7:04 AM David Ahern <dsa@cumulusnetworks.com> wrote:
>>>> 
>>>> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
>>>>> When underlay VRF changes, either because the lower device itself changed,
>>>>> or its VRF changed, this patch releases the current socket of the VXLAN
>>>>> device and recreates another one in the right VRF. This allows for
>>>>> on-the-fly change of the underlay VRF of a VXLAN device.
>>>>> 
>>>>> Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
>>>>> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
>>>>> Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
>>>>> ---
>>>>> drivers/net/vxlan.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 94 insertions(+)
>>>>> 
>>>>> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>>>>> index a3de08122269..1e6ccad6df6a 100644
>>>>> --- a/drivers/net/vxlan.c
>>>>> +++ b/drivers/net/vxlan.c
>>>>> @@ -208,6 +208,18 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
>>>>>      return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
>>>>> }
>>>>> 
>>>>> +static int vxlan_is_in_l3mdev_chain(struct net_device *chain,
>>>>> +                                 struct net_device *dev)
>>>>> +{
>>>>> +     if (!chain)
>>>>> +             return 0;
>>>>> +
>>>>> +     if (chain->ifindex == dev->ifindex)
>>>>> +             return 1;
>>>>> +     return vxlan_is_in_l3mdev_chain(netdev_master_upper_dev_get(chain),
>>>>> +                                     dev);
>>>>> +}
>>>> 
>>>> This should return bool and true/false.
>>>> 
>>>> Also, why l3mdev in the name? None of the checks look at whether it is
>>>> an l3mdev master.
>>>> 
>>>> And again here, someone more familiar with the vxlan code should review it.
>>>> 
>>> 
>>> 
>>> I understand the need for patch 2. But I don't understand the need for
>>> the complexity this patch introduces (especially implicit down and up
>>> of the vxlan device).
>>> Alexis, If your underlay routing changes,  you can down and up the
>>> vxlan device from user-space correct ?. This should be true for any
>>> tunnel device.
>>> 
>> 
>> I believe this patch handles changes to the VRF association of the bridge.
>> 
> Re-reading the commit message, this handles changes in VRF association
> of the lower device.

Yes

> If the vxlan socket in general (vrf or not) can be bound to the lower
> device instead of the VRF then it simplifies things a lot.

It would indeed make things simpler, as this patch would mostly be useless.
There is no technical reason not to have done so, except for the conceptual
idea that you bind to the vrf device to add vrf-awareness.

It only removes part of the complexity, as handling a change of lower device
itself would still be needed (ip link set vxlan-blue type vxlan dev eth1)
to rebind the socket.

I will try to bind the sockets to the lower device to see if there are any
shortcomings to this method.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 0/3] Add VRF support for VXLAN underlay
  2018-11-20 14:23 [RFC v3 0/3] Add VRF support for VXLAN underlay Alexis Bauvin
                   ` (3 preceding siblings ...)
  2018-11-20 14:48 ` [RFC v3 0/3] Add VRF support for VXLAN underlay David Ahern
@ 2018-11-20 21:45 ` David Ahern
  2018-11-21 13:30   ` Alexis Bauvin
  4 siblings, 1 reply; 26+ messages in thread
From: David Ahern @ 2018-11-20 21:45 UTC (permalink / raw)
  To: Alexis Bauvin, roopa; +Cc: netdev, akherbouche

On 11/20/18 7:23 AM, Alexis Bauvin wrote:
> We are trying to isolate the VXLAN traffic from different VMs with VRF as shown
> in the schemas below:
> 
> +-------------------------+   +----------------------------+
> | +----------+            |   |     +------------+         |
> | |          |            |   |     |            |         |
> | | tap-red  |            |   |     |  tap-blue  |         |
> | |          |            |   |     |            |         |
> | +----+-----+            |   |     +-----+------+         |
> |      |                  |   |           |                |
> |      |                  |   |           |                |
> | +----+---+              |   |      +----+----+           |
> | |        |              |   |      |         |           |
> | | br-red |              |   |      | br-blue |           |
> | |        |              |   |      |         |           |
> | +----+---+              |   |      +----+----+           |
> |      |                  |   |           |                |
> |      |                  |   |           |                |
> |      |                  |   |           |                |
> | +----+--------+         |   |     +--------------+       |
> | |             |         |   |     |              |       |
> | |  vxlan-red  |         |   |     |  vxlan-blue  |       |
> | |             |         |   |     |              |       |
> | +------+------+         |   |     +-------+------+       |
> |        |                |   |             |              |
> |        |           VRF  |   |             |          VRF |
> |        |           red  |   |             |         blue |
> +-------------------------+   +----------------------------+

Roopa and I were discussing this setup and are puzzled by the VRF
association here. Does br-red and br-blue have an address? The commands
below do not show it and from our perspective seems odd for this
scenario. If it does not have an address, then there is no reason for
the VRF labeling.

Also, it would be good to have a unit test this case. Can you create a
shell script that creates the setup and runs a few tests verifying
connectivity? You can use network namespaces and veth pairs in place of
the VM with a tap device. From there the functionality is the same.
Tests can be initial VRF association for the vxlan lower device,
changing the VRF to another device, and then changing again back to
default VRF - checking proper connectivity for each.

Thanks

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 0/3] Add VRF support for VXLAN underlay
  2018-11-20 21:45 ` David Ahern
@ 2018-11-21 13:30   ` Alexis Bauvin
  2018-11-21 19:26     ` David Ahern
  0 siblings, 1 reply; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-21 13:30 UTC (permalink / raw)
  To: David Ahern, roopa; +Cc: netdev, akherbouche, Alexis Bauvin

Le 20 nov. 2018 à 22:45, David Ahern <dsa@cumulusnetworks.com> a écrit :
> 
> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
>> We are trying to isolate the VXLAN traffic from different VMs with VRF as shown
>> in the schemas below:
>> 
>> +-------------------------+   +----------------------------+
>> | +----------+            |   |     +------------+         |
>> | |          |            |   |     |            |         |
>> | | tap-red  |            |   |     |  tap-blue  |         |
>> | |          |            |   |     |            |         |
>> | +----+-----+            |   |     +-----+------+         |
>> |      |                  |   |           |                |
>> |      |                  |   |           |                |
>> | +----+---+              |   |      +----+----+           |
>> | |        |              |   |      |         |           |
>> | | br-red |              |   |      | br-blue |           |
>> | |        |              |   |      |         |           |
>> | +----+---+              |   |      +----+----+           |
>> |      |                  |   |           |                |
>> |      |                  |   |           |                |
>> |      |                  |   |           |                |
>> | +----+--------+         |   |     +--------------+       |
>> | |             |         |   |     |              |       |
>> | |  vxlan-red  |         |   |     |  vxlan-blue  |       |
>> | |             |         |   |     |              |       |
>> | +------+------+         |   |     +-------+------+       |
>> |        |                |   |             |              |
>> |        |           VRF  |   |             |          VRF |
>> |        |           red  |   |             |         blue |
>> +-------------------------+   +----------------------------+
> 
> Roopa and I were discussing this setup and are puzzled by the VRF
> association here. Does br-red and br-blue have an address? The commands
> below do not show it and from our perspective seems odd for this
> scenario. If it does not have an address, then there is no reason for
> the VRF labeling.

Yes, br-red and br-blue have an address. Addresses (both MAC and IP)
are anycast addresses, to make an anycast gateway for the VMs behind
the taps. This serves for a classical evpn symmetric distributed
routing, both between vxlans and towards external networks. I am using
FRR as a control plane to exchange bgp-evpn information between
hypervisors.

The schematic could have been more complete, including more bridges
and more taps in the vrf, with an added bridge+vxlan for the l3vni.
I did not include them as I wanted to focus on the underlay itself
and not on what vxlan was used for.

Here is a more complete example:
                                   +---------+
                                   |         |
                                   | vrf1000 |
                                   |         |
                                   +--+-+-+--+
                                      | | |
                +---------------------+ | +---------------+
                |                       |                 |
         +------+------+         +------+------+     +----+---+
         |             |         |             |     |        |
         |     br0     |         |     br1     |     | br1000 |
         | 10.0.0.1/24 |         | 10.0.1.1/24 |     |        |
         |             |         |             |     +----+---+
         +----+-+-+----+         +-----+-+-----+          |
              | | |                    | |                |
      +-------+ | +-------+         +--+ +---+            |
      |         |         |         |        |            |
   +--+---+ +---+--+ +----+---+ +---+--+ +---+----+ +-----+-----+
   |      | |      | |        | |      | |        | |           |
   | tap0 | | tap1 | | vxlan0 | | tap2 | | vxlan1 | | vxlan1000 |
   |      | |      | |        | |      | |        | |           |
   +------+ +------+ +--------+ +------+ +--------+ +-----------+
                          .                   .           .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.                    +---------+                  +--------------+
.                    |         |                  |              |
.                    | vrf1001 |                  | vrf-underlay |
.                    |         |                  |              |
.                    +---+-+---+                  +-------+------+
.                        | |                              |
.              +---------+ +----------+                   |
.              |                      |                   |
.       +------+------+          +----+---+               |
.       |             |          |        |               |
.       |     br2     |          | br1001 |               |
.       | 10.0.2.1/24 |          |        |               |
.       |             |          +----+---+               |
.       +----+-+-+----+               |                   |
.            | | |                    |                   |
.      +-----+ | +-------+            |                   |
.      |       |         |            |           +-------+-------+
.  +---+--+ +--+---+ +---+----+ +-----+-----+     |               |
.  |      | |      | |        | |           |     |   eth0.2030   |
.  | tap3 | | tap4 | | vxlan2 | | vxlan1001 |     | 172.16.0.2/24 |
.  |      | |      | |        | |           |     |               |
.  +------+ +------+ +--------+ +-----------+     +---------------+
.                         .           .                 ^  .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .  .
           (vxlan lower device is eth0.2030)               v
                                                 +----------------+
                                                 |                |
                                                 |      eth0      |
                                                 | 192.168.1.2/24 |
                                                 |                |
                                                 +----------------+

In this example, the underlay is rather simple. A more complex case
could be with several uplinks in the underlay vrf, and a dummy/loopback
with a /32 ip, used as a vxlan lower device and to bind routing daemons
(e.g. bgpd).

                  +--------------+
                  |              |
                  | vrf-underlay |
                  |              |
                  +-----+-+-+----+
                        | | |
        +---------------+ | +---------------+
        |                 |                 |
+-------+-------+ +-------+-------+ +-------+--------+
|               | |               | |                |
|     eth1      | |     eth2      | |     dummy0     |
| 172.16.0.2/31 | | 172.16.0.4/31 | | 192.168.0.1/32 |
|               | |               | |                |
+---------------+ +---------------+ +----------------+
                                            .
. . . . . . . . . . . . . . . . . . . . . . .
        (vxlan lower device is dummy0)

As for default vrf, it would contain for example management traffic. We
don’t want to mix the underlay routing tables to other traffics.

> Also, it would be good to have a unit test this case. Can you create a
> shell script that creates the setup and runs a few tests verifying
> connectivity? You can use network namespaces and veth pairs in place of
> the VM with a tap device. From there the functionality is the same.
> Tests can be initial VRF association for the vxlan lower device,
> changing the VRF to another device, and then changing again back to
> default VRF - checking proper connectivity for each.

Sure! I’ve just finished writing it, but I am not sure of the best way
so send it. I guess I will have to add it to a v4 of the patches, in
tools/testing/selftests/net?

> Thanks

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-20 17:09         ` David Ahern
@ 2018-11-21 14:05           ` Alexis Bauvin
  2018-11-21 19:28             ` David Ahern
  0 siblings, 1 reply; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-21 14:05 UTC (permalink / raw)
  To: David Ahern; +Cc: Roopa Prabhu, netdev, akherbouche

Le 20 nov. 2018 à 18:09, David Ahern <dsa@cumulusnetworks.com> a écrit :
> On 11/20/18 9:58 AM, Alexis Bauvin wrote:
>> A socket bound to vrf-blue listens on *:4789, thus owning the port. If moving an
>> underlay to the default vrf (ip link set dummy-b nomaster), a new socket will be
>> created, unbound to any interface and listening on *:4789. However, because it
>> will be in the default vrf, it will try to take ownership of port 4789 on ALL
>> vrfs, and fail because this port is already owned in vrf-blue for vxlan-a.
> 
> SO_REUSEPORT will fix that and incoming traffic through a vrf and
> default (non-)vrf will work. The recent changes by Vyatta provide even
> better isolation of default vrf and overlapping ports.

Did not think about that one, I will give it a shot.

There is one issue I can see with SO_REUSEPORT (if my understanding of it is
correct). From what I understood, enabling this option will balance incoming
connections (for TCP) / dgrams (for UDP) based on a 4-tuple hash (sip, dip,
sport, dport) between sockets listening on the same port.

If we have two separate vxlan fabrics, with one underlay in the default vrf, and
another in some random vrf. Since the socket for the default vrf would own the
port on all vrfs, the port would effectively be reused between the two vrfs.
Wouldn't vxlan packets be directed to "random" (as in not related to the vxlan
fabric itself) sockets, meaning a complete mix of the two fabrics? This would
imply a complete drop of the packets not directed to the correct socket.

I guess the Vyatta changes you are talking about are "vrf: allow simultaneous
service instances in default and other VRFs"? If so, it looks like it would
solve the default vrf problem, not even requiring SO_REUSEPORT.

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 0/3] Add VRF support for VXLAN underlay
  2018-11-21 13:30   ` Alexis Bauvin
@ 2018-11-21 19:26     ` David Ahern
  2018-11-22  0:47       ` Alexis Bauvin
  0 siblings, 1 reply; 26+ messages in thread
From: David Ahern @ 2018-11-21 19:26 UTC (permalink / raw)
  To: Alexis Bauvin, roopa; +Cc: netdev, akherbouche

On 11/21/18 6:30 AM, Alexis Bauvin wrote:
> Le 20 nov. 2018 à 22:45, David Ahern <dsa@cumulusnetworks.com> a écrit :
>>
>> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
>>> We are trying to isolate the VXLAN traffic from different VMs with VRF as shown
>>> in the schemas below:
>>>
>>> +-------------------------+   +----------------------------+
>>> | +----------+            |   |     +------------+         |
>>> | |          |            |   |     |            |         |
>>> | | tap-red  |            |   |     |  tap-blue  |         |
>>> | |          |            |   |     |            |         |
>>> | +----+-----+            |   |     +-----+------+         |
>>> |      |                  |   |           |                |
>>> |      |                  |   |           |                |
>>> | +----+---+              |   |      +----+----+           |
>>> | |        |              |   |      |         |           |
>>> | | br-red |              |   |      | br-blue |           |
>>> | |        |              |   |      |         |           |
>>> | +----+---+              |   |      +----+----+           |
>>> |      |                  |   |           |                |
>>> |      |                  |   |           |                |
>>> |      |                  |   |           |                |
>>> | +----+--------+         |   |     +--------------+       |
>>> | |             |         |   |     |              |       |
>>> | |  vxlan-red  |         |   |     |  vxlan-blue  |       |
>>> | |             |         |   |     |              |       |
>>> | +------+------+         |   |     +-------+------+       |
>>> |        |                |   |             |              |
>>> |        |           VRF  |   |             |          VRF |
>>> |        |           red  |   |             |         blue |
>>> +-------------------------+   +----------------------------+
>>
>> Roopa and I were discussing this setup and are puzzled by the VRF
>> association here. Does br-red and br-blue have an address? The commands
>> below do not show it and from our perspective seems odd for this
>> scenario. If it does not have an address, then there is no reason for
>> the VRF labeling.
> 
> Yes, br-red and br-blue have an address. Addresses (both MAC and IP)
> are anycast addresses, to make an anycast gateway for the VMs behind
> the taps. This serves for a classical evpn symmetric distributed
> routing, both between vxlans and towards external networks. I am using
> FRR as a control plane to exchange bgp-evpn information between
> hypervisors.
> 
> The schematic could have been more complete, including more bridges
> and more taps in the vrf, with an added bridge+vxlan for the l3vni.
> I did not include them as I wanted to focus on the underlay itself
> and not on what vxlan was used for.
> 
> Here is a more complete example:
>                                    +---------+
>                                    |         |
>                                    | vrf1000 |
>                                    |         |
>                                    +--+-+-+--+
>                                       | | |
>                 +---------------------+ | +---------------+
>                 |                       |                 |
>          +------+------+         +------+------+     +----+---+
>          |             |         |             |     |        |
>          |     br0     |         |     br1     |     | br1000 |
>          | 10.0.0.1/24 |         | 10.0.1.1/24 |     |        |
>          |             |         |             |     +----+---+
>          +----+-+-+----+         +-----+-+-----+          |
>               | | |                    | |                |
>       +-------+ | +-------+         +--+ +---+            |
>       |         |         |         |        |            |
>    +--+---+ +---+--+ +----+---+ +---+--+ +---+----+ +-----+-----+
>    |      | |      | |        | |      | |        | |           |
>    | tap0 | | tap1 | | vxlan0 | | tap2 | | vxlan1 | | vxlan1000 |
>    |      | |      | |        | |      | |        | |           |
>    +------+ +------+ +--------+ +------+ +--------+ +-----------+
>                           .                   .           .
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> .
> .                    +---------+                  +--------------+
> .                    |         |                  |              |
> .                    | vrf1001 |                  | vrf-underlay |
> .                    |         |                  |              |
> .                    +---+-+---+                  +-------+------+
> .                        | |                              |
> .              +---------+ +----------+                   |
> .              |                      |                   |
> .       +------+------+          +----+---+               |
> .       |             |          |        |               |
> .       |     br2     |          | br1001 |               |
> .       | 10.0.2.1/24 |          |        |               |
> .       |             |          +----+---+               |
> .       +----+-+-+----+               |                   |
> .            | | |                    |                   |
> .      +-----+ | +-------+            |                   |
> .      |       |         |            |           +-------+-------+
> .  +---+--+ +--+---+ +---+----+ +-----+-----+     |               |
> .  |      | |      | |        | |           |     |   eth0.2030   |
> .  | tap3 | | tap4 | | vxlan2 | | vxlan1001 |     | 172.16.0.2/24 |
> .  |      | |      | |        | |           |     |               |
> .  +------+ +------+ +--------+ +-----------+     +---------------+
> .                         .           .                 ^  .
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  .
>            (vxlan lower device is eth0.2030)               v
>                                                  +----------------+
>                                                  |                |
>                                                  |      eth0      |
>                                                  | 192.168.1.2/24 |
>                                                  |                |
>                                                  +----------------+
> 
> In this example, the underlay is rather simple. A more complex case
> could be with several uplinks in the underlay vrf, and a dummy/loopback
> with a /32 ip, used as a vxlan lower device and to bind routing daemons
> (e.g. bgpd).
> 
>                   +--------------+
>                   |              |
>                   | vrf-underlay |
>                   |              |
>                   +-----+-+-+----+
>                         | | |
>         +---------------+ | +---------------+
>         |                 |                 |
> +-------+-------+ +-------+-------+ +-------+--------+
> |               | |               | |                |
> |     eth1      | |     eth2      | |     dummy0     |
> | 172.16.0.2/31 | | 172.16.0.4/31 | | 192.168.0.1/32 |
> |               | |               | |                |
> +---------------+ +---------------+ +----------------+
>                                             .
> . . . . . . . . . . . . . . . . . . . . . . .
>         (vxlan lower device is dummy0)

Thanks for sending a more complete example. I forwarded it to some FRR
folks as well.

> 
> As for default vrf, it would contain for example management traffic. We
> don’t want to mix the underlay routing tables to other traffics.

Management VRF? :-)


> 
>> Also, it would be good to have a unit test this case. Can you create a
>> shell script that creates the setup and runs a few tests verifying
>> connectivity? You can use network namespaces and veth pairs in place of
>> the VM with a tap device. From there the functionality is the same.
>> Tests can be initial VRF association for the vxlan lower device,
>> changing the VRF to another device, and then changing again back to
>> default VRF - checking proper connectivity for each.
> 
> Sure! I’ve just finished writing it, but I am not sure of the best way
> so send it. I guess I will have to add it to a v4 of the patches, in
> tools/testing/selftests/net?

yes, that would be good. Thank you for writing it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-21 14:05           ` Alexis Bauvin
@ 2018-11-21 19:28             ` David Ahern
  2018-11-22  0:54               ` Alexis Bauvin
  0 siblings, 1 reply; 26+ messages in thread
From: David Ahern @ 2018-11-21 19:28 UTC (permalink / raw)
  To: Alexis Bauvin; +Cc: Roopa Prabhu, netdev, akherbouche

On 11/21/18 7:05 AM, Alexis Bauvin wrote:
> Le 20 nov. 2018 à 18:09, David Ahern <dsa@cumulusnetworks.com> a écrit :
>> On 11/20/18 9:58 AM, Alexis Bauvin wrote:
>>> A socket bound to vrf-blue listens on *:4789, thus owning the port. If moving an
>>> underlay to the default vrf (ip link set dummy-b nomaster), a new socket will be
>>> created, unbound to any interface and listening on *:4789. However, because it
>>> will be in the default vrf, it will try to take ownership of port 4789 on ALL
>>> vrfs, and fail because this port is already owned in vrf-blue for vxlan-a.
>>
>> SO_REUSEPORT will fix that and incoming traffic through a vrf and
>> default (non-)vrf will work. The recent changes by Vyatta provide even
>> better isolation of default vrf and overlapping ports.
> 
> Did not think about that one, I will give it a shot.
> 
> There is one issue I can see with SO_REUSEPORT (if my understanding of it is
> correct). From what I understood, enabling this option will balance incoming
> connections (for TCP) / dgrams (for UDP) based on a 4-tuple hash (sip, dip,
> sport, dport) between sockets listening on the same port.

AFAIK there is no balancing done. There is an order to which socket is
selected - and it includes the VRF device if relevant.

> 
> If we have two separate vxlan fabrics, with one underlay in the default vrf, and
> another in some random vrf. Since the socket for the default vrf would own the
> port on all vrfs, the port would effectively be reused between the two vrfs.
> Wouldn't vxlan packets be directed to "random" (as in not related to the vxlan
> fabric itself) sockets, meaning a complete mix of the two fabrics? This would
> imply a complete drop of the packets not directed to the correct socket.
> 
> I guess the Vyatta changes you are talking about are "vrf: allow simultaneous
> service instances in default and other VRFs"? If so, it looks like it would
> solve the default vrf problem, not even requiring SO_REUSEPORT.
> 

yes.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 0/3] Add VRF support for VXLAN underlay
  2018-11-21 19:26     ` David Ahern
@ 2018-11-22  0:47       ` Alexis Bauvin
  0 siblings, 0 replies; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-22  0:47 UTC (permalink / raw)
  To: David Ahern, roopa; +Cc: netdev, akherbouche

Le 21 nov. 2018 à 20:26, David Ahern <dsa@cumulusnetworks.com> a écrit :
> 
> On 11/21/18 6:30 AM, Alexis Bauvin wrote:
>> Le 20 nov. 2018 à 22:45, David Ahern <dsa@cumulusnetworks.com> a écrit :
>>> 
>>> On 11/20/18 7:23 AM, Alexis Bauvin wrote:
>>>> We are trying to isolate the VXLAN traffic from different VMs with VRF as shown
>>>> in the schemas below:
>>>> 
>>>> +-------------------------+   +----------------------------+
>>>> | +----------+            |   |     +------------+         |
>>>> | |          |            |   |     |            |         |
>>>> | | tap-red  |            |   |     |  tap-blue  |         |
>>>> | |          |            |   |     |            |         |
>>>> | +----+-----+            |   |     +-----+------+         |
>>>> |      |                  |   |           |                |
>>>> |      |                  |   |           |                |
>>>> | +----+---+              |   |      +----+----+           |
>>>> | |        |              |   |      |         |           |
>>>> | | br-red |              |   |      | br-blue |           |
>>>> | |        |              |   |      |         |           |
>>>> | +----+---+              |   |      +----+----+           |
>>>> |      |                  |   |           |                |
>>>> |      |                  |   |           |                |
>>>> |      |                  |   |           |                |
>>>> | +----+--------+         |   |     +--------------+       |
>>>> | |             |         |   |     |              |       |
>>>> | |  vxlan-red  |         |   |     |  vxlan-blue  |       |
>>>> | |             |         |   |     |              |       |
>>>> | +------+------+         |   |     +-------+------+       |
>>>> |        |                |   |             |              |
>>>> |        |           VRF  |   |             |          VRF |
>>>> |        |           red  |   |             |         blue |
>>>> +-------------------------+   +----------------------------+
>>> 
>>> Roopa and I were discussing this setup and are puzzled by the VRF
>>> association here. Does br-red and br-blue have an address? The commands
>>> below do not show it and from our perspective seems odd for this
>>> scenario. If it does not have an address, then there is no reason for
>>> the VRF labeling.
>> 
>> Yes, br-red and br-blue have an address. Addresses (both MAC and IP)
>> are anycast addresses, to make an anycast gateway for the VMs behind
>> the taps. This serves for a classical evpn symmetric distributed
>> routing, both between vxlans and towards external networks. I am using
>> FRR as a control plane to exchange bgp-evpn information between
>> hypervisors.
>> 
>> The schematic could have been more complete, including more bridges
>> and more taps in the vrf, with an added bridge+vxlan for the l3vni.
>> I did not include them as I wanted to focus on the underlay itself
>> and not on what vxlan was used for.
>> 
>> Here is a more complete example:
>>                                   +---------+
>>                                   |         |
>>                                   | vrf1000 |
>>                                   |         |
>>                                   +--+-+-+--+
>>                                      | | |
>>                +---------------------+ | +---------------+
>>                |                       |                 |
>>         +------+------+         +------+------+     +----+---+
>>         |             |         |             |     |        |
>>         |     br0     |         |     br1     |     | br1000 |
>>         | 10.0.0.1/24 |         | 10.0.1.1/24 |     |        |
>>         |             |         |             |     +----+---+
>>         +----+-+-+----+         +-----+-+-----+          |
>>              | | |                    | |                |
>>      +-------+ | +-------+         +--+ +---+            |
>>      |         |         |         |        |            |
>>   +--+---+ +---+--+ +----+---+ +---+--+ +---+----+ +-----+-----+
>>   |      | |      | |        | |      | |        | |           |
>>   | tap0 | | tap1 | | vxlan0 | | tap2 | | vxlan1 | | vxlan1000 |
>>   |      | |      | |        | |      | |        | |           |
>>   +------+ +------+ +--------+ +------+ +--------+ +-----------+
>>                          .                   .           .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> .
>> .                    +---------+                  +--------------+
>> .                    |         |                  |              |
>> .                    | vrf1001 |                  | vrf-underlay |
>> .                    |         |                  |              |
>> .                    +---+-+---+                  +-------+------+
>> .                        | |                              |
>> .              +---------+ +----------+                   |
>> .              |                      |                   |
>> .       +------+------+          +----+---+               |
>> .       |             |          |        |               |
>> .       |     br2     |          | br1001 |               |
>> .       | 10.0.2.1/24 |          |        |               |
>> .       |             |          +----+---+               |
>> .       +----+-+-+----+               |                   |
>> .            | | |                    |                   |
>> .      +-----+ | +-------+            |                   |
>> .      |       |         |            |           +-------+-------+
>> .  +---+--+ +--+---+ +---+----+ +-----+-----+     |               |
>> .  |      | |      | |        | |           |     |   eth0.2030   |
>> .  | tap3 | | tap4 | | vxlan2 | | vxlan1001 |     | 172.16.0.2/24 |
>> .  |      | |      | |        | |           |     |               |
>> .  +------+ +------+ +--------+ +-----------+     +---------------+
>> .                         .           .                 ^  .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  .
>>           (vxlan lower device is eth0.2030)               v
>>                                                 +----------------+
>>                                                 |                |
>>                                                 |      eth0      |
>>                                                 | 192.168.1.2/24 |
>>                                                 |                |
>>                                                 +----------------+
>> 
>> In this example, the underlay is rather simple. A more complex case
>> could be with several uplinks in the underlay vrf, and a dummy/loopback
>> with a /32 ip, used as a vxlan lower device and to bind routing daemons
>> (e.g. bgpd).
>> 
>>                  +--------------+
>>                  |              |
>>                  | vrf-underlay |
>>                  |              |
>>                  +-----+-+-+----+
>>                        | | |
>>        +---------------+ | +---------------+
>>        |                 |                 |
>> +-------+-------+ +-------+-------+ +-------+--------+
>> |               | |               | |                |
>> |     eth1      | |     eth2      | |     dummy0     |
>> | 172.16.0.2/31 | | 172.16.0.4/31 | | 192.168.0.1/32 |
>> |               | |               | |                |
>> +---------------+ +---------------+ +----------------+
>>                                            .
>> . . . . . . . . . . . . . . . . . . . . . . .
>>        (vxlan lower device is dummy0)
> 
> Thanks for sending a more complete example. I forwarded it to some FRR
> folks as well.

My pleasure.

>> 
>> As for default vrf, it would contain for example management traffic. We
>> don’t want to mix the underlay routing tables to other traffics.
> 
> Management VRF? :-)

Definitely possible, indeed. This would be the best solution, if only I *could*.
Reality is, people are still suspicious of networking guys. The least we touch
their servers, the better they are. For hypervisors we can, but the rest of
them are a mix and match, and I need to be as generic as possible.

Telling others to get off the default vrf for their own management softwares
(for example) if they want to integrate into what I’m developing is quite the
challenge. I need to be as plug-and-play as possible, and an underlay vrf is the
best option to keep the default to whatever it was used for.

(And as weird as it might be, it is easier for me to tell them to upgrade to a
mainline kernel than to rewrite their networking setup)

>> 
>>> Also, it would be good to have a unit test this case. Can you create a
>>> shell script that creates the setup and runs a few tests verifying
>>> connectivity? You can use network namespaces and veth pairs in place of
>>> the VM with a tap device. From there the functionality is the same.
>>> Tests can be initial VRF association for the vxlan lower device,
>>> changing the VRF to another device, and then changing again back to
>>> default VRF - checking proper connectivity for each.
>> 
>> Sure! I’ve just finished writing it, but I am not sure of the best way
>> so send it. I guess I will have to add it to a v4 of the patches, in
>> tools/testing/selftests/net?
> 
> yes, that would be good. Thank you for writing it.

Will be sent in next patch series.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-21 19:28             ` David Ahern
@ 2018-11-22  0:54               ` Alexis Bauvin
  2018-11-22  1:48                 ` David Ahern
  0 siblings, 1 reply; 26+ messages in thread
From: Alexis Bauvin @ 2018-11-22  0:54 UTC (permalink / raw)
  To: David Ahern, Roopa Prabhu; +Cc: netdev, akherbouche

Le 21 nov. 2018 à 20:28, David Ahern <dsa@cumulusnetworks.com> a écrit :
> On 11/21/18 7:05 AM, Alexis Bauvin wrote:
>> Le 20 nov. 2018 à 18:09, David Ahern <dsa@cumulusnetworks.com> a écrit :
>>> On 11/20/18 9:58 AM, Alexis Bauvin wrote:
>>>> A socket bound to vrf-blue listens on *:4789, thus owning the port. If moving an
>>>> underlay to the default vrf (ip link set dummy-b nomaster), a new socket will be
>>>> created, unbound to any interface and listening on *:4789. However, because it
>>>> will be in the default vrf, it will try to take ownership of port 4789 on ALL
>>>> vrfs, and fail because this port is already owned in vrf-blue for vxlan-a.
>>> 
>>> SO_REUSEPORT will fix that and incoming traffic through a vrf and
>>> default (non-)vrf will work. The recent changes by Vyatta provide even
>>> better isolation of default vrf and overlapping ports.
>> 
>> Did not think about that one, I will give it a shot.
>> 
>> There is one issue I can see with SO_REUSEPORT (if my understanding of it is
>> correct). From what I understood, enabling this option will balance incoming
>> connections (for TCP) / dgrams (for UDP) based on a 4-tuple hash (sip, dip,
>> sport, dport) between sockets listening on the same port.
> 
> AFAIK there is no balancing done. There is an order to which socket is
> selected - and it includes the VRF device if relevant.

Maybe balance was not the correct word, "route" may be more appropriate. Still you
understood me, thanks for the details!

Yet, the "if relevant" part is interesting. Does enabling the
net.ipv4.udp_l3mdev_accept sysctl counts as making vrfs not releavant? In that case,
both sockets are treated equally, right?

>> 
>> If we have two separate vxlan fabrics, with one underlay in the default vrf, and
>> another in some random vrf. Since the socket for the default vrf would own the
>> port on all vrfs, the port would effectively be reused between the two vrfs.
>> Wouldn't vxlan packets be directed to "random" (as in not related to the vxlan
>> fabric itself) sockets, meaning a complete mix of the two fabrics? This would
>> imply a complete drop of the packets not directed to the correct socket.
>> 
>> I guess the Vyatta changes you are talking about are "vrf: allow simultaneous
>> service instances in default and other VRFs"? If so, it looks like it would
>> solve the default vrf problem, not even requiring SO_REUSEPORT.
>> 
> 
> yes.

So it would seem with the above that SO_REUSEPORT is a solution when not using
Vyatta's changes, which are a solution themselves (I will test those anyways).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v3 3/3] vxlan: handle underlay VRF changes
  2018-11-22  0:54               ` Alexis Bauvin
@ 2018-11-22  1:48                 ` David Ahern
  0 siblings, 0 replies; 26+ messages in thread
From: David Ahern @ 2018-11-22  1:48 UTC (permalink / raw)
  To: Alexis Bauvin, Roopa Prabhu; +Cc: netdev, akherbouche

On 11/21/18 5:54 PM, Alexis Bauvin wrote:
>>> There is one issue I can see with SO_REUSEPORT (if my understanding of it is
>>> correct). From what I understood, enabling this option will balance incoming
>>> connections (for TCP) / dgrams (for UDP) based on a 4-tuple hash (sip, dip,
>>> sport, dport) between sockets listening on the same port.
>>
>> AFAIK there is no balancing done. There is an order to which socket is
>> selected - and it includes the VRF device if relevant.
> 
> Maybe balance was not the correct word, "route" may be more appropriate. Still you
> understood me, thanks for the details!
> 
> Yet, the "if relevant" part is interesting. Does enabling the
> net.ipv4.udp_l3mdev_accept sysctl counts as making vrfs not releavant? In that case,
> both sockets are treated equally, right?

If udp_l3mdev_accept is disabled the default VRF is treated like a  real
VRF as opposed to no VRF (Vyatta's changes), meaning the scope of the
socket is just the VRF (or just the default VRF).

If udp_l3mdev_accept is enabled it allows an unbound UDP socket to work
across all VRFs.

Gives user's options for how to deploy their s/w especially when it
comes to all of the existing software that is has no idea about VRFs.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2018-11-22 12:25 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-20 14:23 [RFC v3 0/3] Add VRF support for VXLAN underlay Alexis Bauvin
2018-11-20 14:23 ` [RFC v3 1/3] udp_tunnel: add config option to bind to a device Alexis Bauvin
2018-11-20 14:23 ` [RFC v3 2/3] vxlan: add support for underlay in non-default VRF Alexis Bauvin
2018-11-20 14:57   ` David Ahern
2018-11-20 16:11     ` Alexis Bauvin
2018-11-20 15:25   ` Roopa Prabhu
2018-11-20 16:14     ` Alexis Bauvin
2018-11-20 14:23 ` [RFC v3 3/3] vxlan: handle underlay VRF changes Alexis Bauvin
2018-11-20 15:04   ` David Ahern
2018-11-20 15:35     ` Roopa Prabhu
2018-11-20 15:48       ` David Ahern
2018-11-20 16:13         ` David Ahern
2018-11-20 18:19           ` Alexis Bauvin
2018-11-20 16:58       ` Alexis Bauvin
2018-11-20 17:09         ` David Ahern
2018-11-21 14:05           ` Alexis Bauvin
2018-11-21 19:28             ` David Ahern
2018-11-22  0:54               ` Alexis Bauvin
2018-11-22  1:48                 ` David Ahern
2018-11-20 16:27     ` Alexis Bauvin
2018-11-20 16:54       ` David Ahern
2018-11-20 14:48 ` [RFC v3 0/3] Add VRF support for VXLAN underlay David Ahern
2018-11-20 21:45 ` David Ahern
2018-11-21 13:30   ` Alexis Bauvin
2018-11-21 19:26     ` David Ahern
2018-11-22  0:47       ` Alexis Bauvin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.