All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/2]  Bareudp Tunnel Module
@ 2019-10-08  9:48 Martin Varghese
  2019-10-08  9:48 ` [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc Martin Varghese
  2019-10-08  9:49 ` [PATCH net-next 2/2] Special handling for IP & MPLS Martin Varghese
  0 siblings, 2 replies; 36+ messages in thread
From: Martin Varghese @ 2019-10-08  9:48 UTC (permalink / raw)
  To: netdev, davem, corbet, scott.drennan, jbenc, martin.varghese
  Cc: Martin Varghese

There are various L3 encapsulation standards using UDP being discussed to
leverage the UDP based load balancing capability of different networks.
MPLSoUDP (https://tools.ietf.org/html/rfc7510)is one among them.

The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
a UDP tunnel.

Special Handling
----------------
The bareudp device supports special handling for MPLS & IP as they can have
multiple ethertypes.
MPLS procotcol can have ethertypes 0x8847 (unicast) & 0x8847 (multicast).
IP proctocol can have ethertypes 0x0800 (v4) & 0x866 (v6).
This special handling can be enabled only for ethertype 0x0800 & 0x88847 with a
flag called extended mode.

Usage
------

1. Device creation & deletion

a. ip link add dev bareudp0 type bareudp dstport 6635 ethertype 0x8847

This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
0x8847 (MPLS traffic).The destination port of the UDP header will be set to 6635
The device will listen on UDP port 6635 to receive traffic.

b. ip link delete bareudp0

2. Device creation with extended mode enabled

There are two ways to create a bareudp device for MPLS & IP with extended mode
enabled

a. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype 0x8847 extmode 1

b. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype mpls

Note - iproute2 & Selftests are implemented in seperate patches.


Martin (2):
  UDP tunnel encapsulation module for tunnelling different protocols    
    like MPLS,IP,NSH etc.
  Special handling for IP & MPLS.

 Documentation/networking/bareudp.txt |  41 ++
 drivers/net/Kconfig                  |  13 +
 drivers/net/Makefile                 |   1 +
 drivers/net/bareudp.c                | 998 +++++++++++++++++++++++++++++++++++
 include/net/bareudp.h                |  20 +
 include/uapi/linux/if_link.h         |  13 +
 6 files changed, 1086 insertions(+)
 create mode 100644 Documentation/networking/bareudp.txt
 create mode 100644 drivers/net/bareudp.c
 create mode 100644 include/net/bareudp.h

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-08  9:48 [PATCH net-next 0/2] Bareudp Tunnel Module Martin Varghese
@ 2019-10-08  9:48 ` Martin Varghese
  2019-10-08 14:06   ` Jonathan Corbet
                     ` (4 more replies)
  2019-10-08  9:49 ` [PATCH net-next 2/2] Special handling for IP & MPLS Martin Varghese
  1 sibling, 5 replies; 36+ messages in thread
From: Martin Varghese @ 2019-10-08  9:48 UTC (permalink / raw)
  To: netdev, davem, corbet, scott.drennan, jbenc, martin.varghese
  Cc: Martin Varghese

From: Martin <martin.varghese@nokia.com>

The Bareudp tunnel module provides a generic L3 encapsulation
tunnelling module for tunnelling different protocols like MPLS,
IP,NSH etc inside a UDP tunnel.

Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
---
 Documentation/networking/bareudp.txt |  23 +
 drivers/net/Kconfig                  |  13 +
 drivers/net/Makefile                 |   1 +
 drivers/net/bareudp.c                | 930 +++++++++++++++++++++++++++++++++++
 include/net/bareudp.h                |  19 +
 include/uapi/linux/if_link.h         |  12 +
 6 files changed, 998 insertions(+)
 create mode 100644 Documentation/networking/bareudp.txt
 create mode 100644 drivers/net/bareudp.c
 create mode 100644 include/net/bareudp.h

diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
new file mode 100644
index 0000000..d2530e2
--- /dev/null
+++ b/Documentation/networking/bareudp.txt
@@ -0,0 +1,23 @@
+Bare UDP Tunnelling Module Documentation
+========================================
+
+There are various L3 encapsulation standards using UDP being discussed to
+leverage the UDP based load balancing capability of different networks.
+MPLSoUDP (https://tools.ietf.org/html/rfc7510)is one among them.
+
+The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
+support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
+a UDP tunnel.
+
+Usage
+------
+
+1. Device creation & deletion
+
+a. ip link add dev bareudp0 type bareudp dstport 6635 ethertype 0x8847
+
+This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
+0x8847 (MPLS traffic).The destination port of the UDP header will be set to 6635
+The device will listen on UDP port 6635 to receive traffic.
+
+b. ip link delete bareudp0
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 48e209e..a389fac 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -215,6 +215,19 @@ config GENEVE
 	  To compile this driver as a module, choose M here: the module
 	  will be called geneve.
 
+config BAREUDP
+       tristate "Bare UDP  Encapsulation"
+       depends on INET && NET_UDP_TUNNEL
+       depends on IPV6 || !IPV6
+       select NET_IP_TUNNEL
+       select GRO_CELLS
+       help
+          This adds a bare udp tunnel module for tunnelling different
+          kind of traffic like MPLS, IP, etc. inside a UDP tunnel.
+
+          To compile this driver as a module, choose M here: the module
+          will be called bareudp.
+
 config GTP
 	tristate "GPRS Tunneling Protocol datapath (GTP-U)"
 	depends on INET
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 0d3ba05..0bb7de5 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
+obj-$(CONFIG_BAREUDP) += bareudp.o
 obj-$(CONFIG_GTP) += gtp.o
 obj-$(CONFIG_NLMON) += nlmon.o
 obj-$(CONFIG_NET_VRF) += vrf.o
diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
new file mode 100644
index 0000000..7e6813a
--- /dev/null
+++ b/drivers/net/bareudp.c
@@ -0,0 +1,930 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Bareudp: UDP  tunnel encasulation for different Payload types like
+ * MPLS, NSH, IP, etc.
+ * Copyright (c) 2019 Nokia, Inc.
+ * Authors:  Martin Varghese, <martin.varghese@nokia.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/etherdevice.h>
+#include <linux/hash.h>
+#include <net/dst_metadata.h>
+#include <net/gro_cells.h>
+#include <net/rtnetlink.h>
+#include <net/protocol.h>
+#include <net/udp_tunnel.h>
+#include <net/bareudp.h>
+
+#define BAREUDP_BASE_HLEN sizeof(struct udphdr)
+#define BAREUDP_IPV4_HLEN (ETH_HLEN + sizeof(struct iphdr) + \
+			   sizeof(struct udphdr))
+#define BAREUDP_IPV6_HLEN (ETH_HLEN + sizeof(struct ipv6hdr) + \
+			   sizeof(struct udphdr))
+
+static bool log_ecn_error = true;
+module_param(log_ecn_error, bool, 0644);
+MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
+
+/* per-network namespace private data for this module */
+
+static unsigned int bareudp_net_id;
+
+struct bareudp_net {
+	struct list_head        bareudp_list;
+};
+
+struct bareudp_sock {
+	struct socket           *sock;
+	struct rcu_head         rcu;
+	struct bareudp_dev      *bareudp;
+};
+
+/* Pseudo network device */
+struct bareudp_dev {
+	struct net         *net;        /* netns for packet i/o */
+	struct net_device  *dev;        /* netdev for bareudp tunnel */
+	__be16		   ethertype;
+	u16	           sport_min;
+	struct bareudp_conf conf;
+	struct bareudp_sock __rcu *sock4; /* IPv4 socket for bareudp tunnel */
+#if IS_ENABLED(CONFIG_IPV6)
+	struct bareudp_sock __rcu *sock6; /* IPv6 socket for bareudp tunnel */
+#endif
+	struct list_head   next;        /* bareudp node  on namespace list */
+	struct gro_cells   gro_cells;
+};
+
+static sa_family_t bareudp_get_sk_family(struct bareudp_sock *bs)
+{
+	return bs->sock->sk->sk_family;
+}
+
+static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
+{
+	struct bareudp_sock *bs;
+	struct ethhdr *eh;
+	struct bareudp_dev *bareudp;
+	struct metadata_dst *tun_dst = NULL;
+	struct pcpu_sw_netstats *stats;
+	unsigned int len;
+	int err = 0;
+	void *oiph;
+	u16 proto;
+
+	if (unlikely(!pskb_may_pull(skb, BAREUDP_BASE_HLEN)))
+		goto drop;
+
+	bs = rcu_dereference_sk_user_data(sk);
+	if (!bs)
+		goto drop;
+
+	bareudp = bs->bareudp;
+	proto = bareudp->ethertype;
+
+	if (iptunnel_pull_header(skb, BAREUDP_BASE_HLEN,
+				 proto,
+				 !net_eq(bareudp->net,
+					 dev_net(bareudp->dev)))) {
+		bareudp->dev->stats.rx_dropped++;
+		goto drop;
+	}
+	tun_dst = udp_tun_rx_dst(skb, bareudp_get_sk_family(bs), TUNNEL_KEY,
+				 0, 0);
+	if (!tun_dst) {
+		bareudp->dev->stats.rx_dropped++;
+		goto drop;
+	}
+	skb_dst_set(skb, &tun_dst->dst);
+
+	skb_push(skb, sizeof(struct ethhdr));
+	eh = (struct ethhdr *)skb->data;
+	eh->h_proto = proto;
+
+	skb_reset_mac_header(skb);
+	skb->protocol = eth_type_trans(skb, bareudp->dev);
+	skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+	oiph = skb_network_header(skb);
+	skb_reset_network_header(skb);
+
+	if (bareudp_get_sk_family(bs) == AF_INET)
+		err = IP_ECN_decapsulate(oiph, skb);
+#if IS_ENABLED(CONFIG_IPV6)
+	else
+		err = IP6_ECN_decapsulate(oiph, skb);
+#endif
+
+	if (unlikely(err)) {
+		if (log_ecn_error) {
+			if (bareudp_get_sk_family(bs) == AF_INET)
+				net_info_ratelimited("non-ECT from %pI4 "
+						     "with TOS=%#x\n",
+						     &((struct iphdr *)oiph)->saddr,
+						     ((struct iphdr *)oiph)->tos);
+#if IS_ENABLED(CONFIG_IPV6)
+			else
+				net_info_ratelimited("non-ECT from %pI6\n",
+						     &((struct ipv6hdr *)oiph)->saddr);
+#endif
+		}
+		if (err > 1) {
+			++bareudp->dev->stats.rx_frame_errors;
+			++bareudp->dev->stats.rx_errors;
+			goto drop;
+		}
+	}
+
+	len = skb->len;
+	err = gro_cells_receive(&bareudp->gro_cells, skb);
+	if (likely(err == NET_RX_SUCCESS)) {
+		stats = this_cpu_ptr(bareudp->dev->tstats);
+		u64_stats_update_begin(&stats->syncp);
+		stats->rx_packets++;
+		stats->rx_bytes += len;
+		u64_stats_update_end(&stats->syncp);
+	}
+	return 0;
+drop:
+	/* Consume bad packet */
+	kfree_skb(skb);
+
+	return 0;
+}
+
+static struct socket *bareudp_create_sock(struct net *net, bool ipv6,
+					  __be16 port)
+{
+	struct socket *sock;
+	struct udp_port_cfg udp_conf;
+	int err;
+
+	memset(&udp_conf, 0, sizeof(udp_conf));
+
+	if (ipv6) {
+		udp_conf.family = AF_INET6;
+		udp_conf.ipv6_v6only = 1;
+	} else {
+		udp_conf.family = AF_INET;
+		udp_conf.local_ip.s_addr = htonl(INADDR_ANY);
+	}
+
+	udp_conf.local_udp_port = port;
+
+	/* Open UDP socket */
+	err = udp_sock_create(net, &udp_conf, &sock);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	return sock;
+}
+
+static int bareudp_err_lookup(struct sock *sk, struct sk_buff *skb)
+{
+	return 0;
+}
+
+/* Create new listen socket if needed */
+static struct bareudp_sock *bareudp_socket_create(struct net *net, __be16 port,
+						  bool ipv6)
+{
+	struct bareudp_sock *bs;
+	struct socket *sock;
+	struct udp_tunnel_sock_cfg tunnel_cfg;
+
+	bs = kzalloc(sizeof(*bs), GFP_KERNEL);
+	if (!bs)
+		return ERR_PTR(-ENOMEM);
+
+	sock = bareudp_create_sock(net, ipv6, port);
+	if (IS_ERR(sock)) {
+		kfree(bs);
+		return ERR_CAST(sock);
+	}
+
+	bs->sock = sock;
+
+	/* Mark socket as an encapsulation socket */
+	memset(&tunnel_cfg, 0, sizeof(tunnel_cfg));
+	tunnel_cfg.sk_user_data = bs;
+	tunnel_cfg.encap_type = 1;
+	tunnel_cfg.encap_rcv = bareudp_udp_encap_recv;
+	tunnel_cfg.encap_err_lookup = bareudp_err_lookup;
+	tunnel_cfg.encap_destroy = NULL;
+	setup_udp_tunnel_sock(net, sock, &tunnel_cfg);
+	return bs;
+}
+
+static int bareudp_sock_add(struct bareudp_dev *bareudp, bool ipv6)
+{
+	struct net *net = bareudp->net;
+	struct bareudp_sock *bs;
+
+	bs = bareudp_socket_create(net, bareudp->conf.port, ipv6);
+	if (IS_ERR(bs))
+		return PTR_ERR(bs);
+#if IS_ENABLED(CONFIG_IPV6)
+	if (ipv6)
+		rcu_assign_pointer(bareudp->sock6, bs);
+	else
+#endif
+		rcu_assign_pointer(bareudp->sock4, bs);
+
+	bs->bareudp = bareudp;
+
+	return 0;
+}
+
+static void __bareudp_sock_release(struct bareudp_sock *bs)
+{
+	if (!bs)
+		return;
+
+	udp_tunnel_sock_release(bs->sock);
+	kfree_rcu(bs, rcu);
+}
+
+static void bareudp_sock_release(struct bareudp_dev *bareudp)
+{
+	struct bareudp_sock *bs4 = rtnl_dereference(bareudp->sock4);
+#if IS_ENABLED(CONFIG_IPV6)
+	struct bareudp_sock *bs6 = rtnl_dereference(bareudp->sock6);
+
+	rcu_assign_pointer(bareudp->sock6, NULL);
+#endif
+
+	rcu_assign_pointer(bareudp->sock4, NULL);
+	synchronize_net();
+
+	__bareudp_sock_release(bs4);
+#if IS_ENABLED(CONFIG_IPV6)
+	__bareudp_sock_release(bs6);
+#endif
+}
+
+static struct rtable *bareudp_get_v4_rt(struct sk_buff *skb,
+					struct net_device *dev,
+					struct bareudp_sock *bs4,
+					struct flowi4 *fl4,
+					const struct ip_tunnel_info *info)
+{
+	bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+	struct dst_cache *dst_cache;
+	struct rtable *rt = NULL;
+	__u8 tos;
+
+	if (!bs4)
+		return ERR_PTR(-EIO);
+
+	memset(fl4, 0, sizeof(*fl4));
+	fl4->flowi4_mark = skb->mark;
+	fl4->flowi4_proto = IPPROTO_UDP;
+	fl4->daddr = info->key.u.ipv4.dst;
+	fl4->saddr = info->key.u.ipv4.src;
+
+	tos = info->key.tos;
+	fl4->flowi4_tos = RT_TOS(tos);
+
+	dst_cache = (struct dst_cache *)&info->dst_cache;
+	if (use_cache) {
+		rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
+		if (rt)
+			return rt;
+	}
+	rt = ip_route_output_key(bareudp->net, fl4);
+	if (IS_ERR(rt)) {
+		netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
+		return ERR_PTR(-ENETUNREACH);
+	}
+	if (rt->dst.dev == dev) { /* is this necessary? */
+		netdev_dbg(dev, "circular route to %pI4\n", &fl4->daddr);
+		ip_rt_put(rt);
+		return ERR_PTR(-ELOOP);
+	}
+	if (use_cache)
+		dst_cache_set_ip4(dst_cache, &rt->dst, fl4->saddr);
+	return rt;
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static struct dst_entry *bareudp_get_v6_dst(struct sk_buff *skb,
+					    struct net_device *dev,
+					    struct bareudp_sock *bs6,
+					    struct flowi6 *fl6,
+					    const struct ip_tunnel_info *info)
+{
+	bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+	struct dst_entry *dst = NULL;
+	struct dst_cache *dst_cache;
+	__u8 prio;
+
+	if (!bs6)
+		return ERR_PTR(-EIO);
+
+	memset(fl6, 0, sizeof(*fl6));
+	fl6->flowi6_mark = skb->mark;
+	fl6->flowi6_proto = IPPROTO_UDP;
+	fl6->daddr = info->key.u.ipv6.dst;
+	fl6->saddr = info->key.u.ipv6.src;
+	prio = info->key.tos;
+
+	fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
+					   info->key.label);
+	dst_cache = (struct dst_cache *)&info->dst_cache;
+	if (use_cache) {
+		dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
+		if (dst)
+			return dst;
+	}
+	if (ipv6_stub->ipv6_dst_lookup(bareudp->net, bs6->sock->sk, &dst,
+				       fl6)) {
+		netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
+		return ERR_PTR(-ENETUNREACH);
+	}
+	if (dst->dev == dev) { /* is this necessary? */
+		netdev_dbg(dev, "circular route to %pI6\n", &fl6->daddr);
+		dst_release(dst);
+		return ERR_PTR(-ELOOP);
+	}
+
+	if (use_cache)
+		dst_cache_set_ip6(dst_cache, dst, &fl6->saddr);
+	return dst;
+}
+#endif
+
+static int bareudp_fill_metadata_dst(struct net_device *dev,
+				     struct sk_buff *skb)
+{
+	struct ip_tunnel_info *info = skb_tunnel_info(skb);
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+
+	if (ip_tunnel_info_af(info) == AF_INET) {
+		struct rtable *rt;
+		struct flowi4 fl4;
+		struct bareudp_sock *bs4 = rcu_dereference(bareudp->sock4);
+
+		rt = bareudp_get_v4_rt(skb, dev, bs4, &fl4, info);
+		if (IS_ERR(rt))
+			return PTR_ERR(rt);
+
+		ip_rt_put(rt);
+		info->key.u.ipv4.src = fl4.saddr;
+#if IS_ENABLED(CONFIG_IPV6)
+	} else if (ip_tunnel_info_af(info) == AF_INET6) {
+		struct dst_entry *dst;
+		struct flowi6 fl6;
+		struct bareudp_sock *bs6 = rcu_dereference(bareudp->sock6);
+
+		dst = bareudp_get_v6_dst(skb, dev, bs6, &fl6, info);
+		if (IS_ERR(dst))
+			return PTR_ERR(dst);
+
+		dst_release(dst);
+		info->key.u.ipv6.src = fl6.saddr;
+#endif
+	} else {
+		return -EINVAL;
+	}
+
+	info->key.tp_src = udp_flow_src_port(bareudp->net, skb,
+					     bareudp->sport_min,
+					     USHRT_MAX, true);
+	info->key.tp_dst = bareudp->conf.port;
+	return 0;
+}
+
+static int bareudp_xmit_skb(struct sk_buff *skb, struct net_device *dev,
+			    struct bareudp_dev *bareudp,
+			    const struct ip_tunnel_info *info)
+{
+	bool xnet = !net_eq(bareudp->net, dev_net(bareudp->dev));
+	struct bareudp_sock *bs4 = rcu_dereference(bareudp->sock4);
+	const struct ip_tunnel_key *key = &info->key;
+	bool udp_sum = !!(info->key.tun_flags & TUNNEL_CSUM);
+	int err;
+	struct rtable *rt;
+	struct flowi4 fl4;
+	__u8 tos, ttl;
+	__be16 sport;
+	__be16 df;
+	int min_headroom;
+
+	rt = bareudp_get_v4_rt(skb, dev, bs4, &fl4, info);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+
+	skb_tunnel_check_pmtu(skb, &rt->dst,
+			      BAREUDP_IPV4_HLEN + info->options_len);
+
+	sport = udp_flow_src_port(bareudp->net, skb,
+				  bareudp->sport_min, USHRT_MAX,
+				  true);
+	tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
+	ttl = key->ttl;
+	df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
+	skb_scrub_packet(skb, xnet);
+
+	if (!skb_pull(skb, skb->mac_len))
+		goto free_dst;
+
+	skb_reset_mac_header(skb);
+
+	min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len +
+		BAREUDP_BASE_HLEN + info->options_len + sizeof(struct iphdr);
+
+	err = skb_cow_head(skb, min_headroom);
+	if (unlikely(err))
+		goto free_dst;
+
+	err = udp_tunnel_handle_offloads(skb, udp_sum);
+	if (err)
+		goto free_dst;
+
+	skb_set_inner_protocol(skb, bareudp->ethertype);
+
+	udp_tunnel_xmit_skb(rt, bs4->sock->sk, skb, fl4.saddr, fl4.daddr,
+			    tos, ttl, df, sport, bareudp->conf.port,
+			    !net_eq(bareudp->net, dev_net(bareudp->dev)),
+			    !(info->key.tun_flags & TUNNEL_CSUM));
+	return 0;
+
+free_dst:
+	dst_release(&rt->dst);
+	return err;
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static int bareudp6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
+			     struct bareudp_dev *bareudp,
+			     const struct ip_tunnel_info *info)
+{
+	bool xnet = !net_eq(bareudp->net, dev_net(bareudp->dev));
+	struct bareudp_sock *bs6 = rcu_dereference(bareudp->sock6);
+	const struct ip_tunnel_key *key = &info->key;
+	bool udp_sum = !!(info->key.tun_flags & TUNNEL_CSUM);
+	struct dst_entry *dst = NULL;
+	struct flowi6 fl6;
+	__u8 prio, ttl;
+	__be16 sport;
+	int min_headroom;
+	int err;
+
+	dst = bareudp_get_v6_dst(skb, dev, bs6, &fl6, info);
+	if (IS_ERR(dst))
+		return PTR_ERR(dst);
+
+	skb_tunnel_check_pmtu(skb, dst, BAREUDP_IPV6_HLEN + info->options_len);
+
+	sport = udp_flow_src_port(bareudp->net, skb,
+				  bareudp->sport_min, USHRT_MAX,
+				  true);
+	prio = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
+	ttl = key->ttl;
+
+	skb_scrub_packet(skb, xnet);
+	if (!skb_pull(skb, skb->mac_len))
+		goto free_dst;
+
+	skb_reset_mac_header(skb);
+
+	min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len +
+		BAREUDP_BASE_HLEN + info->options_len + sizeof(struct iphdr);
+
+	err = skb_cow_head(skb, min_headroom);
+	if (unlikely(err))
+		goto free_dst;
+
+	err = udp_tunnel_handle_offloads(skb, udp_sum);
+	if (err)
+		goto free_dst;
+
+	udp_tunnel6_xmit_skb(dst, bs6->sock->sk, skb, dev,
+			     &fl6.saddr, &fl6.daddr, prio, ttl,
+			     info->key.label, sport, bareudp->conf.port,
+			     !(info->key.tun_flags & TUNNEL_CSUM));
+	return 0;
+
+free_dst:
+	dst_release(dst);
+	return err;
+
+}
+#endif
+
+static netdev_tx_t bareudp_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+	struct ip_tunnel_info *info = NULL;
+	int err;
+
+	if (skb->protocol != bareudp->ethertype) {
+		err = -EINVAL;
+		goto tx_error;
+	}
+
+	info = skb_tunnel_info(skb);
+	if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) {
+		err = -EINVAL;
+		goto tx_error;
+	}
+
+	rcu_read_lock();
+#if IS_ENABLED(CONFIG_IPV6)
+	if (info->mode & IP_TUNNEL_INFO_IPV6)
+		err = bareudp6_xmit_skb(skb, dev, bareudp, info);
+	else
+#endif
+		err = bareudp_xmit_skb(skb, dev, bareudp, info);
+
+	rcu_read_unlock();
+
+	if (likely(!err))
+		return NETDEV_TX_OK;
+tx_error:
+	dev_kfree_skb(skb);
+
+	if (err == -ELOOP)
+		dev->stats.collisions++;
+	else if (err == -ENETUNREACH)
+		dev->stats.tx_carrier_errors++;
+
+	dev->stats.tx_errors++;
+	return NETDEV_TX_OK;
+}
+
+static int bareudp_open(struct net_device *dev)
+{
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+	int ret = 0;
+
+#if IS_ENABLED(CONFIG_IPV6)
+	ret = bareudp_sock_add(bareudp, true);
+#endif
+	ret = bareudp_sock_add(bareudp, false);
+
+	return ret;
+}
+
+static int bareudp_stop(struct net_device *dev)
+{
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+
+	bareudp_sock_release(bareudp);
+	return 0;
+}
+
+static int bareudp_init(struct net_device *dev)
+{
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+	int err;
+
+	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	err = gro_cells_init(&bareudp->gro_cells, dev);
+	if (err) {
+		free_percpu(dev->tstats);
+		return err;
+	}
+	return 0;
+}
+
+static void bareudp_uninit(struct net_device *dev)
+{
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+
+	gro_cells_destroy(&bareudp->gro_cells);
+	free_percpu(dev->tstats);
+}
+
+static int bareudp_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (new_mtu > dev->max_mtu)
+		new_mtu = dev->max_mtu;
+	else if (new_mtu < dev->min_mtu)
+		new_mtu = dev->min_mtu;
+
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static const struct net_device_ops bareudp_netdev_ops = {
+	.ndo_init               = bareudp_init,
+	.ndo_uninit             = bareudp_uninit,
+	.ndo_open               = bareudp_open,
+	.ndo_stop               = bareudp_stop,
+	.ndo_start_xmit         = bareudp_xmit,
+	.ndo_get_stats64        = ip_tunnel_get_stats64,
+	.ndo_change_mtu         = bareudp_change_mtu,
+	.ndo_validate_addr      = eth_validate_addr,
+	.ndo_set_mac_address    = eth_mac_addr,
+	.ndo_fill_metadata_dst  = bareudp_fill_metadata_dst,
+};
+
+static const struct nla_policy bareudp_policy[IFLA_BAREUDP_MAX + 1] = {
+	[IFLA_BAREUDP_PORT]                = { .type = NLA_U16 },
+	[IFLA_BAREUDP_ETHERTYPE]	   = { .type = NLA_U16 },
+	[IFLA_BAREUDP_SRCPORT_MIN]         = { .type = NLA_U16 },
+};
+
+static int bareudp_validate(struct nlattr *tb[], struct nlattr *data[],
+			    struct netlink_ext_ack *extack)
+{
+	if (!data) {
+		NL_SET_ERR_MSG(extack,
+			       "Not enough attributes provided to perform the operation");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static void bareudp_get_drvinfo(struct net_device *dev,
+				struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, "bareudp", sizeof(drvinfo->driver));
+}
+
+static const struct ethtool_ops bareudp_ethtool_ops = {
+	.get_drvinfo    = bareudp_get_drvinfo,
+	.get_link       = ethtool_op_get_link,
+};
+
+/* Info for udev, that this is a virtual tunnel endpoint */
+static struct device_type bareudp_type = {
+	.name = "bareudp",
+};
+
+/* Initialize the device structure. */
+static void bareudp_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+
+	dev->netdev_ops = &bareudp_netdev_ops;
+	dev->ethtool_ops = &bareudp_ethtool_ops;
+	dev->needs_free_netdev = true;
+
+	SET_NETDEV_DEVTYPE(dev, &bareudp_type);
+
+	dev->features    |= NETIF_F_SG | NETIF_F_HW_CSUM;
+	dev->features    |= NETIF_F_RXCSUM;
+	dev->features    |= NETIF_F_GSO_SOFTWARE;
+
+	dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
+	dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+
+	/* MTU range: 68 - (something less than 65535) */
+	dev->min_mtu = ETH_MIN_MTU;
+	dev->max_mtu = IP_MAX_MTU - BAREUDP_BASE_HLEN - dev->hard_header_len;
+
+	netif_keep_dst(dev);
+	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
+	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE;
+	eth_hw_addr_random(dev);
+}
+
+static struct bareudp_dev *bareudp_find_dev(struct bareudp_net *bn,
+					    const struct bareudp_conf *conf)
+{
+	struct bareudp_dev *bareudp, *t = NULL;
+
+	list_for_each_entry(bareudp, &bn->bareudp_list, next) {
+		if (conf->port == bareudp->conf.port)
+			t = bareudp;
+	}
+	return t;
+}
+
+static int bareudp_configure(struct net *net, struct net_device *dev,
+			     struct bareudp_conf *conf)
+{
+	struct bareudp_net *bn = net_generic(net, bareudp_net_id);
+	struct bareudp_dev *t, *bareudp = netdev_priv(dev);
+	int err;
+
+	bareudp->net = net;
+	bareudp->dev = dev;
+	t = bareudp_find_dev(bn, conf);
+	if (t)
+		return -EBUSY;
+
+	bareudp->conf = *conf;
+	bareudp->ethertype = conf->ethertype;
+	bareudp->sport_min = conf->sport_min;
+	err = register_netdevice(dev);
+	if (err)
+		return err;
+
+	list_add(&bareudp->next, &bn->bareudp_list);
+	return 0;
+}
+
+static int bareudp2info(struct nlattr *data[], struct bareudp_conf *conf)
+{
+	if (!data[IFLA_BAREUDP_PORT] || !data[IFLA_BAREUDP_ETHERTYPE])
+		return -EINVAL;
+
+	if (data[IFLA_BAREUDP_PORT])
+		conf->port =  nla_get_u16(data[IFLA_BAREUDP_PORT]);
+
+	if (data[IFLA_BAREUDP_ETHERTYPE])
+		conf->ethertype =  nla_get_u16(data[IFLA_BAREUDP_ETHERTYPE]);
+
+	if (data[IFLA_BAREUDP_SRCPORT_MIN])
+		conf->sport_min =  nla_get_u16(data[IFLA_BAREUDP_SRCPORT_MIN]);
+
+	return 0;
+}
+
+static void bareudp_link_config(struct net_device *dev,
+				struct nlattr *tb[])
+{
+	if (tb[IFLA_MTU])
+		bareudp_change_mtu(dev, nla_get_u32(tb[IFLA_MTU]));
+}
+
+static int bareudp_newlink(struct net *net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[],
+			   struct netlink_ext_ack *extack)
+{
+	struct bareudp_conf conf;
+	int err;
+
+	err = bareudp2info(data, &conf);
+	if (err)
+		return err;
+
+	err = bareudp_configure(net, dev, &conf);
+	if (err)
+		return err;
+
+	bareudp_link_config(dev, tb);
+	return 0;
+}
+
+static void bareudp_dellink(struct net_device *dev, struct list_head *head)
+{
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+
+	list_del(&bareudp->next);
+	unregister_netdevice_queue(dev, head);
+}
+
+static size_t bareudp_get_size(const struct net_device *dev)
+{
+	return  nla_total_size(sizeof(__be16)) +  /* IFLA_BAREUDP_PORT */
+		nla_total_size(sizeof(__be16)) +  /* IFLA_BAREUDP_ETHERTYPE */
+		nla_total_size(sizeof(__u16))  +  /* IFLA_BAREUDP_SRCPORT_MIN */
+		0;
+}
+
+static int bareudp_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	struct bareudp_dev *bareudp = netdev_priv(dev);
+
+	if (nla_put_be16(skb, IFLA_BAREUDP_PORT, bareudp->conf.port))
+		goto nla_put_failure;
+	if (nla_put_be16(skb, IFLA_BAREUDP_ETHERTYPE, bareudp->conf.ethertype))
+		goto nla_put_failure;
+	if (nla_put_u16(skb, IFLA_BAREUDP_SRCPORT_MIN, bareudp->conf.sport_min))
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
+static struct rtnl_link_ops bareudp_link_ops __read_mostly = {
+	.kind           = "bareudp",
+	.maxtype        = IFLA_BAREUDP_MAX,
+	.policy         = bareudp_policy,
+	.priv_size      = sizeof(struct bareudp_dev),
+	.setup          = bareudp_setup,
+	.validate       = bareudp_validate,
+	.newlink        = bareudp_newlink,
+	.dellink        = bareudp_dellink,
+	.get_size       = bareudp_get_size,
+	.fill_info      = bareudp_fill_info,
+};
+
+struct net_device *bareudp_dev_create(struct net *net, const char *name,
+				      u8 name_assign_type,
+				      struct bareudp_conf *conf)
+{
+	struct nlattr *tb[IFLA_MAX + 1];
+	struct net_device *dev;
+	LIST_HEAD(list_kill);
+	int err;
+
+	memset(tb, 0, sizeof(tb));
+	dev = rtnl_create_link(net, name, name_assign_type,
+			       &bareudp_link_ops, tb, NULL);
+	if (IS_ERR(dev))
+		return dev;
+
+	err = bareudp_configure(net, dev, conf);
+	if (err) {
+		free_netdev(dev);
+		return ERR_PTR(err);
+	}
+	err = bareudp_change_mtu(dev, IP_MAX_MTU);
+	if (err)
+		goto err;
+
+	err = rtnl_configure_link(dev, NULL);
+	if (err < 0)
+		goto err;
+
+	return dev;
+err:
+	bareudp_dellink(dev, &list_kill);
+	unregister_netdevice_many(&list_kill);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(bareudp_dev_create);
+
+static __net_init int bareudp_init_net(struct net *net)
+{
+	struct bareudp_net *bn = net_generic(net, bareudp_net_id);
+
+	INIT_LIST_HEAD(&bn->bareudp_list);
+	return 0;
+}
+
+static void bareudp_destroy_tunnels(struct net *net, struct list_head *head)
+{
+	struct bareudp_net *bn = net_generic(net, bareudp_net_id);
+	struct bareudp_dev *bareudp, *next;
+	struct net_device *dev, *aux;
+
+	/* gather any bareudp devices that were moved into this ns */
+	for_each_netdev_safe(net, dev, aux)
+		if (dev->rtnl_link_ops == &bareudp_link_ops)
+			unregister_netdevice_queue(dev, head);
+
+	/* now gather any other bareudp devices that were created in this ns */
+	list_for_each_entry_safe(bareudp, next, &bn->bareudp_list, next) {
+		/* If bareudp->dev is in the same netns, it was already added
+		 * to the list by the previous loop.
+		 */
+		if (!net_eq(dev_net(bareudp->dev), net))
+			unregister_netdevice_queue(bareudp->dev, head);
+	}
+}
+
+static void __net_exit bareudp_exit_batch_net(struct list_head *net_list)
+{
+	struct net *net;
+	LIST_HEAD(list);
+
+	rtnl_lock();
+	list_for_each_entry(net, net_list, exit_list)
+		bareudp_destroy_tunnels(net, &list);
+
+	/* unregister the devices gathered above */
+	unregister_netdevice_many(&list);
+	rtnl_unlock();
+}
+
+static struct pernet_operations bareudp_net_ops = {
+	.init = bareudp_init_net,
+	.exit_batch = bareudp_exit_batch_net,
+	.id   = &bareudp_net_id,
+	.size = sizeof(struct bareudp_net),
+};
+
+static int __init bareudp_init_module(void)
+{
+	int rc;
+
+	rc = register_pernet_subsys(&bareudp_net_ops);
+	if (rc)
+		goto out1;
+
+	rc = rtnl_link_register(&bareudp_link_ops);
+	if (rc)
+		goto out3;
+
+	return 0;
+out3:
+	unregister_pernet_subsys(&bareudp_net_ops);
+out1:
+	return rc;
+}
+late_initcall(bareudp_init_module);
+
+static void __exit bareudp_cleanup_module(void)
+{
+	rtnl_link_unregister(&bareudp_link_ops);
+	unregister_pernet_subsys(&bareudp_net_ops);
+}
+module_exit(bareudp_cleanup_module);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Martin Varghese <martin.varghese@nokia.com>");
+MODULE_DESCRIPTION("Interface driver for UDP encapsulated traffic");
diff --git a/include/net/bareudp.h b/include/net/bareudp.h
new file mode 100644
index 0000000..513fae6
--- /dev/null
+++ b/include/net/bareudp.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __NET_BAREUDP_H
+#define __NET_BAREUDP_H
+
+#include <linux/types.h>
+#include <linux/skbuff.h>
+
+struct bareudp_conf {
+	__be16 ethertype;
+	__be16 port;
+	u16 sport_min;
+};
+
+struct net_device *bareudp_dev_create(struct net *net, const char *name,
+				      u8 name_assign_type,
+				      struct bareudp_conf *info);
+
+#endif
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4a8c02c..012f7e8 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -580,6 +580,18 @@ enum ifla_geneve_df {
 	GENEVE_DF_MAX = __GENEVE_DF_END - 1,
 };
 
+/* Bareudp section  */
+enum {
+	IFLA_BAREUDP_UNSPEC,
+	IFLA_BAREUDP_PORT,
+	IFLA_BAREUDP_ETHERTYPE,
+	IFLA_BAREUDP_SRCPORT_MIN,
+	__IFLA_BAREUDP_MAX
+};
+
+#define IFLA_BAREUDP_MAX (__IFLA_BAREUDP_MAX - 1)
+
+
 /* PPP section */
 enum {
 	IFLA_PPP_UNSPEC,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next 2/2] Special handling for IP & MPLS.
  2019-10-08  9:48 [PATCH net-next 0/2] Bareudp Tunnel Module Martin Varghese
  2019-10-08  9:48 ` [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc Martin Varghese
@ 2019-10-08  9:49 ` Martin Varghese
  2019-10-08 14:58   ` Randy Dunlap
  2019-10-08 16:09   ` Willem de Bruijn
  1 sibling, 2 replies; 36+ messages in thread
From: Martin Varghese @ 2019-10-08  9:49 UTC (permalink / raw)
  To: netdev, davem, corbet, scott.drennan, jbenc, martin.varghese
  Cc: Martin Varghese

From: Martin <martin.varghese@nokia.com>

Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>

Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
---
 Documentation/networking/bareudp.txt | 18 ++++++++
 drivers/net/bareudp.c                | 82 +++++++++++++++++++++++++++++++++---
 include/net/bareudp.h                |  1 +
 include/uapi/linux/if_link.h         |  1 +
 4 files changed, 95 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
index d2530e2..4de1022 100644
--- a/Documentation/networking/bareudp.txt
+++ b/Documentation/networking/bareudp.txt
@@ -9,6 +9,15 @@ The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
 support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
 a UDP tunnel.
 
+Special Handling
+----------------
+The bareudp device supports special handling for MPLS & IP as they can have
+multiple ethertypes.
+MPLS procotcol can have ethertypes 0x8847 (unicast) & 0x8847 (multicast).
+IP proctocol can have ethertypes 0x0800 (v4) & 0x866 (v6).
+This special handling can be enabled only for ethertype 0x0800 & 0x88847 with a
+flag called extended mode.
+
 Usage
 ------
 
@@ -21,3 +30,12 @@ This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
 The device will listen on UDP port 6635 to receive traffic.
 
 b. ip link delete bareudp0
+
+2. Device creation with extended mode enabled
+
+There are two ways to create a bareudp device for MPLS & IP with extended mode
+enabled
+
+a. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype 0x8847 extmode 1
+
+b. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype mpls
diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
index 7e6813a..2a688da 100644
--- a/drivers/net/bareudp.c
+++ b/drivers/net/bareudp.c
@@ -48,6 +48,7 @@ struct bareudp_dev {
 	struct net_device  *dev;        /* netdev for bareudp tunnel */
 	__be16		   ethertype;
 	u16	           sport_min;
+	bool               ext_mode;
 	struct bareudp_conf conf;
 	struct bareudp_sock __rcu *sock4; /* IPv4 socket for bareudp tunnel */
 #if IS_ENABLED(CONFIG_IPV6)
@@ -82,15 +83,64 @@ static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 		goto drop;
 
 	bareudp = bs->bareudp;
-	proto = bareudp->ethertype;
+	if (!bareudp)
+		goto drop;
+
+	if (bareudp->ethertype == htons(ETH_P_IP)) {
+		struct iphdr *iphdr;
+
+		iphdr = (struct iphdr *)(skb->data + BAREUDP_BASE_HLEN);
+		if (iphdr->version == 4) {
+			proto = bareudp->ethertype;
+		} else if (bareudp->ext_mode && (iphdr->version == 6)) {
+			proto = htons(ETH_P_IPV6);
+		} else {
+			bareudp->dev->stats.rx_dropped++;
+			goto drop;
+		}
+	} else if (bareudp->ethertype == htons(ETH_P_MPLS_UC)) {
+		struct iphdr *tunnel_hdr;
+
+		tunnel_hdr = (struct iphdr *)skb_network_header(skb);
+		if (tunnel_hdr->version == 4) {
+			if (!ipv4_is_multicast(tunnel_hdr->daddr)) {
+				proto = bareudp->ethertype;
+			} else if (bareudp->ext_mode &&
+				   ipv4_is_multicast(tunnel_hdr->daddr)) {
+				proto = htons(ETH_P_MPLS_MC);
+			} else {
+				bareudp->dev->stats.rx_dropped++;
+				goto drop;
+			}
+		} else {
+			int addr_type;
+			struct ipv6hdr *tunnel_hdr_v6;
+
+			tunnel_hdr_v6 = (struct ipv6hdr *)skb_network_header(skb);
+			addr_type =
+			ipv6_addr_type((struct in6_addr *)&tunnel_hdr_v6->daddr);
+			if (!(addr_type & IPV6_ADDR_MULTICAST)) {
+				proto = bareudp->ethertype;
+			} else if (bareudp->ext_mode &&
+				   (addr_type & IPV6_ADDR_MULTICAST)) {
+				proto = htons(ETH_P_MPLS_MC);
+			} else {
+				bareudp->dev->stats.rx_dropped++;
+				goto drop;
+			}
+		}
+	} else {
+		proto = bareudp->ethertype;
+	}
 
 	if (iptunnel_pull_header(skb, BAREUDP_BASE_HLEN,
-				 proto,
-				 !net_eq(bareudp->net,
-					 dev_net(bareudp->dev)))) {
+				proto,
+				!net_eq(bareudp->net,
+					dev_net(bareudp->dev)))) {
 		bareudp->dev->stats.rx_dropped++;
 		goto drop;
 	}
+
 	tun_dst = udp_tun_rx_dst(skb, bareudp_get_sk_family(bs), TUNNEL_KEY,
 				 0, 0);
 	if (!tun_dst) {
@@ -522,10 +572,13 @@ static netdev_tx_t bareudp_xmit(struct sk_buff *skb, struct net_device *dev)
 	int err;
 
 	if (skb->protocol != bareudp->ethertype) {
-		err = -EINVAL;
-		goto tx_error;
+		if (!bareudp->ext_mode ||
+		    (skb->protocol !=  htons(ETH_P_MPLS_MC) &&
+		     skb->protocol !=  htons(ETH_P_IPV6))) {
+			err = -EINVAL;
+			goto tx_error;
+		}
 	}
-
 	info = skb_tunnel_info(skb);
 	if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) {
 		err = -EINVAL;
@@ -630,6 +683,7 @@ static int bareudp_change_mtu(struct net_device *dev, int new_mtu)
 	[IFLA_BAREUDP_PORT]                = { .type = NLA_U16 },
 	[IFLA_BAREUDP_ETHERTYPE]	   = { .type = NLA_U16 },
 	[IFLA_BAREUDP_SRCPORT_MIN]         = { .type = NLA_U16 },
+	[IFLA_BAREUDP_EXTMODE]             = { .type = NLA_FLAG },
 };
 
 static int bareudp_validate(struct nlattr *tb[], struct nlattr *data[],
@@ -712,9 +766,15 @@ static int bareudp_configure(struct net *net, struct net_device *dev,
 	if (t)
 		return -EBUSY;
 
+	if (conf->ext_mode &&
+	    (conf->ethertype != htons(ETH_P_MPLS_UC) &&
+	     conf->ethertype != htons(ETH_P_IP)))
+		return -EINVAL;
+
 	bareudp->conf = *conf;
 	bareudp->ethertype = conf->ethertype;
 	bareudp->sport_min = conf->sport_min;
+	bareudp->ext_mode = conf->ext_mode;
 	err = register_netdevice(dev);
 	if (err)
 		return err;
@@ -737,6 +797,11 @@ static int bareudp2info(struct nlattr *data[], struct bareudp_conf *conf)
 	if (data[IFLA_BAREUDP_SRCPORT_MIN])
 		conf->sport_min =  nla_get_u16(data[IFLA_BAREUDP_SRCPORT_MIN]);
 
+	if (data[IFLA_BAREUDP_EXTMODE])
+		conf->ext_mode = true;
+	else
+		conf->ext_mode = false;
+
 	return 0;
 }
 
@@ -779,6 +844,7 @@ static size_t bareudp_get_size(const struct net_device *dev)
 	return  nla_total_size(sizeof(__be16)) +  /* IFLA_BAREUDP_PORT */
 		nla_total_size(sizeof(__be16)) +  /* IFLA_BAREUDP_ETHERTYPE */
 		nla_total_size(sizeof(__u16))  +  /* IFLA_BAREUDP_SRCPORT_MIN */
+		nla_total_size(0)              +  /* IFLA_BAREUDP_EXTMODE */
 		0;
 }
 
@@ -792,6 +858,8 @@ static int bareudp_fill_info(struct sk_buff *skb, const struct net_device *dev)
 		goto nla_put_failure;
 	if (nla_put_u16(skb, IFLA_BAREUDP_SRCPORT_MIN, bareudp->conf.sport_min))
 		goto nla_put_failure;
+	if (bareudp->ext_mode && nla_put_flag(skb, IFLA_BAREUDP_EXTMODE))
+		goto nla_put_failure;
 
 	return 0;
 
diff --git a/include/net/bareudp.h b/include/net/bareudp.h
index 513fae6..2c121d8 100644
--- a/include/net/bareudp.h
+++ b/include/net/bareudp.h
@@ -10,6 +10,7 @@ struct bareudp_conf {
 	__be16 ethertype;
 	__be16 port;
 	u16 sport_min;
+	bool ext_mode;
 };
 
 struct net_device *bareudp_dev_create(struct net *net, const char *name,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 012f7e8..2b91c872 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -586,6 +586,7 @@ enum {
 	IFLA_BAREUDP_PORT,
 	IFLA_BAREUDP_ETHERTYPE,
 	IFLA_BAREUDP_SRCPORT_MIN,
+	IFLA_BAREUDP_EXTMODE,
 	__IFLA_BAREUDP_MAX
 };
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-08  9:48 ` [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc Martin Varghese
@ 2019-10-08 14:06   ` Jonathan Corbet
  2019-10-08 14:57   ` Randy Dunlap
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Jonathan Corbet @ 2019-10-08 14:06 UTC (permalink / raw)
  To: Martin Varghese; +Cc: netdev, davem, scott.drennan, jbenc, martin.varghese

On Tue,  8 Oct 2019 15:18:53 +0530
Martin Varghese <martinvarghesenokia@gmail.com> wrote:

> diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> new file mode 100644
> index 0000000..d2530e2
> --- /dev/null
> +++ b/Documentation/networking/bareudp.txt
> @@ -0,0 +1,23 @@
> +Bare UDP Tunnelling Module Documentation
> +========================================
> +
> +There are various L3 encapsulation standards using UDP being discussed to
> +leverage the UDP based load balancing capability of different networks.
> +MPLSoUDP (https://tools.ietf.org/html/rfc7510)is one among them.
> +
> +The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
> +support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
> +a UDP tunnel.
> +
> +Usage
> +------
> +
> +1. Device creation & deletion
> +
> +a. ip link add dev bareudp0 type bareudp dstport 6635 ethertype 0x8847
> +
> +This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
> +0x8847 (MPLS traffic).The destination port of the UDP header will be set to 6635
> +The device will listen on UDP port 6635 to receive traffic.
> +
> +b. ip link delete bareudp0

Please add new documentation in the RST format if at all possible.  This
document is 95% RST already; doing the last bit of work will save somebody
else the effort of converting it in the future.

Thanks,

jon

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-08  9:48 ` [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc Martin Varghese
  2019-10-08 14:06   ` Jonathan Corbet
@ 2019-10-08 14:57   ` Randy Dunlap
  2019-10-08 16:04   ` Stephen Hemminger
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Randy Dunlap @ 2019-10-08 14:57 UTC (permalink / raw)
  To: Martin Varghese, netdev, davem, corbet, scott.drennan, jbenc,
	martin.varghese

On 10/8/19 2:48 AM, Martin Varghese wrote:
> From: Martin <martin.varghese@nokia.com>
> 
> The Bareudp tunnel module provides a generic L3 encapsulation
> tunnelling module for tunnelling different protocols like MPLS,
> IP,NSH etc inside a UDP tunnel.
> 
> Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> ---
>  Documentation/networking/bareudp.txt |  23 +
>  drivers/net/Kconfig                  |  13 +
>  drivers/net/Makefile                 |   1 +
>  drivers/net/bareudp.c                | 930 +++++++++++++++++++++++++++++++++++
>  include/net/bareudp.h                |  19 +
>  include/uapi/linux/if_link.h         |  12 +
>  6 files changed, 998 insertions(+)
>  create mode 100644 Documentation/networking/bareudp.txt
>  create mode 100644 drivers/net/bareudp.c
>  create mode 100644 include/net/bareudp.h
> 
> diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> new file mode 100644
> index 0000000..d2530e2
> --- /dev/null
> +++ b/Documentation/networking/bareudp.txt
> @@ -0,0 +1,23 @@
> +Bare UDP Tunnelling Module Documentation
> +========================================
> +
> +There are various L3 encapsulation standards using UDP being discussed to
> +leverage the UDP based load balancing capability of different networks.
> +MPLSoUDP (https://tools.ietf.org/html/rfc7510)is one among them.

add space after ')'.

> +
> +The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
> +support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
> +a UDP tunnel.
> +
> +Usage
> +------
> +
> +1. Device creation & deletion
> +
> +a. ip link add dev bareudp0 type bareudp dstport 6635 ethertype 0x8847
> +
> +This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
> +0x8847 (MPLS traffic).The destination port of the UDP header will be set to 6635

add space after '.'.
add ending '.'.

> +The device will listen on UDP port 6635 to receive traffic.
> +
> +b. ip link delete bareudp0
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 48e209e..a389fac 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -215,6 +215,19 @@ config GENEVE
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called geneve.
>  
> +config BAREUDP
> +       tristate "Bare UDP  Encapsulation"

	  drop one space between UDP and Encapsulation.

> +       depends on INET && NET_UDP_TUNNEL
> +       depends on IPV6 || !IPV6
> +       select NET_IP_TUNNEL
> +       select GRO_CELLS
> +       help
> +          This adds a bare udp tunnel module for tunnelling different

		s/udp/UDP/

> +          kind of traffic like MPLS, IP, etc. inside a UDP tunnel.
> +
> +          To compile this driver as a module, choose M here: the module
> +          will be called bareudp.
> +
>  config GTP
>  	tristate "GPRS Tunneling Protocol datapath (GTP-U)"
>  	depends on INET


-- 
~Randy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 2/2] Special handling for IP & MPLS.
  2019-10-08  9:49 ` [PATCH net-next 2/2] Special handling for IP & MPLS Martin Varghese
@ 2019-10-08 14:58   ` Randy Dunlap
  2019-10-08 16:09   ` Willem de Bruijn
  1 sibling, 0 replies; 36+ messages in thread
From: Randy Dunlap @ 2019-10-08 14:58 UTC (permalink / raw)
  To: Martin Varghese, netdev, davem, corbet, scott.drennan, jbenc,
	martin.varghese

On 10/8/19 2:49 AM, Martin Varghese wrote:
> From: Martin <martin.varghese@nokia.com>
> 
> Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> 
> Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>

drop one of those.

> ---
>  Documentation/networking/bareudp.txt | 18 ++++++++
>  drivers/net/bareudp.c                | 82 +++++++++++++++++++++++++++++++++---
>  include/net/bareudp.h                |  1 +
>  include/uapi/linux/if_link.h         |  1 +
>  4 files changed, 95 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> index d2530e2..4de1022 100644
> --- a/Documentation/networking/bareudp.txt
> +++ b/Documentation/networking/bareudp.txt
> @@ -9,6 +9,15 @@ The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
>  support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
>  a UDP tunnel.
>  
> +Special Handling
> +----------------
> +The bareudp device supports special handling for MPLS & IP as they can have
> +multiple ethertypes.
> +MPLS procotcol can have ethertypes 0x8847 (unicast) & 0x8847 (multicast).

                                                         0x8848

> +IP proctocol can have ethertypes 0x0800 (v4) & 0x866 (v6).
> +This special handling can be enabled only for ethertype 0x0800 & 0x88847 with a
> +flag called extended mode.
> +
>  Usage
>  ------
>  
> @@ -21,3 +30,12 @@ This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
>  The device will listen on UDP port 6635 to receive traffic.
>  
>  b. ip link delete bareudp0
> +
> +2. Device creation with extended mode enabled
> +
> +There are two ways to create a bareudp device for MPLS & IP with extended mode
> +enabled

end that sentence with a period. (or full stop)

> +
> +a. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype 0x8847 extmode 1
> +
> +b. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype mpls


-- 
~Randy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-08  9:48 ` [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc Martin Varghese
  2019-10-08 14:06   ` Jonathan Corbet
  2019-10-08 14:57   ` Randy Dunlap
@ 2019-10-08 16:04   ` Stephen Hemminger
  2019-10-08 16:05   ` Stephen Hemminger
  2019-10-08 16:28   ` Willem de Bruijn
  4 siblings, 0 replies; 36+ messages in thread
From: Stephen Hemminger @ 2019-10-08 16:04 UTC (permalink / raw)
  To: Martin Varghese
  Cc: netdev, davem, corbet, scott.drennan, jbenc, martin.varghese

On Tue,  8 Oct 2019 15:18:53 +0530
Martin Varghese <martinvarghesenokia@gmail.com> wrote:

> +#if IS_ENABLED(CONFIG_IPV6)
> +	ret = bareudp_sock_add(bareudp, true);
> +#endif
> +	ret = bareudp_sock_add(bareudp, false);
> +
if ipv6 returns error it will get overriden by the ipv4 value

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-08  9:48 ` [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc Martin Varghese
                     ` (2 preceding siblings ...)
  2019-10-08 16:04   ` Stephen Hemminger
@ 2019-10-08 16:05   ` Stephen Hemminger
  2019-10-08 16:28   ` Willem de Bruijn
  4 siblings, 0 replies; 36+ messages in thread
From: Stephen Hemminger @ 2019-10-08 16:05 UTC (permalink / raw)
  To: Martin Varghese
  Cc: netdev, davem, corbet, scott.drennan, jbenc, martin.varghese

On Tue,  8 Oct 2019 15:18:53 +0530
Martin Varghese <martinvarghesenokia@gmail.com> wrote:

> +	if (new_mtu > dev->max_mtu)
> +		new_mtu = dev->max_mtu;
> +	else if (new_mtu < dev->min_mtu)
> +		new_mtu = dev->min_mtu;
> +

These checks are already handled as errors in dev_set_mtu_ext()

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 2/2] Special handling for IP & MPLS.
  2019-10-08  9:49 ` [PATCH net-next 2/2] Special handling for IP & MPLS Martin Varghese
  2019-10-08 14:58   ` Randy Dunlap
@ 2019-10-08 16:09   ` Willem de Bruijn
  2019-10-09 13:38     ` Martin Varghese
  1 sibling, 1 reply; 36+ messages in thread
From: Willem de Bruijn @ 2019-10-08 16:09 UTC (permalink / raw)
  To: Martin Varghese
  Cc: Network Development, David Miller, corbet, scott.drennan,
	Jiri Benc, martin.varghese

On Tue, Oct 8, 2019 at 5:52 AM Martin Varghese
<martinvarghesenokia@gmail.com> wrote:
>
> From: Martin <martin.varghese@nokia.com>
>

This commit would need a commit message.

> Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
>
> Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> ---
>  Documentation/networking/bareudp.txt | 18 ++++++++
>  drivers/net/bareudp.c                | 82 +++++++++++++++++++++++++++++++++---
>  include/net/bareudp.h                |  1 +
>  include/uapi/linux/if_link.h         |  1 +
>  4 files changed, 95 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> index d2530e2..4de1022 100644
> --- a/Documentation/networking/bareudp.txt
> +++ b/Documentation/networking/bareudp.txt
> @@ -9,6 +9,15 @@ The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
>  support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
>  a UDP tunnel.
>
> +Special Handling
> +----------------
> +The bareudp device supports special handling for MPLS & IP as they can have
> +multiple ethertypes.

Special in what way?

> +MPLS procotcol can have ethertypes 0x8847 (unicast) & 0x8847 (multicast).

0x8848. Use ETH_P_MPLS_UC and ETH_P_MPLS_MC instead of hard coding constants.

> +IP proctocol can have ethertypes 0x0800 (v4) & 0x866 (v6).
> +This special handling can be enabled only for ethertype 0x0800 & 0x88847 with a

Again typo.

> +flag called extended mode.
> +
>  Usage
>  ------
>
> @@ -21,3 +30,12 @@ This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
>  The device will listen on UDP port 6635 to receive traffic.
>
>  b. ip link delete bareudp0
> +
> +2. Device creation with extended mode enabled
> +
> +There are two ways to create a bareudp device for MPLS & IP with extended mode
> +enabled
> +
> +a. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype 0x8847 extmode 1
> +
> +b. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype mpls
> diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
> index 7e6813a..2a688da 100644
> --- a/drivers/net/bareudp.c
> +++ b/drivers/net/bareudp.c
> @@ -48,6 +48,7 @@ struct bareudp_dev {
>         struct net_device  *dev;        /* netdev for bareudp tunnel */
>         __be16             ethertype;
>         u16                sport_min;
> +       bool               ext_mode;
>         struct bareudp_conf conf;
>         struct bareudp_sock __rcu *sock4; /* IPv4 socket for bareudp tunnel */
>  #if IS_ENABLED(CONFIG_IPV6)
> @@ -82,15 +83,64 @@ static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
>                 goto drop;
>
>         bareudp = bs->bareudp;
> -       proto = bareudp->ethertype;
> +       if (!bareudp)
> +               goto drop;
> +
> +       if (bareudp->ethertype == htons(ETH_P_IP)) {
> +               struct iphdr *iphdr;
> +
> +               iphdr = (struct iphdr *)(skb->data + BAREUDP_BASE_HLEN);
> +               if (iphdr->version == 4) {
> +                       proto = bareudp->ethertype;
> +               } else if (bareudp->ext_mode && (iphdr->version == 6)) {
> +                       proto = htons(ETH_P_IPV6);

Verified packet length before reading at offset? Why does v6 needs
extended mode, while v4 does not?

For any packet encapsulated in UDP, the inner packet type will be
unknown, so needs to be configured on the device. That is not a
special feature. FOU gives an example. My main concern is that this
introduces a lot of code that nearly duplicates existing tunneling
support. It is not clear to me that existing logic cannot be
reused/extended.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-08  9:48 ` [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc Martin Varghese
                     ` (3 preceding siblings ...)
  2019-10-08 16:05   ` Stephen Hemminger
@ 2019-10-08 16:28   ` Willem de Bruijn
  2019-10-09 12:48     ` Martin Varghese
  2019-10-17 13:20     ` Martin Varghese
  4 siblings, 2 replies; 36+ messages in thread
From: Willem de Bruijn @ 2019-10-08 16:28 UTC (permalink / raw)
  To: Martin Varghese
  Cc: Network Development, David Miller, corbet, scott.drennan,
	Jiri Benc, martin.varghese

On Tue, Oct 8, 2019 at 5:51 AM Martin Varghese
<martinvarghesenokia@gmail.com> wrote:
>
> From: Martin <martin.varghese@nokia.com>
>
> The Bareudp tunnel module provides a generic L3 encapsulation
> tunnelling module for tunnelling different protocols like MPLS,
> IP,NSH etc inside a UDP tunnel.
>
> Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> ---
>  Documentation/networking/bareudp.txt |  23 +
>  drivers/net/Kconfig                  |  13 +
>  drivers/net/Makefile                 |   1 +
>  drivers/net/bareudp.c                | 930 +++++++++++++++++++++++++++++++++++
>  include/net/bareudp.h                |  19 +
>  include/uapi/linux/if_link.h         |  12 +
>  6 files changed, 998 insertions(+)
>  create mode 100644 Documentation/networking/bareudp.txt
>  create mode 100644 drivers/net/bareudp.c
>  create mode 100644 include/net/bareudp.h
>
> diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> new file mode 100644
> index 0000000..d2530e2
> --- /dev/null
> +++ b/Documentation/networking/bareudp.txt
> @@ -0,0 +1,23 @@
> +Bare UDP Tunnelling Module Documentation
> +========================================
> +
> +There are various L3 encapsulation standards using UDP being discussed to
> +leverage the UDP based load balancing capability of different networks.
> +MPLSoUDP (https://tools.ietf.org/html/rfc7510)is one among them.
> +
> +The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
> +support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
> +a UDP tunnel.

This patch set introduces a lot of code, much of which duplicates
existing tunnel devices, whether FOU using ipip tunnels and
fou_build_header or separate devices like vxlan and geneve. Let's try
to reuse what's there (and tested).

How does this differ from foo-over-udp (FOU). In supporting MPLS
alongside IP? If so, can it reuse the existing code, possibly with
patches to that existing tunnel logic?

I happened to have been playing around with MPLS in GRE recently. Have
not tried the same over UDP tunnels, but most of it should apply?

  ip tunnel add gre1 mode gre local 192.168.1.1 remote 192.168.1.2 dev eth0
  ip addr add 192.168.101.1 peer 192.168.101.2 dev gre1
  ip link set dev gre1 up

  sysctl net.mpls.conf.gre1.input=1
  sysctl net.mpls.platform_labels=17
  ip addr add 192.168.201.1/24 dev gre1
  ip route add 192.168.202.0/24 encap mpls 17 dev gre1
  ip -f mpls route add 16 dev lo


> +static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> +{
> +       struct bareudp_sock *bs;
> +       struct ethhdr *eh;
> +       struct bareudp_dev *bareudp;
> +       struct metadata_dst *tun_dst = NULL;
> +       struct pcpu_sw_netstats *stats;
> +       unsigned int len;
> +       int err = 0;
> +       void *oiph;
> +       u16 proto;
> +
> +       if (unlikely(!pskb_may_pull(skb, BAREUDP_BASE_HLEN)))
> +               goto drop;
> +
> +       bs = rcu_dereference_sk_user_data(sk);
> +       if (!bs)
> +               goto drop;
> +
> +       bareudp = bs->bareudp;
> +       proto = bareudp->ethertype;
> +
> +       if (iptunnel_pull_header(skb, BAREUDP_BASE_HLEN,
> +                                proto,
> +                                !net_eq(bareudp->net,
> +                                        dev_net(bareudp->dev)))) {
> +               bareudp->dev->stats.rx_dropped++;
> +               goto drop;
> +       }
> +       tun_dst = udp_tun_rx_dst(skb, bareudp_get_sk_family(bs), TUNNEL_KEY,
> +                                0, 0);
> +       if (!tun_dst) {
> +               bareudp->dev->stats.rx_dropped++;
> +               goto drop;
> +       }
> +       skb_dst_set(skb, &tun_dst->dst);

Is this dst metadata a firm requirement? It is optional in vxlan, say.
If here, too, please split such parts off into separate follow-on
patches.


> +       skb_push(skb, sizeof(struct ethhdr));
> +       eh = (struct ethhdr *)skb->data;
> +       eh->h_proto = proto;
> +
> +       skb_reset_mac_header(skb);
> +       skb->protocol = eth_type_trans(skb, bareudp->dev);
> +       skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> +       oiph = skb_network_header(skb);
> +       skb_reset_network_header(skb);
> +
> +       if (bareudp_get_sk_family(bs) == AF_INET)

This should be derived from packet contents, not socket state.
Although the one implies the other, I imagine.

> +static struct rtable *bareudp_get_v4_rt(struct sk_buff *skb,
> +                                       struct net_device *dev,
> +                                       struct bareudp_sock *bs4,
> +                                       struct flowi4 *fl4,
> +                                       const struct ip_tunnel_info *info)
> +{
> +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> +       struct bareudp_dev *bareudp = netdev_priv(dev);
> +       struct dst_cache *dst_cache;
> +       struct rtable *rt = NULL;
> +       __u8 tos;
> +
> +       if (!bs4)
> +               return ERR_PTR(-EIO);
> +
> +       memset(fl4, 0, sizeof(*fl4));
> +       fl4->flowi4_mark = skb->mark;
> +       fl4->flowi4_proto = IPPROTO_UDP;
> +       fl4->daddr = info->key.u.ipv4.dst;
> +       fl4->saddr = info->key.u.ipv4.src;
> +
> +       tos = info->key.tos;
> +       fl4->flowi4_tos = RT_TOS(tos);
> +
> +       dst_cache = (struct dst_cache *)&info->dst_cache;
> +       if (use_cache) {
> +               rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
> +               if (rt)
> +                       return rt;
> +       }
> +       rt = ip_route_output_key(bareudp->net, fl4);
> +       if (IS_ERR(rt)) {
> +               netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
> +               return ERR_PTR(-ENETUNREACH);
> +       }
> +       if (rt->dst.dev == dev) { /* is this necessary? */
> +               netdev_dbg(dev, "circular route to %pI4\n", &fl4->daddr);
> +               ip_rt_put(rt);
> +               return ERR_PTR(-ELOOP);
> +       }
> +       if (use_cache)
> +               dst_cache_set_ip4(dst_cache, &rt->dst, fl4->saddr);
> +       return rt;
> +}
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +static struct dst_entry *bareudp_get_v6_dst(struct sk_buff *skb,
> +                                           struct net_device *dev,
> +                                           struct bareudp_sock *bs6,
> +                                           struct flowi6 *fl6,
> +                                           const struct ip_tunnel_info *info)
> +{
> +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> +       struct bareudp_dev *bareudp = netdev_priv(dev);
> +       struct dst_entry *dst = NULL;
> +       struct dst_cache *dst_cache;
> +       __u8 prio;
> +
> +       if (!bs6)
> +               return ERR_PTR(-EIO);
> +
> +       memset(fl6, 0, sizeof(*fl6));
> +       fl6->flowi6_mark = skb->mark;
> +       fl6->flowi6_proto = IPPROTO_UDP;
> +       fl6->daddr = info->key.u.ipv6.dst;
> +       fl6->saddr = info->key.u.ipv6.src;
> +       prio = info->key.tos;
> +
> +       fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
> +                                          info->key.label);
> +       dst_cache = (struct dst_cache *)&info->dst_cache;
> +       if (use_cache) {
> +               dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
> +               if (dst)
> +                       return dst;
> +       }
> +       if (ipv6_stub->ipv6_dst_lookup(bareudp->net, bs6->sock->sk, &dst,
> +                                      fl6)) {
> +               netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
> +               return ERR_PTR(-ENETUNREACH);
> +       }
> +       if (dst->dev == dev) { /* is this necessary? */
> +               netdev_dbg(dev, "circular route to %pI6\n", &fl6->daddr);
> +               dst_release(dst);
> +               return ERR_PTR(-ELOOP);
> +       }
> +
> +       if (use_cache)
> +               dst_cache_set_ip6(dst_cache, dst, &fl6->saddr);
> +       return dst;
> +}
> +#endif

The route lookup logic is very similar to vxlan_get_route and
vxlan6_get_route. Can be reused?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-08 16:28   ` Willem de Bruijn
@ 2019-10-09 12:48     ` Martin Varghese
  2019-10-09 14:58       ` Willem de Bruijn
  2019-10-17 13:20     ` Martin Varghese
  1 sibling, 1 reply; 36+ messages in thread
From: Martin Varghese @ 2019-10-09 12:48 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, corbet, scott.drennan,
	Jiri Benc, martin.varghese

On Tue, Oct 08, 2019 at 12:28:23PM -0400, Willem de Bruijn wrote:
> On Tue, Oct 8, 2019 at 5:51 AM Martin Varghese
> <martinvarghesenokia@gmail.com> wrote:
> >
> > From: Martin <martin.varghese@nokia.com>
> >
> > The Bareudp tunnel module provides a generic L3 encapsulation
> > tunnelling module for tunnelling different protocols like MPLS,
> > IP,NSH etc inside a UDP tunnel.
> >
> > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > ---
> >  Documentation/networking/bareudp.txt |  23 +
> >  drivers/net/Kconfig                  |  13 +
> >  drivers/net/Makefile                 |   1 +
> >  drivers/net/bareudp.c                | 930 +++++++++++++++++++++++++++++++++++
> >  include/net/bareudp.h                |  19 +
> >  include/uapi/linux/if_link.h         |  12 +
> >  6 files changed, 998 insertions(+)
> >  create mode 100644 Documentation/networking/bareudp.txt
> >  create mode 100644 drivers/net/bareudp.c
> >  create mode 100644 include/net/bareudp.h
> >
> > diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> > new file mode 100644
> > index 0000000..d2530e2
> > --- /dev/null
> > +++ b/Documentation/networking/bareudp.txt
> > @@ -0,0 +1,23 @@
> > +Bare UDP Tunnelling Module Documentation
> > +========================================
> > +
> > +There are various L3 encapsulation standards using UDP being discussed to
> > +leverage the UDP based load balancing capability of different networks.
> > +MPLSoUDP (https://tools.ietf.org/html/rfc7510)is one among them.
> > +
> > +The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
> > +support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
> > +a UDP tunnel.
> 
> This patch set introduces a lot of code, much of which duplicates
> existing tunnel devices, whether FOU using ipip tunnels and
> fou_build_header or separate devices like vxlan and geneve. Let's try
> to reuse what's there (and tested).
> 
> How does this differ from foo-over-udp (FOU). In supporting MPLS
> alongside IP? If so, can it reuse the existing code, possibly with
> patches to that existing tunnel logic?
> 
> I happened to have been playing around with MPLS in GRE recently. Have
> not tried the same over UDP tunnels, but most of it should apply?
> 
>   ip tunnel add gre1 mode gre local 192.168.1.1 remote 192.168.1.2 dev eth0
>   ip addr add 192.168.101.1 peer 192.168.101.2 dev gre1
>   ip link set dev gre1 up
> 
>   sysctl net.mpls.conf.gre1.input=1
>   sysctl net.mpls.platform_labels=17
>   ip addr add 192.168.201.1/24 dev gre1
>   ip route add 192.168.202.0/24 encap mpls 17 dev gre1
>   ip -f mpls route add 16 dev lo
> 
> 
unlike FOU which does l4 encapsulation, here we are trying to acheive l3 encapsulation.
For Example,  What is needed for MPLSoUDP is to take packets such as:

eth header | mpls header | payload

and encapsulate them to:

eth header | ip header | udp header | mpls header | payload
<--------- outer ------------------> <---- inner --------->

which is later decapsulated back to:

eth header | mpls header | payload

Note that in the decapsulated case, the ethertype in the eth header is ETH_P_MPLS_UC, while in the encapsulated case, the ethertype in the eth header is ETH_P_IP. When decapsulating, ETH_P_MPLS_UC is
restored based on the configured tunnel parameters.

This cannot be done with FOU. FOU needs an ipproto, not ethertype.

During receive FOU dispatches packet to l4 handlers which makes it impossible to patch it to get it working for l3 protocols like MPLS.

> > +static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> > +{
> > +       struct bareudp_sock *bs;
> > +       struct ethhdr *eh;
> > +       struct bareudp_dev *bareudp;
> > +       struct metadata_dst *tun_dst = NULL;
> > +       struct pcpu_sw_netstats *stats;
> > +       unsigned int len;
> > +       int err = 0;
> > +       void *oiph;
> > +       u16 proto;
> > +
> > +       if (unlikely(!pskb_may_pull(skb, BAREUDP_BASE_HLEN)))
> > +               goto drop;
> > +
> > +       bs = rcu_dereference_sk_user_data(sk);
> > +       if (!bs)
> > +               goto drop;
> > +
> > +       bareudp = bs->bareudp;
> > +       proto = bareudp->ethertype;
> > +
> > +       if (iptunnel_pull_header(skb, BAREUDP_BASE_HLEN,
> > +                                proto,
> > +                                !net_eq(bareudp->net,
> > +                                        dev_net(bareudp->dev)))) {
> > +               bareudp->dev->stats.rx_dropped++;
> > +               goto drop;
> > +       }
> > +       tun_dst = udp_tun_rx_dst(skb, bareudp_get_sk_family(bs), TUNNEL_KEY,
> > +                                0, 0);
> > +       if (!tun_dst) {
> > +               bareudp->dev->stats.rx_dropped++;
> > +               goto drop;
> > +       }
> > +       skb_dst_set(skb, &tun_dst->dst);
> 
> Is this dst metadata a firm requirement? It is optional in vxlan, say.
> If here, too, please split such parts off into separate follow-on
> patches.
> 
>
Yes it is. 
> > +       skb_push(skb, sizeof(struct ethhdr));
> > +       eh = (struct ethhdr *)skb->data;
> > +       eh->h_proto = proto;
> > +
> > +       skb_reset_mac_header(skb);
> > +       skb->protocol = eth_type_trans(skb, bareudp->dev);
> > +       skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > +       oiph = skb_network_header(skb);
> > +       skb_reset_network_header(skb);
> > +
> > +       if (bareudp_get_sk_family(bs) == AF_INET)
> 
> This should be derived from packet contents, not socket state.
> Although the one implies the other, I imagine.
> 
> > +static struct rtable *bareudp_get_v4_rt(struct sk_buff *skb,
> > +                                       struct net_device *dev,
> > +                                       struct bareudp_sock *bs4,
> > +                                       struct flowi4 *fl4,
> > +                                       const struct ip_tunnel_info *info)
> > +{
> > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > +       struct dst_cache *dst_cache;
> > +       struct rtable *rt = NULL;
> > +       __u8 tos;
> > +
> > +       if (!bs4)
> > +               return ERR_PTR(-EIO);
> > +
> > +       memset(fl4, 0, sizeof(*fl4));
> > +       fl4->flowi4_mark = skb->mark;
> > +       fl4->flowi4_proto = IPPROTO_UDP;
> > +       fl4->daddr = info->key.u.ipv4.dst;
> > +       fl4->saddr = info->key.u.ipv4.src;
> > +
> > +       tos = info->key.tos;
> > +       fl4->flowi4_tos = RT_TOS(tos);
> > +
> > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > +       if (use_cache) {
> > +               rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
> > +               if (rt)
> > +                       return rt;
> > +       }
> > +       rt = ip_route_output_key(bareudp->net, fl4);
> > +       if (IS_ERR(rt)) {
> > +               netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
> > +               return ERR_PTR(-ENETUNREACH);
> > +       }
> > +       if (rt->dst.dev == dev) { /* is this necessary? */
> > +               netdev_dbg(dev, "circular route to %pI4\n", &fl4->daddr);
> > +               ip_rt_put(rt);
> > +               return ERR_PTR(-ELOOP);
> > +       }
> > +       if (use_cache)
> > +               dst_cache_set_ip4(dst_cache, &rt->dst, fl4->saddr);
> > +       return rt;
> > +}
> > +
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +static struct dst_entry *bareudp_get_v6_dst(struct sk_buff *skb,
> > +                                           struct net_device *dev,
> > +                                           struct bareudp_sock *bs6,
> > +                                           struct flowi6 *fl6,
> > +                                           const struct ip_tunnel_info *info)
> > +{
> > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > +       struct dst_entry *dst = NULL;
> > +       struct dst_cache *dst_cache;
> > +       __u8 prio;
> > +
> > +       if (!bs6)
> > +               return ERR_PTR(-EIO);
> > +
> > +       memset(fl6, 0, sizeof(*fl6));
> > +       fl6->flowi6_mark = skb->mark;
> > +       fl6->flowi6_proto = IPPROTO_UDP;
> > +       fl6->daddr = info->key.u.ipv6.dst;
> > +       fl6->saddr = info->key.u.ipv6.src;
> > +       prio = info->key.tos;
> > +
> > +       fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
> > +                                          info->key.label);
> > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > +       if (use_cache) {
> > +               dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
> > +               if (dst)
> > +                       return dst;
> > +       }
> > +       if (ipv6_stub->ipv6_dst_lookup(bareudp->net, bs6->sock->sk, &dst,
> > +                                      fl6)) {
> > +               netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
> > +               return ERR_PTR(-ENETUNREACH);
> > +       }
> > +       if (dst->dev == dev) { /* is this necessary? */
> > +               netdev_dbg(dev, "circular route to %pI6\n", &fl6->daddr);
> > +               dst_release(dst);
> > +               return ERR_PTR(-ELOOP);
> > +       }
> > +
> > +       if (use_cache)
> > +               dst_cache_set_ip6(dst_cache, dst, &fl6->saddr);
> > +       return dst;
> > +}
> > +#endif
> 
> The route lookup logic is very similar to vxlan_get_route and
> vxlan6_get_route. Can be reused?

Yes  we could. We can move these to a common place. include/net/ip_tunnels.h ? 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 2/2] Special handling for IP & MPLS.
  2019-10-08 16:09   ` Willem de Bruijn
@ 2019-10-09 13:38     ` Martin Varghese
  2019-10-09 15:06       ` Willem de Bruijn
  0 siblings, 1 reply; 36+ messages in thread
From: Martin Varghese @ 2019-10-09 13:38 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, corbet, scott.drennan,
	Jiri Benc, martin.varghese

On Tue, Oct 08, 2019 at 12:09:49PM -0400, Willem de Bruijn wrote:
> On Tue, Oct 8, 2019 at 5:52 AM Martin Varghese
> <martinvarghesenokia@gmail.com> wrote:
> >
> > From: Martin <martin.varghese@nokia.com>
> >
> 
> This commit would need a commit message.
> 
> > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> >
> > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > ---
> >  Documentation/networking/bareudp.txt | 18 ++++++++
> >  drivers/net/bareudp.c                | 82 +++++++++++++++++++++++++++++++++---
> >  include/net/bareudp.h                |  1 +
> >  include/uapi/linux/if_link.h         |  1 +
> >  4 files changed, 95 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> > index d2530e2..4de1022 100644
> > --- a/Documentation/networking/bareudp.txt
> > +++ b/Documentation/networking/bareudp.txt
> > @@ -9,6 +9,15 @@ The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
> >  support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
> >  a UDP tunnel.
> >
> > +Special Handling
> > +----------------
> > +The bareudp device supports special handling for MPLS & IP as they can have
> > +multiple ethertypes.
> 
> Special in what way?
> 
The bareudp device associates a L3 protocol (ethertype) with a UDP port.
For some protocols like MPLS,IP there exists multiplle ethertypes.
IPV6 and IPV4 ethertypes for IP and MPLS unicast & Multicast ethertypes for
MPLS. There coud be use cases where both MPLS unicast and multicast traffic
need to be tunnelled using the same bareudp device.Similarly for ipv4 and ipv6.

This problem is solved by introducing a flag called extended mode which should be used 
be with IPv4 and MPLS unicast ethertypes.

The extended mode flag when used with IPV4 ethertype enables the bareudp device to
recognize & support IPV4 & v6.

The extended mode flag when used with MPLS unicast ethertype enables bareudp device
to recognize & support MPLS unicast & multicast.

> > +MPLS procotcol can have ethertypes 0x8847 (unicast) & 0x8847 (multicast).
> 
> 0x8848. Use ETH_P_MPLS_UC and ETH_P_MPLS_MC instead of hard coding constants.
> 
> > +IP proctocol can have ethertypes 0x0800 (v4) & 0x866 (v6).
> > +This special handling can be enabled only for ethertype 0x0800 & 0x88847 with a
> 
> Again typo.
> 
> > +flag called extended mode.
> > +
> >  Usage
> >  ------
> >
> > @@ -21,3 +30,12 @@ This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
> >  The device will listen on UDP port 6635 to receive traffic.
> >
> >  b. ip link delete bareudp0
> > +
> > +2. Device creation with extended mode enabled
> > +
> > +There are two ways to create a bareudp device for MPLS & IP with extended mode
> > +enabled
> > +
> > +a. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype 0x8847 extmode 1
> > +
> > +b. ip link add dev  bareudp0 type bareudp dstport 6635 ethertype mpls
> > diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
> > index 7e6813a..2a688da 100644
> > --- a/drivers/net/bareudp.c
> > +++ b/drivers/net/bareudp.c
> > @@ -48,6 +48,7 @@ struct bareudp_dev {
> >         struct net_device  *dev;        /* netdev for bareudp tunnel */
> >         __be16             ethertype;
> >         u16                sport_min;
> > +       bool               ext_mode;
> >         struct bareudp_conf conf;
> >         struct bareudp_sock __rcu *sock4; /* IPv4 socket for bareudp tunnel */
> >  #if IS_ENABLED(CONFIG_IPV6)
> > @@ -82,15 +83,64 @@ static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> >                 goto drop;
> >
> >         bareudp = bs->bareudp;
> > -       proto = bareudp->ethertype;
> > +       if (!bareudp)
> > +               goto drop;
> > +
> > +       if (bareudp->ethertype == htons(ETH_P_IP)) {
> > +               struct iphdr *iphdr;
> > +
> > +               iphdr = (struct iphdr *)(skb->data + BAREUDP_BASE_HLEN);
> > +               if (iphdr->version == 4) {
> > +                       proto = bareudp->ethertype;
> > +               } else if (bareudp->ext_mode && (iphdr->version == 6)) {
> > +                       proto = htons(ETH_P_IPV6);
> 
> Verified packet length before reading at offset? Why does v6 needs
> extended mode, while v4 does not?
>
Explained above.
 
> For any packet encapsulated in UDP, the inner packet type will be
> unknown, so needs to be configured on the device. That is not a
> special feature. FOU gives an example. My main concern is that this
> introduces a lot of code that nearly duplicates existing tunneling
> support. It is not clear to me that existing logic cannot be
> reused/extended.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-09 12:48     ` Martin Varghese
@ 2019-10-09 14:58       ` Willem de Bruijn
  2019-10-09 15:21         ` Willem de Bruijn
  2019-10-09 15:42         ` Jiri Benc
  0 siblings, 2 replies; 36+ messages in thread
From: Willem de Bruijn @ 2019-10-09 14:58 UTC (permalink / raw)
  To: Martin Varghese
  Cc: Willem de Bruijn, Network Development, David Miller, corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Wed, Oct 9, 2019 at 8:48 AM Martin Varghese
<martinvarghesenokia@gmail.com> wrote:
>
> On Tue, Oct 08, 2019 at 12:28:23PM -0400, Willem de Bruijn wrote:
> > On Tue, Oct 8, 2019 at 5:51 AM Martin Varghese
> > <martinvarghesenokia@gmail.com> wrote:
> > >
> > > From: Martin <martin.varghese@nokia.com>
> > >
> > > The Bareudp tunnel module provides a generic L3 encapsulation
> > > tunnelling module for tunnelling different protocols like MPLS,
> > > IP,NSH etc inside a UDP tunnel.
> > >
> > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > > +++ b/Documentation/networking/bareudp.txt
> > > @@ -0,0 +1,23 @@
> > > +Bare UDP Tunnelling Module Documentation
> > > +========================================
> > > +
> > > +There are various L3 encapsulation standards using UDP being discussed to
> > > +leverage the UDP based load balancing capability of different networks.
> > > +MPLSoUDP (https://tools.ietf.org/html/rfc7510)is one among them.
> > > +
> > > +The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
> > > +support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
> > > +a UDP tunnel.
> >
> > This patch set introduces a lot of code, much of which duplicates
> > existing tunnel devices, whether FOU using ipip tunnels and
> > fou_build_header or separate devices like vxlan and geneve. Let's try
> > to reuse what's there (and tested).
> >
> > How does this differ from foo-over-udp (FOU). In supporting MPLS
> > alongside IP? If so, can it reuse the existing code, possibly with
> > patches to that existing tunnel logic?
> >
> > I happened to have been playing around with MPLS in GRE recently. Have
> > not tried the same over UDP tunnels, but most of it should apply?
> >
> >   ip tunnel add gre1 mode gre local 192.168.1.1 remote 192.168.1.2 dev eth0
> >   ip addr add 192.168.101.1 peer 192.168.101.2 dev gre1
> >   ip link set dev gre1 up
> >
> >   sysctl net.mpls.conf.gre1.input=1
> >   sysctl net.mpls.platform_labels=17
> >   ip addr add 192.168.201.1/24 dev gre1
> >   ip route add 192.168.202.0/24 encap mpls 17 dev gre1
> >   ip -f mpls route add 16 dev lo
> >
> >
> unlike FOU which does l4 encapsulation, here we are trying to acheive l3 encapsulation.
> For Example,  What is needed for MPLSoUDP is to take packets such as:
>
> eth header | mpls header | payload
>
> and encapsulate them to:
>
> eth header | ip header | udp header | mpls header | payload
> <--------- outer ------------------> <---- inner --------->
>
> which is later decapsulated back to:
>
> eth header | mpls header | payload
>
> Note that in the decapsulated case, the ethertype in the eth header is ETH_P_MPLS_UC, while in the encapsulated case, the ethertype in the eth header is ETH_P_IP. When decapsulating, ETH_P_MPLS_UC is
> restored based on the configured tunnel parameters.
>
> This cannot be done with FOU. FOU needs an ipproto, not ethertype.
>
> During receive FOU dispatches packet to l4 handlers which makes it impossible to patch it to get it working for l3 protocols like MPLS.

Yes, this needs a call to gro_cells_receive like geneve_udp_encap_recv
and vxlan_rcv, instead of returning a negative value back to
udp_queue_rcv_one_skb. Though that's not a major change.

Transmit is easier. The example I shared already encapsulates packets
with MPLs and sends them through a tunnel device. Admittedly using
mpls routing. But the ip tunneling infrastructure supports adding
arbitrary inner headers using ip_tunnel_encap_ops.build_header.
net/ipv4/fou.c could be extended with a mpls_build_header alongside
fou_build_header and gue_build_header?

Extending this existing code has more benefits than code reuse (and
thereby fewer bugs), it also allows reusing the existing GRO logic,
say.

At a high level, I think we should try hard to avoid duplicating
tunnel code for every combination of inner and outer protocol. Geneve
and vxlan on the one hand and generic ip tunnel and FOU on the other
implement most of these features. Indeed, already implements the
IPxIPx, just not MPLS. It will be less code to add MPLS support based
on existing infrastructure.

> > The route lookup logic is very similar to vxlan_get_route and
> > vxlan6_get_route. Can be reused?
>
> Yes  we could. We can move these to a common place. include/net/ip_tunnels.h ?

I think it will be preferable to work the other way around and extend
existing ip tunnel infra to add MPLS.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 2/2] Special handling for IP & MPLS.
  2019-10-09 13:38     ` Martin Varghese
@ 2019-10-09 15:06       ` Willem de Bruijn
  2019-10-09 15:19         ` Willem de Bruijn
  0 siblings, 1 reply; 36+ messages in thread
From: Willem de Bruijn @ 2019-10-09 15:06 UTC (permalink / raw)
  To: Martin Varghese
  Cc: Willem de Bruijn, Network Development, David Miller, corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Wed, Oct 9, 2019 at 9:39 AM Martin Varghese
<martinvarghesenokia@gmail.com> wrote:
>
> On Tue, Oct 08, 2019 at 12:09:49PM -0400, Willem de Bruijn wrote:
> > On Tue, Oct 8, 2019 at 5:52 AM Martin Varghese
> > <martinvarghesenokia@gmail.com> wrote:
> > >
> > > From: Martin <martin.varghese@nokia.com>
> > >
> >
> > This commit would need a commit message.
> >
> > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > >
> > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > > ---
> > >  Documentation/networking/bareudp.txt | 18 ++++++++
> > >  drivers/net/bareudp.c                | 82 +++++++++++++++++++++++++++++++++---
> > >  include/net/bareudp.h                |  1 +
> > >  include/uapi/linux/if_link.h         |  1 +
> > >  4 files changed, 95 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> > > index d2530e2..4de1022 100644
> > > --- a/Documentation/networking/bareudp.txt
> > > +++ b/Documentation/networking/bareudp.txt
> > > @@ -9,6 +9,15 @@ The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
> > >  support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
> > >  a UDP tunnel.
> > >
> > > +Special Handling
> > > +----------------
> > > +The bareudp device supports special handling for MPLS & IP as they can have
> > > +multiple ethertypes.
> >
> > Special in what way?
> >
> The bareudp device associates a L3 protocol (ethertype) with a UDP port.
> For some protocols like MPLS,IP there exists multiplle ethertypes.
> IPV6 and IPV4 ethertypes for IP and MPLS unicast & Multicast ethertypes for
> MPLS. There coud be use cases where both MPLS unicast and multicast traffic
> need to be tunnelled using the same bareudp device.Similarly for ipv4 and ipv6.

IP is already solved. I would focus on MPLS.

Also, the days where IPv6 is optional (and needs IPv4 enabled) are
behind us, really.

Maybe just let the admin explicitly specify MPLS unicast, multicast or
both, instead of defining a new extended label.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 2/2] Special handling for IP & MPLS.
  2019-10-09 15:06       ` Willem de Bruijn
@ 2019-10-09 15:19         ` Willem de Bruijn
  0 siblings, 0 replies; 36+ messages in thread
From: Willem de Bruijn @ 2019-10-09 15:19 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin Varghese, Network Development, David Miller, corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Wed, Oct 9, 2019 at 11:06 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Oct 9, 2019 at 9:39 AM Martin Varghese
> <martinvarghesenokia@gmail.com> wrote:
> >
> > On Tue, Oct 08, 2019 at 12:09:49PM -0400, Willem de Bruijn wrote:
> > > On Tue, Oct 8, 2019 at 5:52 AM Martin Varghese
> > > <martinvarghesenokia@gmail.com> wrote:
> > > >
> > > > From: Martin <martin.varghese@nokia.com>
> > > >
> > >
> > > This commit would need a commit message.
> > >
> > > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > > >
> > > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > > > ---
> > > >  Documentation/networking/bareudp.txt | 18 ++++++++
> > > >  drivers/net/bareudp.c                | 82 +++++++++++++++++++++++++++++++++---
> > > >  include/net/bareudp.h                |  1 +
> > > >  include/uapi/linux/if_link.h         |  1 +
> > > >  4 files changed, 95 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> > > > index d2530e2..4de1022 100644
> > > > --- a/Documentation/networking/bareudp.txt
> > > > +++ b/Documentation/networking/bareudp.txt
> > > > @@ -9,6 +9,15 @@ The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
> > > >  support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
> > > >  a UDP tunnel.
> > > >
> > > > +Special Handling
> > > > +----------------
> > > > +The bareudp device supports special handling for MPLS & IP as they can have
> > > > +multiple ethertypes.
> > >
> > > Special in what way?
> > >
> > The bareudp device associates a L3 protocol (ethertype) with a UDP port.
> > For some protocols like MPLS,IP there exists multiplle ethertypes.
> > IPV6 and IPV4 ethertypes for IP and MPLS unicast & Multicast ethertypes for
> > MPLS. There coud be use cases where both MPLS unicast and multicast traffic
> > need to be tunnelled using the same bareudp device.Similarly for ipv4 and ipv6.
>
> IP is already solved. I would focus on MPLS.
>
> Also, the days where IPv6 is optional (and needs IPv4 enabled) are
> behind us, really.

Ah sorry, there is nothing stopping someone from creating an
ETH_P_IPV6 only device before this path.

> Maybe just let the admin explicitly specify MPLS unicast, multicast or
> both, instead of defining a new extended label.

Deriving the inner protocol type from the outer protocol mode sounds
fine, indeed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-09 14:58       ` Willem de Bruijn
@ 2019-10-09 15:21         ` Willem de Bruijn
  2019-10-09 15:42         ` Jiri Benc
  1 sibling, 0 replies; 36+ messages in thread
From: Willem de Bruijn @ 2019-10-09 15:21 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin Varghese, Network Development, David Miller, corbet,
	scott.drennan, Jiri Benc, martin.varghese

> > Yes  we could. We can move these to a common place. include/net/ip_tunnels.h ?
>
> I think it will be preferable to work the other way around and extend
> existing ip tunnel infra to add MPLS.

But that may not be entirely feasible, not sure yet. Else, indeed
moving such helpers to a common location sounds good.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-09 14:58       ` Willem de Bruijn
  2019-10-09 15:21         ` Willem de Bruijn
@ 2019-10-09 15:42         ` Jiri Benc
  2019-10-09 16:15           ` Willem de Bruijn
  2019-10-18 20:03           ` Tom Herbert
  1 sibling, 2 replies; 36+ messages in thread
From: Jiri Benc @ 2019-10-09 15:42 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin Varghese, Network Development, David Miller, corbet,
	scott.drennan, martin.varghese

On Wed, 9 Oct 2019 10:58:51 -0400, Willem de Bruijn wrote:
> Yes, this needs a call to gro_cells_receive like geneve_udp_encap_recv
> and vxlan_rcv, instead of returning a negative value back to
> udp_queue_rcv_one_skb. Though that's not a major change.

Willem, how would you do that? The whole fou code including its netlink
API is built around the assumption that it's L4 encapsulation. I see no
way to extend it to do L3 encapsulation without major hacks - in fact,
you'll add all that Martin's patch adds, including a new netlink API,
but just mix that into fou.c.

> Transmit is easier. The example I shared already encapsulates packets
> with MPLs and sends them through a tunnel device. Admittedly using
> mpls routing. But the ip tunneling infrastructure supports adding
> arbitrary inner headers using ip_tunnel_encap_ops.build_header.
> net/ipv4/fou.c could be extended with a mpls_build_header alongside
> fou_build_header and gue_build_header?

The nice thing about the bareudp tunnel is that it's generic. It's just
(outer) UDP header followed by (inner) L3 header. You can use it for
NSH over UDP. Or for multitude of other purposes.

What you're proposing is MPLS only. And it would require similar amount
of code as the bareudp generic solution. It also doesn't fit fou
design: MPLS is not an IP protocol. Trying to add it there is like
forcing a round peg into a square hole. Just look at the code.
Start with parse_nl_config.

> Extending this existing code has more benefits than code reuse (and
> thereby fewer bugs), it also allows reusing the existing GRO logic,
> say.

In this case, it's the opposite: trying to retrofit L3 encapsulation
into fou is going to introduce bugs.

Moreover, given the expected usage of bareudp, the nice benefit is it's
lwtunnel only. No need to carry all the baggage of standalone
configuration fou has.

> At a high level, I think we should try hard to avoid duplicating
> tunnel code for every combination of inner and outer protocol.

Yet you're proposing to do exactly that with special casing MPLS in fou.

I'd say bareudp is as generic as you could get it. It allows any L3
protocol inside UDP. You can hardly make it more generic than that.
With ETH_P_TEB, it could also encapsulate Ethernet (that is, L2).

> Geneve
> and vxlan on the one hand and generic ip tunnel and FOU on the other
> implement most of these features. Indeed, already implements the
> IPxIPx, just not MPLS. It will be less code to add MPLS support based
> on existing infrastructure.

You're mixing apples with oranges. IP tunnels and FOU encapsulate L4
payload. VXLAN and Geneve encapsulate L2 payload (or L3 payload for
VXLAN-GPE).

I agree that VXLAN, Geneve and the proposed bareudp module have
duplicated code. The solution is to factoring it out to a common
location.

> I think it will be preferable to work the other way around and extend
> existing ip tunnel infra to add MPLS.

You see, that's the problem: this is not IP tunneling.

 Jiri

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-09 15:42         ` Jiri Benc
@ 2019-10-09 16:15           ` Willem de Bruijn
  2019-10-18 20:03           ` Tom Herbert
  1 sibling, 0 replies; 36+ messages in thread
From: Willem de Bruijn @ 2019-10-09 16:15 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Willem de Bruijn, Martin Varghese, Network Development,
	David Miller, corbet, scott.drennan, martin.varghese

On Wed, Oct 9, 2019 at 11:42 AM Jiri Benc <jbenc@redhat.com> wrote:
>
> On Wed, 9 Oct 2019 10:58:51 -0400, Willem de Bruijn wrote:
> > Yes, this needs a call to gro_cells_receive like geneve_udp_encap_recv
> > and vxlan_rcv, instead of returning a negative value back to
> > udp_queue_rcv_one_skb. Though that's not a major change.
>
> Willem, how would you do that? The whole fou code including its netlink
> API is built around the assumption that it's L4 encapsulation. I see no
> way to extend it to do L3 encapsulation without major hacks - in fact,
> you'll add all that Martin's patch adds, including a new netlink API,
> but just mix that into fou.c.
>
> > Transmit is easier. The example I shared already encapsulates packets
> > with MPLs and sends them through a tunnel device. Admittedly using
> > mpls routing. But the ip tunneling infrastructure supports adding
> > arbitrary inner headers using ip_tunnel_encap_ops.build_header.
> > net/ipv4/fou.c could be extended with a mpls_build_header alongside
> > fou_build_header and gue_build_header?
>
> The nice thing about the bareudp tunnel is that it's generic. It's just
> (outer) UDP header followed by (inner) L3 header. You can use it for
> NSH over UDP. Or for multitude of other purposes.
>
> What you're proposing is MPLS only. And it would require similar amount
> of code as the bareudp generic solution. It also doesn't fit fou
> design: MPLS is not an IP protocol. Trying to add it there is like
> forcing a round peg into a square hole. Just look at the code.
> Start with parse_nl_config.
>
> > Extending this existing code has more benefits than code reuse (and
> > thereby fewer bugs), it also allows reusing the existing GRO logic,
> > say.
>
> In this case, it's the opposite: trying to retrofit L3 encapsulation
> into fou is going to introduce bugs.
>
> Moreover, given the expected usage of bareudp, the nice benefit is it's
> lwtunnel only. No need to carry all the baggage of standalone
> configuration fou has.

That's an interesting point. Is lwtunnel support actually
something that should carry over to fou?

> > At a high level, I think we should try hard to avoid duplicating
> > tunnel code for every combination of inner and outer protocol.
>
> Yet you're proposing to do exactly that with special casing MPLS in fou.
>
> I'd say bareudp is as generic as you could get it. It allows any L3
> protocol inside UDP. You can hardly make it more generic than that.
> With ETH_P_TEB, it could also encapsulate Ethernet (that is, L2).
>
> > Geneve
> > and vxlan on the one hand and generic ip tunnel and FOU on the other
> > implement most of these features. Indeed, already implements the
> > IPxIPx, just not MPLS. It will be less code to add MPLS support based
> > on existing infrastructure.
>
> You're mixing apples with oranges. IP tunnels and FOU encapsulate L4
> payload.

FOU encapsulates L3, no different than IPIP or GRE?

> VXLAN and Geneve encapsulate L2 payload (or L3 payload for
> VXLAN-GPE).
>
> I agree that VXLAN, Geneve and the proposed bareudp module have
> duplicated code. The solution is to factoring it out to a common
> location.

That sounds fine. The crux is that there really are use cases besides MPLS.

> > I think it will be preferable to work the other way around and extend
> > existing ip tunnel infra to add MPLS.
>
> You see, that's the problem: this is not IP tunneling.

 GRE also supports ETH_P_TEB, using the shared ip tunnel infrastructure.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-08 16:28   ` Willem de Bruijn
  2019-10-09 12:48     ` Martin Varghese
@ 2019-10-17 13:20     ` Martin Varghese
  2019-10-17 20:48       ` Willem de Bruijn
  1 sibling, 1 reply; 36+ messages in thread
From: Martin Varghese @ 2019-10-17 13:20 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, corbet, scott.drennan,
	Jiri Benc, martin.varghese

On Tue, Oct 08, 2019 at 12:28:23PM -0400, Willem de Bruijn wrote:
> On Tue, Oct 8, 2019 at 5:51 AM Martin Varghese
> <martinvarghesenokia@gmail.com> wrote:
> >
> > From: Martin <martin.varghese@nokia.com>
> >
> > The Bareudp tunnel module provides a generic L3 encapsulation
> > tunnelling module for tunnelling different protocols like MPLS,
> > IP,NSH etc inside a UDP tunnel.
> >
> > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > ---
> >  Documentation/networking/bareudp.txt |  23 +
> >  drivers/net/Kconfig                  |  13 +
> >  drivers/net/Makefile                 |   1 +
> >  drivers/net/bareudp.c                | 930 +++++++++++++++++++++++++++++++++++
> >  include/net/bareudp.h                |  19 +
> >  include/uapi/linux/if_link.h         |  12 +
> >  6 files changed, 998 insertions(+)
> >  create mode 100644 Documentation/networking/bareudp.txt
> >  create mode 100644 drivers/net/bareudp.c
> >  create mode 100644 include/net/bareudp.h
> >
> > diff --git a/Documentation/networking/bareudp.txt b/Documentation/networking/bareudp.txt
> > new file mode 100644
> > index 0000000..d2530e2
> > --- /dev/null
> > +++ b/Documentation/networking/bareudp.txt
> > @@ -0,0 +1,23 @@
> > +Bare UDP Tunnelling Module Documentation
> > +========================================
> > +
> > +There are various L3 encapsulation standards using UDP being discussed to
> > +leverage the UDP based load balancing capability of different networks.
> > +MPLSoUDP (https://tools.ietf.org/html/rfc7510)is one among them.
> > +
> > +The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
> > +support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
> > +a UDP tunnel.
> 
> This patch set introduces a lot of code, much of which duplicates
> existing tunnel devices, whether FOU using ipip tunnels and
> fou_build_header or separate devices like vxlan and geneve. Let's try
> to reuse what's there (and tested).
> 
> How does this differ from foo-over-udp (FOU). In supporting MPLS
> alongside IP? If so, can it reuse the existing code, possibly with
> patches to that existing tunnel logic?
> 
> I happened to have been playing around with MPLS in GRE recently. Have
> not tried the same over UDP tunnels, but most of it should apply?
> 
>   ip tunnel add gre1 mode gre local 192.168.1.1 remote 192.168.1.2 dev eth0
>   ip addr add 192.168.101.1 peer 192.168.101.2 dev gre1
>   ip link set dev gre1 up
> 
>   sysctl net.mpls.conf.gre1.input=1
>   sysctl net.mpls.platform_labels=17
>   ip addr add 192.168.201.1/24 dev gre1
>   ip route add 192.168.202.0/24 encap mpls 17 dev gre1
>   ip -f mpls route add 16 dev lo
> 
> 
> > +static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> > +{
> > +       struct bareudp_sock *bs;
> > +       struct ethhdr *eh;
> > +       struct bareudp_dev *bareudp;
> > +       struct metadata_dst *tun_dst = NULL;
> > +       struct pcpu_sw_netstats *stats;
> > +       unsigned int len;
> > +       int err = 0;
> > +       void *oiph;
> > +       u16 proto;
> > +
> > +       if (unlikely(!pskb_may_pull(skb, BAREUDP_BASE_HLEN)))
> > +               goto drop;
> > +
> > +       bs = rcu_dereference_sk_user_data(sk);
> > +       if (!bs)
> > +               goto drop;
> > +
> > +       bareudp = bs->bareudp;
> > +       proto = bareudp->ethertype;
> > +
> > +       if (iptunnel_pull_header(skb, BAREUDP_BASE_HLEN,
> > +                                proto,
> > +                                !net_eq(bareudp->net,
> > +                                        dev_net(bareudp->dev)))) {
> > +               bareudp->dev->stats.rx_dropped++;
> > +               goto drop;
> > +       }
> > +       tun_dst = udp_tun_rx_dst(skb, bareudp_get_sk_family(bs), TUNNEL_KEY,
> > +                                0, 0);
> > +       if (!tun_dst) {
> > +               bareudp->dev->stats.rx_dropped++;
> > +               goto drop;
> > +       }
> > +       skb_dst_set(skb, &tun_dst->dst);
> 
> Is this dst metadata a firm requirement? It is optional in vxlan, say.
> If here, too, please split such parts off into separate follow-on
> patches.
> 
> 
> > +       skb_push(skb, sizeof(struct ethhdr));
> > +       eh = (struct ethhdr *)skb->data;
> > +       eh->h_proto = proto;
> > +
> > +       skb_reset_mac_header(skb);
> > +       skb->protocol = eth_type_trans(skb, bareudp->dev);
> > +       skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > +       oiph = skb_network_header(skb);
> > +       skb_reset_network_header(skb);
> > +
> > +       if (bareudp_get_sk_family(bs) == AF_INET)
> 
> This should be derived from packet contents, not socket state.
> Although the one implies the other, I imagine.
>

The IP Stack check IP headers & puts the packet in the correct socket, hence checking the ip headers again is reduntant correct?
In geneve & vxlan it is done the same way.

 
> > +static struct rtable *bareudp_get_v4_rt(struct sk_buff *skb,
> > +                                       struct net_device *dev,
> > +                                       struct bareudp_sock *bs4,
> > +                                       struct flowi4 *fl4,
> > +                                       const struct ip_tunnel_info *info)
> > +{
> > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > +       struct dst_cache *dst_cache;
> > +       struct rtable *rt = NULL;
> > +       __u8 tos;
> > +
> > +       if (!bs4)
> > +               return ERR_PTR(-EIO);
> > +
> > +       memset(fl4, 0, sizeof(*fl4));
> > +       fl4->flowi4_mark = skb->mark;
> > +       fl4->flowi4_proto = IPPROTO_UDP;
> > +       fl4->daddr = info->key.u.ipv4.dst;
> > +       fl4->saddr = info->key.u.ipv4.src;
> > +
> > +       tos = info->key.tos;
> > +       fl4->flowi4_tos = RT_TOS(tos);
> > +
> > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > +       if (use_cache) {
> > +               rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
> > +               if (rt)
> > +                       return rt;
> > +       }
> > +       rt = ip_route_output_key(bareudp->net, fl4);
> > +       if (IS_ERR(rt)) {
> > +               netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
> > +               return ERR_PTR(-ENETUNREACH);
> > +       }
> > +       if (rt->dst.dev == dev) { /* is this necessary? */
> > +               netdev_dbg(dev, "circular route to %pI4\n", &fl4->daddr);
> > +               ip_rt_put(rt);
> > +               return ERR_PTR(-ELOOP);
> > +       }
> > +       if (use_cache)
> > +               dst_cache_set_ip4(dst_cache, &rt->dst, fl4->saddr);
> > +       return rt;
> > +}
> > +
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +static struct dst_entry *bareudp_get_v6_dst(struct sk_buff *skb,
> > +                                           struct net_device *dev,
> > +                                           struct bareudp_sock *bs6,
> > +                                           struct flowi6 *fl6,
> > +                                           const struct ip_tunnel_info *info)
> > +{
> > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > +       struct dst_entry *dst = NULL;
> > +       struct dst_cache *dst_cache;
> > +       __u8 prio;
> > +
> > +       if (!bs6)
> > +               return ERR_PTR(-EIO);
> > +
> > +       memset(fl6, 0, sizeof(*fl6));
> > +       fl6->flowi6_mark = skb->mark;
> > +       fl6->flowi6_proto = IPPROTO_UDP;
> > +       fl6->daddr = info->key.u.ipv6.dst;
> > +       fl6->saddr = info->key.u.ipv6.src;
> > +       prio = info->key.tos;
> > +
> > +       fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
> > +                                          info->key.label);
> > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > +       if (use_cache) {
> > +               dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
> > +               if (dst)
> > +                       return dst;
> > +       }
> > +       if (ipv6_stub->ipv6_dst_lookup(bareudp->net, bs6->sock->sk, &dst,
> > +                                      fl6)) {
> > +               netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
> > +               return ERR_PTR(-ENETUNREACH);
> > +       }
> > +       if (dst->dev == dev) { /* is this necessary? */
> > +               netdev_dbg(dev, "circular route to %pI6\n", &fl6->daddr);
> > +               dst_release(dst);
> > +               return ERR_PTR(-ELOOP);
> > +       }
> > +
> > +       if (use_cache)
> > +               dst_cache_set_ip6(dst_cache, dst, &fl6->saddr);
> > +       return dst;
> > +}
> > +#endif
> 
> The route lookup logic is very similar to vxlan_get_route and
> vxlan6_get_route. Can be reused?

I had a look at the vxlan & geneve and it seems the corresponding functions  in those modules are tightly coupled  to the rest of the module design.
More specifically wrt the ttl inheritance & the caching behaviour. It may not be possible for those modules to use a new generic API unless without a change in those module design.

The bareudp module is a generic L3 encapsulation module. It could be used to tunnel different l3 protocols. TTL Inheritance behaviour when tunnelled
could be different for these inner protocols. Hence moving  this function to a common place will make it tough to change it later when a need arises for a new protocol

Otherwise we should have more generic function which takes the  generic IP header params as arguments. Then the point is we don’t need a function like that
We can just fill up "struct flowi4" and call ip_route_output_key or dst_cache_get_ip4 to get the route table entry


Thanks
Martin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-17 13:20     ` Martin Varghese
@ 2019-10-17 20:48       ` Willem de Bruijn
  2019-10-18  8:20         ` Martin Varghese
  0 siblings, 1 reply; 36+ messages in thread
From: Willem de Bruijn @ 2019-10-17 20:48 UTC (permalink / raw)
  To: Martin Varghese
  Cc: Willem de Bruijn, Network Development, David Miller, corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Thu, Oct 17, 2019 at 9:20 AM Martin Varghese
<martinvarghesenokia@gmail.com> wrote:
>
> On Tue, Oct 08, 2019 at 12:28:23PM -0400, Willem de Bruijn wrote:
> > On Tue, Oct 8, 2019 at 5:51 AM Martin Varghese
> > <martinvarghesenokia@gmail.com> wrote:
> > >
> > > From: Martin <martin.varghese@nokia.com>
> > >
> > > The Bareudp tunnel module provides a generic L3 encapsulation
> > > tunnelling module for tunnelling different protocols like MPLS,
> > > IP,NSH etc inside a UDP tunnel.
> > >
> > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > > ---

> > > +static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> > > +{
> >
> >
> > > +       skb_push(skb, sizeof(struct ethhdr));
> > > +       eh = (struct ethhdr *)skb->data;
> > > +       eh->h_proto = proto;
> > > +
> > > +       skb_reset_mac_header(skb);
> > > +       skb->protocol = eth_type_trans(skb, bareudp->dev);
> > > +       skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > > +       oiph = skb_network_header(skb);
> > > +       skb_reset_network_header(skb);
> > > +
> > > +       if (bareudp_get_sk_family(bs) == AF_INET)
> >
> > This should be derived from packet contents, not socket state.
> > Although the one implies the other, I imagine.
> >
>
> The IP Stack check IP headers & puts the packet in the correct socket, hence checking the ip headers again is reduntant correct?

This parses the inner packet after decapsulation. The protocol stack
has selected the socket based on the outer packet, right?

I guess the correctness comes from the administrator having configured
the bareudp for this protocol type, so implicitly guarantees that no
other inner packets will appear.

Also, the oiph pointer is a bit fragile now that a new mac header is
constructed in the space that used to hold the encapsulation headers.
I suppose it only updates eth->h_proto, which lies in the former udp
header. More fundamentally, is moving the mac header needed at all, if
the stack correctly uses skb_mac_header whenever it accesses also
after decapsulation?

> In geneve & vxlan it is done the same way.
>
>
> > > +static struct rtable *bareudp_get_v4_rt(struct sk_buff *skb,
> > > +                                       struct net_device *dev,
> > > +                                       struct bareudp_sock *bs4,
> > > +                                       struct flowi4 *fl4,
> > > +                                       const struct ip_tunnel_info *info)
> > > +{
> > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > +       struct dst_cache *dst_cache;
> > > +       struct rtable *rt = NULL;
> > > +       __u8 tos;
> > > +
> > > +       if (!bs4)
> > > +               return ERR_PTR(-EIO);
> > > +
> > > +       memset(fl4, 0, sizeof(*fl4));
> > > +       fl4->flowi4_mark = skb->mark;
> > > +       fl4->flowi4_proto = IPPROTO_UDP;
> > > +       fl4->daddr = info->key.u.ipv4.dst;
> > > +       fl4->saddr = info->key.u.ipv4.src;
> > > +
> > > +       tos = info->key.tos;
> > > +       fl4->flowi4_tos = RT_TOS(tos);
> > > +
> > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > +       if (use_cache) {
> > > +               rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
> > > +               if (rt)
> > > +                       return rt;
> > > +       }
> > > +       rt = ip_route_output_key(bareudp->net, fl4);
> > > +       if (IS_ERR(rt)) {
> > > +               netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
> > > +               return ERR_PTR(-ENETUNREACH);
> > > +       }
> > > +       if (rt->dst.dev == dev) { /* is this necessary? */
> > > +               netdev_dbg(dev, "circular route to %pI4\n", &fl4->daddr);
> > > +               ip_rt_put(rt);
> > > +               return ERR_PTR(-ELOOP);
> > > +       }
> > > +       if (use_cache)
> > > +               dst_cache_set_ip4(dst_cache, &rt->dst, fl4->saddr);
> > > +       return rt;
> > > +}
> > > +
> > > +#if IS_ENABLED(CONFIG_IPV6)
> > > +static struct dst_entry *bareudp_get_v6_dst(struct sk_buff *skb,
> > > +                                           struct net_device *dev,
> > > +                                           struct bareudp_sock *bs6,
> > > +                                           struct flowi6 *fl6,
> > > +                                           const struct ip_tunnel_info *info)
> > > +{
> > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > +       struct dst_entry *dst = NULL;
> > > +       struct dst_cache *dst_cache;
> > > +       __u8 prio;
> > > +
> > > +       if (!bs6)
> > > +               return ERR_PTR(-EIO);
> > > +
> > > +       memset(fl6, 0, sizeof(*fl6));
> > > +       fl6->flowi6_mark = skb->mark;
> > > +       fl6->flowi6_proto = IPPROTO_UDP;
> > > +       fl6->daddr = info->key.u.ipv6.dst;
> > > +       fl6->saddr = info->key.u.ipv6.src;
> > > +       prio = info->key.tos;
> > > +
> > > +       fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
> > > +                                          info->key.label);
> > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > +       if (use_cache) {
> > > +               dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
> > > +               if (dst)
> > > +                       return dst;
> > > +       }
> > > +       if (ipv6_stub->ipv6_dst_lookup(bareudp->net, bs6->sock->sk, &dst,
> > > +                                      fl6)) {
> > > +               netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
> > > +               return ERR_PTR(-ENETUNREACH);
> > > +       }
> > > +       if (dst->dev == dev) { /* is this necessary? */
> > > +               netdev_dbg(dev, "circular route to %pI6\n", &fl6->daddr);
> > > +               dst_release(dst);
> > > +               return ERR_PTR(-ELOOP);
> > > +       }
> > > +
> > > +       if (use_cache)
> > > +               dst_cache_set_ip6(dst_cache, dst, &fl6->saddr);
> > > +       return dst;
> > > +}
> > > +#endif
> >
> > The route lookup logic is very similar to vxlan_get_route and
> > vxlan6_get_route. Can be reused?
>
> I had a look at the vxlan & geneve and it seems the corresponding functions  in those modules are tightly coupled  to the rest of the module design.
> More specifically wrt the ttl inheritance & the caching behaviour. It may not be possible for those modules to use a new generic API unless without a change in those module design.

bareudp_get_v4_rt is identical to geneve_get_v4_rt down to the comment
aside from

        if ((tos == 1) && !geneve->collect_md) {
                tos = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
                use_cache = false;
        }

Same for bareudp_get_v6_dst and geneve_get_v6_dst.

Worst case that one branch could be made conditional on a boolean
argument? Maybe this collect_md part (eventually) makes sense to
bareudp, as well.



> The bareudp module is a generic L3 encapsulation module. It could be used to tunnel different l3 protocols. TTL Inheritance behaviour when tunnelled
> could be different for these inner protocols. Hence moving  this function to a common place will make it tough to change it later when a need arises for a new protocol
>
> Otherwise we should have more generic function which takes the  generic IP header params as arguments. Then the point is we don’t need a function like that
> We can just fill up "struct flowi4" and call ip_route_output_key or dst_cache_get_ip4 to get the route table entry
>
>
> Thanks
> Martin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-17 20:48       ` Willem de Bruijn
@ 2019-10-18  8:20         ` Martin Varghese
  2019-10-18 14:59           ` Willem de Bruijn
  0 siblings, 1 reply; 36+ messages in thread
From: Martin Varghese @ 2019-10-18  8:20 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, corbet, scott.drennan,
	Jiri Benc, martin.varghese

On Thu, Oct 17, 2019 at 04:48:26PM -0400, Willem de Bruijn wrote:
> On Thu, Oct 17, 2019 at 9:20 AM Martin Varghese
> <martinvarghesenokia@gmail.com> wrote:
> >
> > On Tue, Oct 08, 2019 at 12:28:23PM -0400, Willem de Bruijn wrote:
> > > On Tue, Oct 8, 2019 at 5:51 AM Martin Varghese
> > > <martinvarghesenokia@gmail.com> wrote:
> > > >
> > > > From: Martin <martin.varghese@nokia.com>
> > > >
> > > > The Bareudp tunnel module provides a generic L3 encapsulation
> > > > tunnelling module for tunnelling different protocols like MPLS,
> > > > IP,NSH etc inside a UDP tunnel.
> > > >
> > > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > > > ---
> 
> > > > +static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> > > > +{
> > >
> > >
> > > > +       skb_push(skb, sizeof(struct ethhdr));
> > > > +       eh = (struct ethhdr *)skb->data;
> > > > +       eh->h_proto = proto;
> > > > +
> > > > +       skb_reset_mac_header(skb);
> > > > +       skb->protocol = eth_type_trans(skb, bareudp->dev);
> > > > +       skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > > > +       oiph = skb_network_header(skb);
> > > > +       skb_reset_network_header(skb);
> > > > +
> > > > +       if (bareudp_get_sk_family(bs) == AF_INET)
> > >
> > > This should be derived from packet contents, not socket state.
> > > Although the one implies the other, I imagine.
> > >
> >
> > The IP Stack check IP headers & puts the packet in the correct socket, hence checking the ip headers again is reduntant correct?
> 
> This parses the inner packet after decapsulation. The protocol stack
> has selected the socket based on the outer packet, right?
> 

The check on socket  " if (bareudp_get_sk_family(bs) == AF_INET)"  was to find out the outer header was ipv4 and v6.
Based on that TOS/ECN of outer header is derived from oiph->tos for ipv4 and using ipv6_get_dsfield(oipv6h) for ipv6.
The TOS/ECN  of inner header are derived in funtions IP_ECN_decapsulate  & IP6_ECN_decapsulate.And they are derived from packet.
> I guess the correctness comes from the administrator having configured
> the bareudp for this protocol type, so implicitly guarantees that no
> other inner packets will appear.
> 
Yes that is correct.

> Also, the oiph pointer is a bit fragile now that a new mac header is
> constructed in the space that used to hold the encapsulation headers.
> I suppose it only updates eth->h_proto, which lies in the former udp
> header. More fundamentally, is moving the mac header needed at all, if
> the stack correctly uses skb_mac_header whenever it accesses also
> after decapsulation?
>

We need to move ethernet header. As there could be cases where the packet from a bareudp device is redirected via
other physical interface to a different network node for further processing.
I agree that oiph pointer is fragile, but since we are updating only proto field we are not corrupting the oiph.
But we can do ethernet header update once the oiph is no more used.It would entail setting the skb->protocol before calling IP_ECN_decapsulate 


 
> > In geneve & vxlan it is done the same way.
> >
> >
> > > > +static struct rtable *bareudp_get_v4_rt(struct sk_buff *skb,
> > > > +                                       struct net_device *dev,
> > > > +                                       struct bareudp_sock *bs4,
> > > > +                                       struct flowi4 *fl4,
> > > > +                                       const struct ip_tunnel_info *info)
> > > > +{
> > > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > > +       struct dst_cache *dst_cache;
> > > > +       struct rtable *rt = NULL;
> > > > +       __u8 tos;
> > > > +
> > > > +       if (!bs4)
> > > > +               return ERR_PTR(-EIO);
> > > > +
> > > > +       memset(fl4, 0, sizeof(*fl4));
> > > > +       fl4->flowi4_mark = skb->mark;
> > > > +       fl4->flowi4_proto = IPPROTO_UDP;
> > > > +       fl4->daddr = info->key.u.ipv4.dst;
> > > > +       fl4->saddr = info->key.u.ipv4.src;
> > > > +
> > > > +       tos = info->key.tos;
> > > > +       fl4->flowi4_tos = RT_TOS(tos);
> > > > +
> > > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > > +       if (use_cache) {
> > > > +               rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
> > > > +               if (rt)
> > > > +                       return rt;
> > > > +       }
> > > > +       rt = ip_route_output_key(bareudp->net, fl4);
> > > > +       if (IS_ERR(rt)) {
> > > > +               netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
> > > > +               return ERR_PTR(-ENETUNREACH);
> > > > +       }
> > > > +       if (rt->dst.dev == dev) { /* is this necessary? */
> > > > +               netdev_dbg(dev, "circular route to %pI4\n", &fl4->daddr);
> > > > +               ip_rt_put(rt);
> > > > +               return ERR_PTR(-ELOOP);
> > > > +       }
> > > > +       if (use_cache)
> > > > +               dst_cache_set_ip4(dst_cache, &rt->dst, fl4->saddr);
> > > > +       return rt;
> > > > +}
> > > > +
> > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > +static struct dst_entry *bareudp_get_v6_dst(struct sk_buff *skb,
> > > > +                                           struct net_device *dev,
> > > > +                                           struct bareudp_sock *bs6,
> > > > +                                           struct flowi6 *fl6,
> > > > +                                           const struct ip_tunnel_info *info)
> > > > +{
> > > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > > +       struct dst_entry *dst = NULL;
> > > > +       struct dst_cache *dst_cache;
> > > > +       __u8 prio;
> > > > +
> > > > +       if (!bs6)
> > > > +               return ERR_PTR(-EIO);
> > > > +
> > > > +       memset(fl6, 0, sizeof(*fl6));
> > > > +       fl6->flowi6_mark = skb->mark;
> > > > +       fl6->flowi6_proto = IPPROTO_UDP;
> > > > +       fl6->daddr = info->key.u.ipv6.dst;
> > > > +       fl6->saddr = info->key.u.ipv6.src;
> > > > +       prio = info->key.tos;
> > > > +
> > > > +       fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
> > > > +                                          info->key.label);
> > > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > > +       if (use_cache) {
> > > > +               dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
> > > > +               if (dst)
> > > > +                       return dst;
> > > > +       }
> > > > +       if (ipv6_stub->ipv6_dst_lookup(bareudp->net, bs6->sock->sk, &dst,
> > > > +                                      fl6)) {
> > > > +               netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
> > > > +               return ERR_PTR(-ENETUNREACH);
> > > > +       }
> > > > +       if (dst->dev == dev) { /* is this necessary? */
> > > > +               netdev_dbg(dev, "circular route to %pI6\n", &fl6->daddr);
> > > > +               dst_release(dst);
> > > > +               return ERR_PTR(-ELOOP);
> > > > +       }
> > > > +
> > > > +       if (use_cache)
> > > > +               dst_cache_set_ip6(dst_cache, dst, &fl6->saddr);
> > > > +       return dst;
> > > > +}
> > > > +#endif
> > >
> > > The route lookup logic is very similar to vxlan_get_route and
> > > vxlan6_get_route. Can be reused?
> >
> > I had a look at the vxlan & geneve and it seems the corresponding functions  in those modules are tightly coupled  to the rest of the module design.
> > More specifically wrt the ttl inheritance & the caching behaviour. It may not be possible for those modules to use a new generic API unless without a change in those module design.
> 
> bareudp_get_v4_rt is identical to geneve_get_v4_rt down to the comment
> aside from
> 
>         if ((tos == 1) && !geneve->collect_md) {
>                 tos = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
>                 use_cache = false;
>         }
> 
> Same for bareudp_get_v6_dst and geneve_get_v6_dst.
> 
> Worst case that one branch could be made conditional on a boolean
> argument? Maybe this collect_md part (eventually) makes sense to
> bareudp, as well.
> 
> 
Unlike Geneve, bareudp module is  a generic L3 encapsulation module and it could be used to tunnel different L3 protocols.
TTL inheritance requirements for these protocols will be different when tunnelled. For Example - TTL inheritance for MPLS & IP are different.
And moving this function to a common place will make it tough for Geneve & bareudp if a new L3 protocol with new TTL inheritance requirements shows up


> 
> > The bareudp module is a generic L3 encapsulation module. It could be used to tunnel different l3 protocols. TTL Inheritance behaviour when tunnelled
> > could be different for these inner protocols. Hence moving  this function to a common place will make it tough to change it later when a need arises for a new protocol
> >
> > Otherwise we should have more generic function which takes the  generic IP header params as arguments. Then the point is we don’t need a function like that
> > We can just fill up "struct flowi4" and call ip_route_output_key or dst_cache_get_ip4 to get the route table entry
> >
> >
> > Thanks
> > Martin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-18  8:20         ` Martin Varghese
@ 2019-10-18 14:59           ` Willem de Bruijn
  2019-10-23  2:40             ` Martin Varghese
  2019-11-07 13:38             ` Martin Varghese
  0 siblings, 2 replies; 36+ messages in thread
From: Willem de Bruijn @ 2019-10-18 14:59 UTC (permalink / raw)
  To: Martin Varghese
  Cc: Willem de Bruijn, Network Development, David Miller, corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Fri, Oct 18, 2019 at 4:20 AM Martin Varghese
<martinvarghesenokia@gmail.com> wrote:
>
> On Thu, Oct 17, 2019 at 04:48:26PM -0400, Willem de Bruijn wrote:
> > On Thu, Oct 17, 2019 at 9:20 AM Martin Varghese
> > <martinvarghesenokia@gmail.com> wrote:
> > >
> > > On Tue, Oct 08, 2019 at 12:28:23PM -0400, Willem de Bruijn wrote:
> > > > On Tue, Oct 8, 2019 at 5:51 AM Martin Varghese
> > > > <martinvarghesenokia@gmail.com> wrote:
> > > > >
> > > > > From: Martin <martin.varghese@nokia.com>
> > > > >
> > > > > The Bareudp tunnel module provides a generic L3 encapsulation
> > > > > tunnelling module for tunnelling different protocols like MPLS,
> > > > > IP,NSH etc inside a UDP tunnel.
> > > > >
> > > > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > > > > ---
> >
> > > > > +static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> > > > > +{
> > > >
> > > >
> > > > > +       skb_push(skb, sizeof(struct ethhdr));
> > > > > +       eh = (struct ethhdr *)skb->data;
> > > > > +       eh->h_proto = proto;
> > > > > +
> > > > > +       skb_reset_mac_header(skb);
> > > > > +       skb->protocol = eth_type_trans(skb, bareudp->dev);
> > > > > +       skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > > > > +       oiph = skb_network_header(skb);
> > > > > +       skb_reset_network_header(skb);
> > > > > +
> > > > > +       if (bareudp_get_sk_family(bs) == AF_INET)
> > > >
> > > > This should be derived from packet contents, not socket state.
> > > > Although the one implies the other, I imagine.
> > > >
> > >
> > > The IP Stack check IP headers & puts the packet in the correct socket, hence checking the ip headers again is reduntant correct?
> >
> > This parses the inner packet after decapsulation. The protocol stack
> > has selected the socket based on the outer packet, right?
> >
>
> The check on socket  " if (bareudp_get_sk_family(bs) == AF_INET)"  was to find out the outer header was ipv4 and v6.
> Based on that TOS/ECN of outer header is derived from oiph->tos for ipv4 and using ipv6_get_dsfield(oipv6h) for ipv6.
> The TOS/ECN  of inner header are derived in funtions IP_ECN_decapsulate  & IP6_ECN_decapsulate.And they are derived from packet.
> > I guess the correctness comes from the administrator having configured
> > the bareudp for this protocol type, so implicitly guarantees that no
> > other inner packets will appear.
> >
> Yes that is correct.
>
> > Also, the oiph pointer is a bit fragile now that a new mac header is
> > constructed in the space that used to hold the encapsulation headers.
> > I suppose it only updates eth->h_proto, which lies in the former udp
> > header. More fundamentally, is moving the mac header needed at all, if
> > the stack correctly uses skb_mac_header whenever it accesses also
> > after decapsulation?
> >
>
> We need to move ethernet header. As there could be cases where the packet from a bareudp device is redirected via
> other physical interface to a different network node for further processing.
> I agree that oiph pointer is fragile, but since we are updating only proto field we are not corrupting the oiph.
> But we can do ethernet header update once the oiph is no more used.It would entail setting the skb->protocol before calling IP_ECN_decapsulate
>
>
>
> > > In geneve & vxlan it is done the same way.

I see, yes, geneve does the same thing.

> > >
> > >
> > > > > +static struct rtable *bareudp_get_v4_rt(struct sk_buff *skb,
> > > > > +                                       struct net_device *dev,
> > > > > +                                       struct bareudp_sock *bs4,
> > > > > +                                       struct flowi4 *fl4,
> > > > > +                                       const struct ip_tunnel_info *info)
> > > > > +{
> > > > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > > > +       struct dst_cache *dst_cache;
> > > > > +       struct rtable *rt = NULL;
> > > > > +       __u8 tos;
> > > > > +
> > > > > +       if (!bs4)
> > > > > +               return ERR_PTR(-EIO);
> > > > > +
> > > > > +       memset(fl4, 0, sizeof(*fl4));
> > > > > +       fl4->flowi4_mark = skb->mark;
> > > > > +       fl4->flowi4_proto = IPPROTO_UDP;
> > > > > +       fl4->daddr = info->key.u.ipv4.dst;
> > > > > +       fl4->saddr = info->key.u.ipv4.src;
> > > > > +
> > > > > +       tos = info->key.tos;
> > > > > +       fl4->flowi4_tos = RT_TOS(tos);
> > > > > +
> > > > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > > > +       if (use_cache) {
> > > > > +               rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
> > > > > +               if (rt)
> > > > > +                       return rt;
> > > > > +       }
> > > > > +       rt = ip_route_output_key(bareudp->net, fl4);
> > > > > +       if (IS_ERR(rt)) {
> > > > > +               netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
> > > > > +               return ERR_PTR(-ENETUNREACH);
> > > > > +       }
> > > > > +       if (rt->dst.dev == dev) { /* is this necessary? */
> > > > > +               netdev_dbg(dev, "circular route to %pI4\n", &fl4->daddr);
> > > > > +               ip_rt_put(rt);
> > > > > +               return ERR_PTR(-ELOOP);
> > > > > +       }
> > > > > +       if (use_cache)
> > > > > +               dst_cache_set_ip4(dst_cache, &rt->dst, fl4->saddr);
> > > > > +       return rt;
> > > > > +}
> > > > > +
> > > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > > +static struct dst_entry *bareudp_get_v6_dst(struct sk_buff *skb,
> > > > > +                                           struct net_device *dev,
> > > > > +                                           struct bareudp_sock *bs6,
> > > > > +                                           struct flowi6 *fl6,
> > > > > +                                           const struct ip_tunnel_info *info)
> > > > > +{
> > > > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > > > +       struct dst_entry *dst = NULL;
> > > > > +       struct dst_cache *dst_cache;
> > > > > +       __u8 prio;
> > > > > +
> > > > > +       if (!bs6)
> > > > > +               return ERR_PTR(-EIO);
> > > > > +
> > > > > +       memset(fl6, 0, sizeof(*fl6));
> > > > > +       fl6->flowi6_mark = skb->mark;
> > > > > +       fl6->flowi6_proto = IPPROTO_UDP;
> > > > > +       fl6->daddr = info->key.u.ipv6.dst;
> > > > > +       fl6->saddr = info->key.u.ipv6.src;
> > > > > +       prio = info->key.tos;
> > > > > +
> > > > > +       fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
> > > > > +                                          info->key.label);
> > > > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > > > +       if (use_cache) {
> > > > > +               dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
> > > > > +               if (dst)
> > > > > +                       return dst;
> > > > > +       }
> > > > > +       if (ipv6_stub->ipv6_dst_lookup(bareudp->net, bs6->sock->sk, &dst,
> > > > > +                                      fl6)) {
> > > > > +               netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
> > > > > +               return ERR_PTR(-ENETUNREACH);
> > > > > +       }
> > > > > +       if (dst->dev == dev) { /* is this necessary? */
> > > > > +               netdev_dbg(dev, "circular route to %pI6\n", &fl6->daddr);
> > > > > +               dst_release(dst);
> > > > > +               return ERR_PTR(-ELOOP);
> > > > > +       }
> > > > > +
> > > > > +       if (use_cache)
> > > > > +               dst_cache_set_ip6(dst_cache, dst, &fl6->saddr);
> > > > > +       return dst;
> > > > > +}
> > > > > +#endif
> > > >
> > > > The route lookup logic is very similar to vxlan_get_route and
> > > > vxlan6_get_route. Can be reused?
> > >
> > > I had a look at the vxlan & geneve and it seems the corresponding functions  in those modules are tightly coupled  to the rest of the module design.
> > > More specifically wrt the ttl inheritance & the caching behaviour. It may not be possible for those modules to use a new generic API unless without a change in those module design.
> >
> > bareudp_get_v4_rt is identical to geneve_get_v4_rt down to the comment
> > aside from
> >
> >         if ((tos == 1) && !geneve->collect_md) {
> >                 tos = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
> >                 use_cache = false;
> >         }
> >
> > Same for bareudp_get_v6_dst and geneve_get_v6_dst.
> >
> > Worst case that one branch could be made conditional on a boolean
> > argument? Maybe this collect_md part (eventually) makes sense to
> > bareudp, as well.
> >
> >
> Unlike Geneve, bareudp module is  a generic L3 encapsulation module and it could be used to tunnel different L3 protocols.
> TTL inheritance requirements for these protocols will be different when tunnelled. For Example - TTL inheritance for MPLS & IP are different.
> And moving this function to a common place will make it tough for Geneve & bareudp if a new L3 protocol with new TTL inheritance requirements shows up

But that is not in geneve_get_v4_rt and its bareudp/v6_dst variants.

I do think that with close scrutiny there is a lot more room for code
deduplication. Just look at the lower half of geneve_rx and
bareudp_udp_encap_recv, for instance. This, too, is identical down to
the comments. Indeed, is it fair to say that geneve was taken as the
basis for this device?

That said, even just avoiding duplicating those routing functions
would be a good start.

I'm harping on this because in other examples in the past where a new
device was created by duplicating instead of factoring out code
implementations diverge over time in bad ways due to optimizations,
features and most importantly bugfixes being applied only to one
instance or the other. See for instance tun.c and tap.c.

Unrelated, an ipv6 socket can receive both ipv4 and ipv6 traffic if
not setting the v6only bit, so does the device need to have separate
sock4 and sock6 members? Both sockets currently lead to the same
bareudp_udp_encap_recv callback function.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-09 15:42         ` Jiri Benc
  2019-10-09 16:15           ` Willem de Bruijn
@ 2019-10-18 20:03           ` Tom Herbert
  2019-10-21 17:18             ` Jiri Benc
  1 sibling, 1 reply; 36+ messages in thread
From: Tom Herbert @ 2019-10-18 20:03 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Willem de Bruijn, Martin Varghese, Network Development,
	David Miller, Jonathan Corbet, scott.drennan, martin.varghese

On Wed, Oct 9, 2019 at 8:42 AM Jiri Benc <jbenc@redhat.com> wrote:
>
> On Wed, 9 Oct 2019 10:58:51 -0400, Willem de Bruijn wrote:
> > Yes, this needs a call to gro_cells_receive like geneve_udp_encap_recv
> > and vxlan_rcv, instead of returning a negative value back to
> > udp_queue_rcv_one_skb. Though that's not a major change.
>
> Willem, how would you do that? The whole fou code including its netlink
> API is built around the assumption that it's L4 encapsulation. I see no
> way to extend it to do L3 encapsulation without major hacks - in fact,
> you'll add all that Martin's patch adds, including a new netlink API,
> but just mix that into fou.c.
>
More specifically fou allows encapsulation of anything that has an IP
protocol number. That includes an L3 protocols that have been assigned
a number (e.g. MPLS, GRE, IPv6, IPv4, EtherIP). So the only need for
an alternate method to do L3 encapsulation would be for those
protocols that don't have an IP protocol number assignment.
Presumably, those just have an EtherType. In that case, it seems
simple enough to just extend fou to processed an encapsulated
EtherType. This should be little more than modifying the "struct fou"
to hold 16 bit EtherType (union with protocol), adding
FOU_ENCAP_ETHER, corresponding attribute, and then populate
appropriate receive functions for the socket.

Tom

> > Transmit is easier. The example I shared already encapsulates packets
> > with MPLs and sends them through a tunnel device. Admittedly using
> > mpls routing. But the ip tunneling infrastructure supports adding
> > arbitrary inner headers using ip_tunnel_encap_ops.build_header.
> > net/ipv4/fou.c could be extended with a mpls_build_header alongside
> > fou_build_header and gue_build_header?
>
> The nice thing about the bareudp tunnel is that it's generic. It's just
> (outer) UDP header followed by (inner) L3 header. You can use it for
> NSH over UDP. Or for multitude of other purposes.
>
> What you're proposing is MPLS only. And it would require similar amount
> of code as the bareudp generic solution. It also doesn't fit fou
> design: MPLS is not an IP protocol. Trying to add it there is like
> forcing a round peg into a square hole. Just look at the code.
> Start with parse_nl_config.
>
> > Extending this existing code has more benefits than code reuse (and
> > thereby fewer bugs), it also allows reusing the existing GRO logic,
> > say.
>
> In this case, it's the opposite: trying to retrofit L3 encapsulation
> into fou is going to introduce bugs.
>
> Moreover, given the expected usage of bareudp, the nice benefit is it's
> lwtunnel only. No need to carry all the baggage of standalone
> configuration fou has.
>
> > At a high level, I think we should try hard to avoid duplicating
> > tunnel code for every combination of inner and outer protocol.
>
> Yet you're proposing to do exactly that with special casing MPLS in fou.
>
> I'd say bareudp is as generic as you could get it. It allows any L3
> protocol inside UDP. You can hardly make it more generic than that.
> With ETH_P_TEB, it could also encapsulate Ethernet (that is, L2).
>
> > Geneve
> > and vxlan on the one hand and generic ip tunnel and FOU on the other
> > implement most of these features. Indeed, already implements the
> > IPxIPx, just not MPLS. It will be less code to add MPLS support based
> > on existing infrastructure.
>
> You're mixing apples with oranges. IP tunnels and FOU encapsulate L4
> payload. VXLAN and Geneve encapsulate L2 payload (or L3 payload for
> VXLAN-GPE).
>
> I agree that VXLAN, Geneve and the proposed bareudp module have
> duplicated code. The solution is to factoring it out to a common
> location.
>
> > I think it will be preferable to work the other way around and extend
> > existing ip tunnel infra to add MPLS.
>
> You see, that's the problem: this is not IP tunneling.
>
>  Jiri

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-18 20:03           ` Tom Herbert
@ 2019-10-21 17:18             ` Jiri Benc
  0 siblings, 0 replies; 36+ messages in thread
From: Jiri Benc @ 2019-10-21 17:18 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Willem de Bruijn, Martin Varghese, Network Development,
	David Miller, Jonathan Corbet, scott.drennan, martin.varghese

On Fri, 18 Oct 2019 13:03:56 -0700, Tom Herbert wrote:
> More specifically fou allows encapsulation of anything that has an IP
> protocol number. That includes an L3 protocols that have been assigned
> a number (e.g. MPLS, GRE, IPv6, IPv4, EtherIP). So the only need for
> an alternate method to do L3 encapsulation would be for those
> protocols that don't have an IP protocol number assignment.
> Presumably, those just have an EtherType. In that case, it seems
> simple enough to just extend fou to processed an encapsulated
> EtherType. This should be little more than modifying the "struct fou"
> to hold 16 bit EtherType (union with protocol), adding
> FOU_ENCAP_ETHER, corresponding attribute, and then populate
> appropriate receive functions for the socket.

How do you suggest to plug that in? We need the received inner packets
to be "encapsulated" in an Ethernet header; the current approach of
"ip link add type ipip|sit encap fou" does not work here. Are you
suggesting to add another rtnl link type? How would it look like?
"ip link add type ethernet encap fou" does not sound appealing,
especially since "type ethernet" would have no meaning without the
"encap fou".

Thanks,

 Jiri


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-18 14:59           ` Willem de Bruijn
@ 2019-10-23  2:40             ` Martin Varghese
  2019-11-07 13:38             ` Martin Varghese
  1 sibling, 0 replies; 36+ messages in thread
From: Martin Varghese @ 2019-10-23  2:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, corbet, scott.drennan,
	Jiri Benc, martin.varghese

On Fri, Oct 18, 2019 at 10:59:47AM -0400, Willem de Bruijn wrote:
> On Fri, Oct 18, 2019 at 4:20 AM Martin Varghese
> <martinvarghesenokia@gmail.com> wrote:
> >
> > On Thu, Oct 17, 2019 at 04:48:26PM -0400, Willem de Bruijn wrote:
> > > On Thu, Oct 17, 2019 at 9:20 AM Martin Varghese
> > > <martinvarghesenokia@gmail.com> wrote:
> > > >
> > > > On Tue, Oct 08, 2019 at 12:28:23PM -0400, Willem de Bruijn wrote:
> > > > > On Tue, Oct 8, 2019 at 5:51 AM Martin Varghese
> > > > > <martinvarghesenokia@gmail.com> wrote:
> > > > > >
> > > > > > From: Martin <martin.varghese@nokia.com>
> > > > > >
> > > > > > The Bareudp tunnel module provides a generic L3 encapsulation
> > > > > > tunnelling module for tunnelling different protocols like MPLS,
> > > > > > IP,NSH etc inside a UDP tunnel.
> > > > > >
> > > > > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > > > > > ---
> > >
> > > > > > +static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> > > > > > +{
> > > > >
> > > > >
> > > > > > +       skb_push(skb, sizeof(struct ethhdr));
> > > > > > +       eh = (struct ethhdr *)skb->data;
> > > > > > +       eh->h_proto = proto;
> > > > > > +
> > > > > > +       skb_reset_mac_header(skb);
> > > > > > +       skb->protocol = eth_type_trans(skb, bareudp->dev);
> > > > > > +       skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > > > > > +       oiph = skb_network_header(skb);
> > > > > > +       skb_reset_network_header(skb);
> > > > > > +
> > > > > > +       if (bareudp_get_sk_family(bs) == AF_INET)
> > > > >
> > > > > This should be derived from packet contents, not socket state.
> > > > > Although the one implies the other, I imagine.
> > > > >
> > > >
> > > > The IP Stack check IP headers & puts the packet in the correct socket, hence checking the ip headers again is reduntant correct?
> > >
> > > This parses the inner packet after decapsulation. The protocol stack
> > > has selected the socket based on the outer packet, right?
> > >
> >
> > The check on socket  " if (bareudp_get_sk_family(bs) == AF_INET)"  was to find out the outer header was ipv4 and v6.
> > Based on that TOS/ECN of outer header is derived from oiph->tos for ipv4 and using ipv6_get_dsfield(oipv6h) for ipv6.
> > The TOS/ECN  of inner header are derived in funtions IP_ECN_decapsulate  & IP6_ECN_decapsulate.And they are derived from packet.
> > > I guess the correctness comes from the administrator having configured
> > > the bareudp for this protocol type, so implicitly guarantees that no
> > > other inner packets will appear.
> > >
> > Yes that is correct.
> >
> > > Also, the oiph pointer is a bit fragile now that a new mac header is
> > > constructed in the space that used to hold the encapsulation headers.
> > > I suppose it only updates eth->h_proto, which lies in the former udp
> > > header. More fundamentally, is moving the mac header needed at all, if
> > > the stack correctly uses skb_mac_header whenever it accesses also
> > > after decapsulation?
> > >
> >
> > We need to move ethernet header. As there could be cases where the packet from a bareudp device is redirected via
> > other physical interface to a different network node for further processing.
> > I agree that oiph pointer is fragile, but since we are updating only proto field we are not corrupting the oiph.
> > But we can do ethernet header update once the oiph is no more used.It would entail setting the skb->protocol before calling IP_ECN_decapsulate
> >
> >
> >
> > > > In geneve & vxlan it is done the same way.
> 
> I see, yes, geneve does the same thing.
> 
> > > >
> > > >
> > > > > > +static struct rtable *bareudp_get_v4_rt(struct sk_buff *skb,
> > > > > > +                                       struct net_device *dev,
> > > > > > +                                       struct bareudp_sock *bs4,
> > > > > > +                                       struct flowi4 *fl4,
> > > > > > +                                       const struct ip_tunnel_info *info)
> > > > > > +{
> > > > > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > > > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > > > > +       struct dst_cache *dst_cache;
> > > > > > +       struct rtable *rt = NULL;
> > > > > > +       __u8 tos;
> > > > > > +
> > > > > > +       if (!bs4)
> > > > > > +               return ERR_PTR(-EIO);
> > > > > > +
> > > > > > +       memset(fl4, 0, sizeof(*fl4));
> > > > > > +       fl4->flowi4_mark = skb->mark;
> > > > > > +       fl4->flowi4_proto = IPPROTO_UDP;
> > > > > > +       fl4->daddr = info->key.u.ipv4.dst;
> > > > > > +       fl4->saddr = info->key.u.ipv4.src;
> > > > > > +
> > > > > > +       tos = info->key.tos;
> > > > > > +       fl4->flowi4_tos = RT_TOS(tos);
> > > > > > +
> > > > > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > > > > +       if (use_cache) {
> > > > > > +               rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
> > > > > > +               if (rt)
> > > > > > +                       return rt;
> > > > > > +       }
> > > > > > +       rt = ip_route_output_key(bareudp->net, fl4);
> > > > > > +       if (IS_ERR(rt)) {
> > > > > > +               netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
> > > > > > +               return ERR_PTR(-ENETUNREACH);
> > > > > > +       }
> > > > > > +       if (rt->dst.dev == dev) { /* is this necessary? */
> > > > > > +               netdev_dbg(dev, "circular route to %pI4\n", &fl4->daddr);
> > > > > > +               ip_rt_put(rt);
> > > > > > +               return ERR_PTR(-ELOOP);
> > > > > > +       }
> > > > > > +       if (use_cache)
> > > > > > +               dst_cache_set_ip4(dst_cache, &rt->dst, fl4->saddr);
> > > > > > +       return rt;
> > > > > > +}
> > > > > > +
> > > > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > > > +static struct dst_entry *bareudp_get_v6_dst(struct sk_buff *skb,
> > > > > > +                                           struct net_device *dev,
> > > > > > +                                           struct bareudp_sock *bs6,
> > > > > > +                                           struct flowi6 *fl6,
> > > > > > +                                           const struct ip_tunnel_info *info)
> > > > > > +{
> > > > > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > > > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > > > > +       struct dst_entry *dst = NULL;
> > > > > > +       struct dst_cache *dst_cache;
> > > > > > +       __u8 prio;
> > > > > > +
> > > > > > +       if (!bs6)
> > > > > > +               return ERR_PTR(-EIO);
> > > > > > +
> > > > > > +       memset(fl6, 0, sizeof(*fl6));
> > > > > > +       fl6->flowi6_mark = skb->mark;
> > > > > > +       fl6->flowi6_proto = IPPROTO_UDP;
> > > > > > +       fl6->daddr = info->key.u.ipv6.dst;
> > > > > > +       fl6->saddr = info->key.u.ipv6.src;
> > > > > > +       prio = info->key.tos;
> > > > > > +
> > > > > > +       fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
> > > > > > +                                          info->key.label);
> > > > > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > > > > +       if (use_cache) {
> > > > > > +               dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
> > > > > > +               if (dst)
> > > > > > +                       return dst;
> > > > > > +       }
> > > > > > +       if (ipv6_stub->ipv6_dst_lookup(bareudp->net, bs6->sock->sk, &dst,
> > > > > > +                                      fl6)) {
> > > > > > +               netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
> > > > > > +               return ERR_PTR(-ENETUNREACH);
> > > > > > +       }
> > > > > > +       if (dst->dev == dev) { /* is this necessary? */
> > > > > > +               netdev_dbg(dev, "circular route to %pI6\n", &fl6->daddr);
> > > > > > +               dst_release(dst);
> > > > > > +               return ERR_PTR(-ELOOP);
> > > > > > +       }
> > > > > > +
> > > > > > +       if (use_cache)
> > > > > > +               dst_cache_set_ip6(dst_cache, dst, &fl6->saddr);
> > > > > > +       return dst;
> > > > > > +}
> > > > > > +#endif
> > > > >
> > > > > The route lookup logic is very similar to vxlan_get_route and
> > > > > vxlan6_get_route. Can be reused?
> > > >
> > > > I had a look at the vxlan & geneve and it seems the corresponding functions  in those modules are tightly coupled  to the rest of the module design.
> > > > More specifically wrt the ttl inheritance & the caching behaviour. It may not be possible for those modules to use a new generic API unless without a change in those module design.
> > >
> > > bareudp_get_v4_rt is identical to geneve_get_v4_rt down to the comment
> > > aside from
> > >
> > >         if ((tos == 1) && !geneve->collect_md) {
> > >                 tos = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
> > >                 use_cache = false;
> > >         }
> > >
> > > Same for bareudp_get_v6_dst and geneve_get_v6_dst.
> > >
> > > Worst case that one branch could be made conditional on a boolean
> > > argument? Maybe this collect_md part (eventually) makes sense to
> > > bareudp, as well.
> > >
> > >
> > Unlike Geneve, bareudp module is  a generic L3 encapsulation module and it could be used to tunnel different L3 protocols.
> > TTL inheritance requirements for these protocols will be different when tunnelled. For Example - TTL inheritance for MPLS & IP are different.
> > And moving this function to a common place will make it tough for Geneve & bareudp if a new L3 protocol with new TTL inheritance requirements shows up
> 
> But that is not in geneve_get_v4_rt and its bareudp/v6_dst variants.
>
Geneve has a TTL inheritance code in the function

if ((tos == 1) && !geneve->collect_md) {
                 tos = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
                 use_cache = false;
         }
 
> I do think that with close scrutiny there is a lot more room for code
> deduplication. Just look at the lower half of geneve_rx and
> bareudp_udp_encap_recv, for instance. This, too, is identical down to
> the comments. Indeed, is it fair to say that geneve was taken as the
> basis for this device?
> 
Yes it is
> That said, even just avoiding duplicating those routing functions
> would be a good start.
> 

I propose to have a generic route function with the  below prototype

iptunnel_get_v4_rt(struct sk_buff *skb,struct net_device *dev,struct bareudp_sock *bs4,struct flowi4 *fl4,
                   const struct ip_tunnel_info *info
                   bool use_cache  )

And another patch series for other drivers to use this new function

> I'm harping on this because in other examples in the past where a new
> device was created by duplicating instead of factoring out code
> implementations diverge over time in bad ways due to optimizations,
> features and most importantly bugfixes being applied only to one
> instance or the other. See for instance tun.c and tap.c.
> 
> Unrelated, an ipv6 socket can receive both ipv4 and ipv6 traffic if
> not setting the v6only bit, so does the device need to have separate
> sock4 and sock6 members? Both sockets currently lead to the same
> bareudp_udp_encap_recv callback function.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-10-18 14:59           ` Willem de Bruijn
  2019-10-23  2:40             ` Martin Varghese
@ 2019-11-07 13:38             ` Martin Varghese
  2019-11-07 15:53               ` Willem de Bruijn
  1 sibling, 1 reply; 36+ messages in thread
From: Martin Varghese @ 2019-11-07 13:38 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, corbet, scott.drennan,
	Jiri Benc, martin.varghese

On Fri, Oct 18, 2019 at 10:59:47AM -0400, Willem de Bruijn wrote:
> On Fri, Oct 18, 2019 at 4:20 AM Martin Varghese
> <martinvarghesenokia@gmail.com> wrote:
> >
> > On Thu, Oct 17, 2019 at 04:48:26PM -0400, Willem de Bruijn wrote:
> > > On Thu, Oct 17, 2019 at 9:20 AM Martin Varghese
> > > <martinvarghesenokia@gmail.com> wrote:
> > > >
> > > > On Tue, Oct 08, 2019 at 12:28:23PM -0400, Willem de Bruijn wrote:
> > > > > On Tue, Oct 8, 2019 at 5:51 AM Martin Varghese
> > > > > <martinvarghesenokia@gmail.com> wrote:
> > > > > >
> > > > > > From: Martin <martin.varghese@nokia.com>
> > > > > >
> > > > > > The Bareudp tunnel module provides a generic L3 encapsulation
> > > > > > tunnelling module for tunnelling different protocols like MPLS,
> > > > > > IP,NSH etc inside a UDP tunnel.
> > > > > >
> > > > > > Signed-off-by: Martin Varghese <martinvarghesenokia@gmail.com>
> > > > > > ---
> > >
> > > > > > +static int bareudp_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> > > > > > +{
> > > > >
> > > > >
> > > > > > +       skb_push(skb, sizeof(struct ethhdr));
> > > > > > +       eh = (struct ethhdr *)skb->data;
> > > > > > +       eh->h_proto = proto;
> > > > > > +
> > > > > > +       skb_reset_mac_header(skb);
> > > > > > +       skb->protocol = eth_type_trans(skb, bareudp->dev);
> > > > > > +       skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > > > > > +       oiph = skb_network_header(skb);
> > > > > > +       skb_reset_network_header(skb);
> > > > > > +
> > > > > > +       if (bareudp_get_sk_family(bs) == AF_INET)
> > > > >
> > > > > This should be derived from packet contents, not socket state.
> > > > > Although the one implies the other, I imagine.
> > > > >
> > > >
> > > > The IP Stack check IP headers & puts the packet in the correct socket, hence checking the ip headers again is reduntant correct?
> > >
> > > This parses the inner packet after decapsulation. The protocol stack
> > > has selected the socket based on the outer packet, right?
> > >
> >
> > The check on socket  " if (bareudp_get_sk_family(bs) == AF_INET)"  was to find out the outer header was ipv4 and v6.
> > Based on that TOS/ECN of outer header is derived from oiph->tos for ipv4 and using ipv6_get_dsfield(oipv6h) for ipv6.
> > The TOS/ECN  of inner header are derived in funtions IP_ECN_decapsulate  & IP6_ECN_decapsulate.And they are derived from packet.
> > > I guess the correctness comes from the administrator having configured
> > > the bareudp for this protocol type, so implicitly guarantees that no
> > > other inner packets will appear.
> > >
> > Yes that is correct.
> >
> > > Also, the oiph pointer is a bit fragile now that a new mac header is
> > > constructed in the space that used to hold the encapsulation headers.
> > > I suppose it only updates eth->h_proto, which lies in the former udp
> > > header. More fundamentally, is moving the mac header needed at all, if
> > > the stack correctly uses skb_mac_header whenever it accesses also
> > > after decapsulation?
> > >
> >
> > We need to move ethernet header. As there could be cases where the packet from a bareudp device is redirected via
> > other physical interface to a different network node for further processing.
> > I agree that oiph pointer is fragile, but since we are updating only proto field we are not corrupting the oiph.
> > But we can do ethernet header update once the oiph is no more used.It would entail setting the skb->protocol before calling IP_ECN_decapsulate
> >
> >
> >
> > > > In geneve & vxlan it is done the same way.
> 
> I see, yes, geneve does the same thing.
> 
> > > >
> > > >
> > > > > > +static struct rtable *bareudp_get_v4_rt(struct sk_buff *skb,
> > > > > > +                                       struct net_device *dev,
> > > > > > +                                       struct bareudp_sock *bs4,
> > > > > > +                                       struct flowi4 *fl4,
> > > > > > +                                       const struct ip_tunnel_info *info)
> > > > > > +{
> > > > > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > > > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > > > > +       struct dst_cache *dst_cache;
> > > > > > +       struct rtable *rt = NULL;
> > > > > > +       __u8 tos;
> > > > > > +
> > > > > > +       if (!bs4)
> > > > > > +               return ERR_PTR(-EIO);
> > > > > > +
> > > > > > +       memset(fl4, 0, sizeof(*fl4));
> > > > > > +       fl4->flowi4_mark = skb->mark;
> > > > > > +       fl4->flowi4_proto = IPPROTO_UDP;
> > > > > > +       fl4->daddr = info->key.u.ipv4.dst;
> > > > > > +       fl4->saddr = info->key.u.ipv4.src;
> > > > > > +
> > > > > > +       tos = info->key.tos;
> > > > > > +       fl4->flowi4_tos = RT_TOS(tos);
> > > > > > +
> > > > > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > > > > +       if (use_cache) {
> > > > > > +               rt = dst_cache_get_ip4(dst_cache, &fl4->saddr);
> > > > > > +               if (rt)
> > > > > > +                       return rt;
> > > > > > +       }
> > > > > > +       rt = ip_route_output_key(bareudp->net, fl4);
> > > > > > +       if (IS_ERR(rt)) {
> > > > > > +               netdev_dbg(dev, "no route to %pI4\n", &fl4->daddr);
> > > > > > +               return ERR_PTR(-ENETUNREACH);
> > > > > > +       }
> > > > > > +       if (rt->dst.dev == dev) { /* is this necessary? */
> > > > > > +               netdev_dbg(dev, "circular route to %pI4\n", &fl4->daddr);
> > > > > > +               ip_rt_put(rt);
> > > > > > +               return ERR_PTR(-ELOOP);
> > > > > > +       }
> > > > > > +       if (use_cache)
> > > > > > +               dst_cache_set_ip4(dst_cache, &rt->dst, fl4->saddr);
> > > > > > +       return rt;
> > > > > > +}
> > > > > > +
> > > > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > > > +static struct dst_entry *bareudp_get_v6_dst(struct sk_buff *skb,
> > > > > > +                                           struct net_device *dev,
> > > > > > +                                           struct bareudp_sock *bs6,
> > > > > > +                                           struct flowi6 *fl6,
> > > > > > +                                           const struct ip_tunnel_info *info)
> > > > > > +{
> > > > > > +       bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
> > > > > > +       struct bareudp_dev *bareudp = netdev_priv(dev);
> > > > > > +       struct dst_entry *dst = NULL;
> > > > > > +       struct dst_cache *dst_cache;
> > > > > > +       __u8 prio;
> > > > > > +
> > > > > > +       if (!bs6)
> > > > > > +               return ERR_PTR(-EIO);
> > > > > > +
> > > > > > +       memset(fl6, 0, sizeof(*fl6));
> > > > > > +       fl6->flowi6_mark = skb->mark;
> > > > > > +       fl6->flowi6_proto = IPPROTO_UDP;
> > > > > > +       fl6->daddr = info->key.u.ipv6.dst;
> > > > > > +       fl6->saddr = info->key.u.ipv6.src;
> > > > > > +       prio = info->key.tos;
> > > > > > +
> > > > > > +       fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
> > > > > > +                                          info->key.label);
> > > > > > +       dst_cache = (struct dst_cache *)&info->dst_cache;
> > > > > > +       if (use_cache) {
> > > > > > +               dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
> > > > > > +               if (dst)
> > > > > > +                       return dst;
> > > > > > +       }
> > > > > > +       if (ipv6_stub->ipv6_dst_lookup(bareudp->net, bs6->sock->sk, &dst,
> > > > > > +                                      fl6)) {
> > > > > > +               netdev_dbg(dev, "no route to %pI6\n", &fl6->daddr);
> > > > > > +               return ERR_PTR(-ENETUNREACH);
> > > > > > +       }
> > > > > > +       if (dst->dev == dev) { /* is this necessary? */
> > > > > > +               netdev_dbg(dev, "circular route to %pI6\n", &fl6->daddr);
> > > > > > +               dst_release(dst);
> > > > > > +               return ERR_PTR(-ELOOP);
> > > > > > +       }
> > > > > > +
> > > > > > +       if (use_cache)
> > > > > > +               dst_cache_set_ip6(dst_cache, dst, &fl6->saddr);
> > > > > > +       return dst;
> > > > > > +}
> > > > > > +#endif
> > > > >
> > > > > The route lookup logic is very similar to vxlan_get_route and
> > > > > vxlan6_get_route. Can be reused?
> > > >
> > > > I had a look at the vxlan & geneve and it seems the corresponding functions  in those modules are tightly coupled  to the rest of the module design.
> > > > More specifically wrt the ttl inheritance & the caching behaviour. It may not be possible for those modules to use a new generic API unless without a change in those module design.
> > >
> > > bareudp_get_v4_rt is identical to geneve_get_v4_rt down to the comment
> > > aside from
> > >
> > >         if ((tos == 1) && !geneve->collect_md) {
> > >                 tos = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
> > >                 use_cache = false;
> > >         }
> > >
> > > Same for bareudp_get_v6_dst and geneve_get_v6_dst.
> > >
> > > Worst case that one branch could be made conditional on a boolean
> > > argument? Maybe this collect_md part (eventually) makes sense to
> > > bareudp, as well.
> > >
> > >
> > Unlike Geneve, bareudp module is  a generic L3 encapsulation module and it could be used to tunnel different L3 protocols.
> > TTL inheritance requirements for these protocols will be different when tunnelled. For Example - TTL inheritance for MPLS & IP are different.
> > And moving this function to a common place will make it tough for Geneve & bareudp if a new L3 protocol with new TTL inheritance requirements shows up
> 
> But that is not in geneve_get_v4_rt and its bareudp/v6_dst variants.
> 
> I do think that with close scrutiny there is a lot more room for code
> deduplication. Just look at the lower half of geneve_rx and
> bareudp_udp_encap_recv, for instance. This, too, is identical down to
> the comments. Indeed, is it fair to say that geneve was taken as the
> basis for this device?
> 
> That said, even just avoiding duplicating those routing functions
> would be a good start.
> 
> I'm harping on this because in other examples in the past where a new
> device was created by duplicating instead of factoring out code
> implementations diverge over time in bad ways due to optimizations,
> features and most importantly bugfixes being applied only to one
> instance or the other. See for instance tun.c and tap.c.
> 
> Unrelated, an ipv6 socket can receive both ipv4 and ipv6 traffic if
> not setting the v6only bit, so does the device need to have separate
> sock4 and sock6 members? Both sockets currently lead to the same
> bareudp_udp_encap_recv callback function.

I was checking this.AF_INET6 allows v6 and v4 mapped v6 address.
And it doesnot allow both at the same time.So we need both 
sockets to support v4 and v6 at the same time.correct ?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-07 13:38             ` Martin Varghese
@ 2019-11-07 15:53               ` Willem de Bruijn
  2019-11-07 16:12                 ` Martin Varghese
  0 siblings, 1 reply; 36+ messages in thread
From: Willem de Bruijn @ 2019-11-07 15:53 UTC (permalink / raw)
  To: Martin Varghese
  Cc: Network Development, David Miller, Jonathan Corbet,
	scott.drennan, Jiri Benc, martin.varghese

> > I do think that with close scrutiny there is a lot more room for code
> > deduplication. Just look at the lower half of geneve_rx and
> > bareudp_udp_encap_recv, for instance. This, too, is identical down to
> > the comments. Indeed, is it fair to say that geneve was taken as the
> > basis for this device?
> >
> > That said, even just avoiding duplicating those routing functions
> > would be a good start.
> >
> > I'm harping on this because in other examples in the past where a new
> > device was created by duplicating instead of factoring out code
> > implementations diverge over time in bad ways due to optimizations,
> > features and most importantly bugfixes being applied only to one
> > instance or the other. See for instance tun.c and tap.c.
> >
> > Unrelated, an ipv6 socket can receive both ipv4 and ipv6 traffic if
> > not setting the v6only bit, so does the device need to have separate
> > sock4 and sock6 members? Both sockets currently lead to the same
> > bareudp_udp_encap_recv callback function.
>
> I was checking this.AF_INET6 allows v6 and v4 mapped v6 address.
> And it doesnot allow both at the same time.So we need both
> sockets to support v4 and v6 at the same time.correct ?

bareudp_create_sock currently creates an inet socket listening on
INADDR_ANY and an inet6 socket listening on in6addr_any with v6only.
If so, just the latter without v6only should offer the same.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-07 15:53               ` Willem de Bruijn
@ 2019-11-07 16:12                 ` Martin Varghese
  2019-11-07 16:35                   ` Willem de Bruijn
  0 siblings, 1 reply; 36+ messages in thread
From: Martin Varghese @ 2019-11-07 16:12 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Jonathan Corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Thu, Nov 07, 2019 at 10:53:47AM -0500, Willem de Bruijn wrote:
> > > I do think that with close scrutiny there is a lot more room for code
> > > deduplication. Just look at the lower half of geneve_rx and
> > > bareudp_udp_encap_recv, for instance. This, too, is identical down to
> > > the comments. Indeed, is it fair to say that geneve was taken as the
> > > basis for this device?
> > >
> > > That said, even just avoiding duplicating those routing functions
> > > would be a good start.
> > >
> > > I'm harping on this because in other examples in the past where a new
> > > device was created by duplicating instead of factoring out code
> > > implementations diverge over time in bad ways due to optimizations,
> > > features and most importantly bugfixes being applied only to one
> > > instance or the other. See for instance tun.c and tap.c.
> > >
> > > Unrelated, an ipv6 socket can receive both ipv4 and ipv6 traffic if
> > > not setting the v6only bit, so does the device need to have separate
> > > sock4 and sock6 members? Both sockets currently lead to the same
> > > bareudp_udp_encap_recv callback function.
> >
> > I was checking this.AF_INET6 allows v6 and v4 mapped v6 address.
> > And it doesnot allow both at the same time.So we need both
> > sockets to support v4 and v6 at the same time.correct ?
> 
> bareudp_create_sock currently creates an inet socket listening on
> INADDR_ANY and an inet6 socket listening on in6addr_any with v6only.
> If so, just the latter without v6only should offer the same.

To receive and ipv4 packet in AF_INET6 packet we need to pass v4 address 
in v6 format( v4 mapped v6 address). Is it not ?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-07 16:12                 ` Martin Varghese
@ 2019-11-07 16:35                   ` Willem de Bruijn
  2019-11-07 17:31                     ` Jiri Benc
  2019-11-11 16:02                     ` Martin Varghese
  0 siblings, 2 replies; 36+ messages in thread
From: Willem de Bruijn @ 2019-11-07 16:35 UTC (permalink / raw)
  To: Martin Varghese
  Cc: Network Development, David Miller, Jonathan Corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Thu, Nov 7, 2019 at 11:12 AM Martin Varghese
<martinvarghesenokia@gmail.com> wrote:
>
> On Thu, Nov 07, 2019 at 10:53:47AM -0500, Willem de Bruijn wrote:
> > > > I do think that with close scrutiny there is a lot more room for code
> > > > deduplication. Just look at the lower half of geneve_rx and
> > > > bareudp_udp_encap_recv, for instance. This, too, is identical down to
> > > > the comments. Indeed, is it fair to say that geneve was taken as the
> > > > basis for this device?
> > > >
> > > > That said, even just avoiding duplicating those routing functions
> > > > would be a good start.
> > > >
> > > > I'm harping on this because in other examples in the past where a new
> > > > device was created by duplicating instead of factoring out code
> > > > implementations diverge over time in bad ways due to optimizations,
> > > > features and most importantly bugfixes being applied only to one
> > > > instance or the other. See for instance tun.c and tap.c.
> > > >
> > > > Unrelated, an ipv6 socket can receive both ipv4 and ipv6 traffic if
> > > > not setting the v6only bit, so does the device need to have separate
> > > > sock4 and sock6 members? Both sockets currently lead to the same
> > > > bareudp_udp_encap_recv callback function.
> > >
> > > I was checking this.AF_INET6 allows v6 and v4 mapped v6 address.
> > > And it doesnot allow both at the same time.So we need both
> > > sockets to support v4 and v6 at the same time.correct ?
> >
> > bareudp_create_sock currently creates an inet socket listening on
> > INADDR_ANY and an inet6 socket listening on in6addr_any with v6only.
> > If so, just the latter without v6only should offer the same.
>
> To receive and ipv4 packet in AF_INET6 packet we need to pass v4 address
> in v6 format( v4 mapped v6 address). Is it not ?

If the bareudp device binds to a specific port on all local addresses,
which I think it's doing judging from what it passes to udp_sock_create
(but I may very well be missing something), then in6addr_any alone will
suffice to receive both v6 and v4 packets.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-07 16:35                   ` Willem de Bruijn
@ 2019-11-07 17:31                     ` Jiri Benc
  2019-11-07 18:59                       ` Willem de Bruijn
  2019-11-11 16:02                     ` Martin Varghese
  1 sibling, 1 reply; 36+ messages in thread
From: Jiri Benc @ 2019-11-07 17:31 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin Varghese, Network Development, David Miller,
	Jonathan Corbet, scott.drennan, martin.varghese

On Thu, 7 Nov 2019 11:35:07 -0500, Willem de Bruijn wrote:
> If the bareudp device binds to a specific port on all local addresses,
> which I think it's doing judging from what it passes to udp_sock_create
> (but I may very well be missing something), then in6addr_any alone will
> suffice to receive both v6 and v4 packets.

This will not work when IPv6 is disabled, either by the kernel config
or administratively. We do need to have two sockets.

 Jiri


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-07 17:31                     ` Jiri Benc
@ 2019-11-07 18:59                       ` Willem de Bruijn
  2019-11-07 19:05                         ` Jiri Benc
  0 siblings, 1 reply; 36+ messages in thread
From: Willem de Bruijn @ 2019-11-07 18:59 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Martin Varghese, Network Development, David Miller,
	Jonathan Corbet, scott.drennan, martin.varghese

On Thu, Nov 7, 2019 at 12:31 PM Jiri Benc <jbenc@redhat.com> wrote:
>
> On Thu, 7 Nov 2019 11:35:07 -0500, Willem de Bruijn wrote:
> > If the bareudp device binds to a specific port on all local addresses,
> > which I think it's doing judging from what it passes to udp_sock_create
> > (but I may very well be missing something), then in6addr_any alone will
> > suffice to receive both v6 and v4 packets.
>
> This will not work when IPv6 is disabled, either by the kernel config
> or administratively. We do need to have two sockets.

This still needs only one socket, right? Just fall back to inet if
ipv6 is disabled.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-07 18:59                       ` Willem de Bruijn
@ 2019-11-07 19:05                         ` Jiri Benc
  2019-11-07 19:13                           ` Willem de Bruijn
  0 siblings, 1 reply; 36+ messages in thread
From: Jiri Benc @ 2019-11-07 19:05 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin Varghese, Network Development, David Miller,
	Jonathan Corbet, scott.drennan, martin.varghese

On Thu, 7 Nov 2019 13:59:23 -0500, Willem de Bruijn wrote:
> This still needs only one socket, right? Just fall back to inet if
> ipv6 is disabled.

What would happen if IPv6 is disabled while the tunnel is operating?

 Jiri


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-07 19:05                         ` Jiri Benc
@ 2019-11-07 19:13                           ` Willem de Bruijn
  0 siblings, 0 replies; 36+ messages in thread
From: Willem de Bruijn @ 2019-11-07 19:13 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Martin Varghese, Network Development, David Miller,
	Jonathan Corbet, scott.drennan, martin.varghese

On Thu, Nov 7, 2019 at 2:05 PM Jiri Benc <jbenc@redhat.com> wrote:
>
> On Thu, 7 Nov 2019 13:59:23 -0500, Willem de Bruijn wrote:
> > This still needs only one socket, right? Just fall back to inet if
> > ipv6 is disabled.
>
> What would happen if IPv6 is disabled while the tunnel is operating?

How do you disable ipv6 at runtime?

rmmod is no longer allowed as of commit 8ce440610357 ("ipv6: do not
allow ipv6 module to be removed"). The commit specifically mentions a
dependency by vxlan; that sounds quite similar to this.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-07 16:35                   ` Willem de Bruijn
  2019-11-07 17:31                     ` Jiri Benc
@ 2019-11-11 16:02                     ` Martin Varghese
  2019-11-11 21:45                       ` Willem de Bruijn
  1 sibling, 1 reply; 36+ messages in thread
From: Martin Varghese @ 2019-11-11 16:02 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Jonathan Corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Thu, Nov 07, 2019 at 11:35:07AM -0500, Willem de Bruijn wrote:
> On Thu, Nov 7, 2019 at 11:12 AM Martin Varghese
> <martinvarghesenokia@gmail.com> wrote:
> >
> > On Thu, Nov 07, 2019 at 10:53:47AM -0500, Willem de Bruijn wrote:
> > > > > I do think that with close scrutiny there is a lot more room for code
> > > > > deduplication. Just look at the lower half of geneve_rx and
> > > > > bareudp_udp_encap_recv, for instance. This, too, is identical down to
> > > > > the comments. Indeed, is it fair to say that geneve was taken as the
> > > > > basis for this device?
> > > > >
> > > > > That said, even just avoiding duplicating those routing functions
> > > > > would be a good start.
> > > > >
> > > > > I'm harping on this because in other examples in the past where a new
> > > > > device was created by duplicating instead of factoring out code
> > > > > implementations diverge over time in bad ways due to optimizations,
> > > > > features and most importantly bugfixes being applied only to one
> > > > > instance or the other. See for instance tun.c and tap.c.
> > > > >
> > > > > Unrelated, an ipv6 socket can receive both ipv4 and ipv6 traffic if
> > > > > not setting the v6only bit, so does the device need to have separate
> > > > > sock4 and sock6 members? Both sockets currently lead to the same
> > > > > bareudp_udp_encap_recv callback function.
> > > >
> > > > I was checking this.AF_INET6 allows v6 and v4 mapped v6 address.
> > > > And it doesnot allow both at the same time.So we need both
> > > > sockets to support v4 and v6 at the same time.correct ?
> > >
> > > bareudp_create_sock currently creates an inet socket listening on
> > > INADDR_ANY and an inet6 socket listening on in6addr_any with v6only.
> > > If so, just the latter without v6only should offer the same.
> >
> > To receive and ipv4 packet in AF_INET6 packet we need to pass v4 address
> > in v6 format( v4 mapped v6 address). Is it not ?
> 
> If the bareudp device binds to a specific port on all local addresses,
> which I think it's doing judging from what it passes to udp_sock_create
> (but I may very well be missing something), then in6addr_any alone will
> suffice to receive both v6 and v4 packets.

Must invokde udp_encap_enable explicitly from baredudp module during setup time.
Otherwise v4 packets will not land in encap_rcv handler.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-11 16:02                     ` Martin Varghese
@ 2019-11-11 21:45                       ` Willem de Bruijn
  2019-11-14 13:17                         ` Martin Varghese
  0 siblings, 1 reply; 36+ messages in thread
From: Willem de Bruijn @ 2019-11-11 21:45 UTC (permalink / raw)
  To: Martin Varghese
  Cc: Network Development, David Miller, Jonathan Corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Mon, Nov 11, 2019 at 11:02 AM Martin Varghese
<martinvarghesenokia@gmail.com> wrote:
>
> On Thu, Nov 07, 2019 at 11:35:07AM -0500, Willem de Bruijn wrote:
> > On Thu, Nov 7, 2019 at 11:12 AM Martin Varghese
> > <martinvarghesenokia@gmail.com> wrote:
> > >
> > > On Thu, Nov 07, 2019 at 10:53:47AM -0500, Willem de Bruijn wrote:
> > > > > > I do think that with close scrutiny there is a lot more room for code
> > > > > > deduplication. Just look at the lower half of geneve_rx and
> > > > > > bareudp_udp_encap_recv, for instance. This, too, is identical down to
> > > > > > the comments. Indeed, is it fair to say that geneve was taken as the
> > > > > > basis for this device?
> > > > > >
> > > > > > That said, even just avoiding duplicating those routing functions
> > > > > > would be a good start.
> > > > > >
> > > > > > I'm harping on this because in other examples in the past where a new
> > > > > > device was created by duplicating instead of factoring out code
> > > > > > implementations diverge over time in bad ways due to optimizations,
> > > > > > features and most importantly bugfixes being applied only to one
> > > > > > instance or the other. See for instance tun.c and tap.c.
> > > > > >
> > > > > > Unrelated, an ipv6 socket can receive both ipv4 and ipv6 traffic if
> > > > > > not setting the v6only bit, so does the device need to have separate
> > > > > > sock4 and sock6 members? Both sockets currently lead to the same
> > > > > > bareudp_udp_encap_recv callback function.
> > > > >
> > > > > I was checking this.AF_INET6 allows v6 and v4 mapped v6 address.
> > > > > And it doesnot allow both at the same time.So we need both
> > > > > sockets to support v4 and v6 at the same time.correct ?
> > > >
> > > > bareudp_create_sock currently creates an inet socket listening on
> > > > INADDR_ANY and an inet6 socket listening on in6addr_any with v6only.
> > > > If so, just the latter without v6only should offer the same.
> > >
> > > To receive and ipv4 packet in AF_INET6 packet we need to pass v4 address
> > > in v6 format( v4 mapped v6 address). Is it not ?
> >
> > If the bareudp device binds to a specific port on all local addresses,
> > which I think it's doing judging from what it passes to udp_sock_create
> > (but I may very well be missing something), then in6addr_any alone will
> > suffice to receive both v6 and v4 packets.
>
> Must invokde udp_encap_enable explicitly from baredudp module during setup time.
> Otherwise v4 packets will not land in encap_rcv handler.

The call to setup_udp_tunnel_sock should take care of that. The issue
is probably that in udp_tunnel_encap_enable:

  #if IS_ENABLED(CONFIG_IPV6)
        if (sock->sk->sk_family == PF_INET6)
                ipv6_stub->udpv6_encap_enable();
        else
  #endif
                udp_encap_enable();

does not call udp_encap_enable for IPv6 sockets. Likely because
existing callers like vxlan always pass v6only = 1. Due to dual stack,
PF_INET6 should enable both static keys.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
  2019-11-11 21:45                       ` Willem de Bruijn
@ 2019-11-14 13:17                         ` Martin Varghese
  0 siblings, 0 replies; 36+ messages in thread
From: Martin Varghese @ 2019-11-14 13:17 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Jonathan Corbet,
	scott.drennan, Jiri Benc, martin.varghese

On Mon, Nov 11, 2019 at 04:45:30PM -0500, Willem de Bruijn wrote:
> On Mon, Nov 11, 2019 at 11:02 AM Martin Varghese
> <martinvarghesenokia@gmail.com> wrote:
> >
> > On Thu, Nov 07, 2019 at 11:35:07AM -0500, Willem de Bruijn wrote:
> > > On Thu, Nov 7, 2019 at 11:12 AM Martin Varghese
> > > <martinvarghesenokia@gmail.com> wrote:
> > > >
> > > > On Thu, Nov 07, 2019 at 10:53:47AM -0500, Willem de Bruijn wrote:
> > > > > > > I do think that with close scrutiny there is a lot more room for code
> > > > > > > deduplication. Just look at the lower half of geneve_rx and
> > > > > > > bareudp_udp_encap_recv, for instance. This, too, is identical down to
> > > > > > > the comments. Indeed, is it fair to say that geneve was taken as the
> > > > > > > basis for this device?
> > > > > > >
> > > > > > > That said, even just avoiding duplicating those routing functions
> > > > > > > would be a good start.
> > > > > > >
> > > > > > > I'm harping on this because in other examples in the past where a new
> > > > > > > device was created by duplicating instead of factoring out code
> > > > > > > implementations diverge over time in bad ways due to optimizations,
> > > > > > > features and most importantly bugfixes being applied only to one
> > > > > > > instance or the other. See for instance tun.c and tap.c.
> > > > > > >
> > > > > > > Unrelated, an ipv6 socket can receive both ipv4 and ipv6 traffic if
> > > > > > > not setting the v6only bit, so does the device need to have separate
> > > > > > > sock4 and sock6 members? Both sockets currently lead to the same
> > > > > > > bareudp_udp_encap_recv callback function.
> > > > > >
> > > > > > I was checking this.AF_INET6 allows v6 and v4 mapped v6 address.
> > > > > > And it doesnot allow both at the same time.So we need both
> > > > > > sockets to support v4 and v6 at the same time.correct ?
> > > > >
> > > > > bareudp_create_sock currently creates an inet socket listening on
> > > > > INADDR_ANY and an inet6 socket listening on in6addr_any with v6only.
> > > > > If so, just the latter without v6only should offer the same.
> > > >
> > > > To receive and ipv4 packet in AF_INET6 packet we need to pass v4 address
> > > > in v6 format( v4 mapped v6 address). Is it not ?
> > >
> > > If the bareudp device binds to a specific port on all local addresses,
> > > which I think it's doing judging from what it passes to udp_sock_create
> > > (but I may very well be missing something), then in6addr_any alone will
> > > suffice to receive both v6 and v4 packets.
> >
> > Must invokde udp_encap_enable explicitly from baredudp module during setup time.
> > Otherwise v4 packets will not land in encap_rcv handler.
> 
> The call to setup_udp_tunnel_sock should take care of that. The issue
> is probably that in udp_tunnel_encap_enable:
> 
>   #if IS_ENABLED(CONFIG_IPV6)
>         if (sock->sk->sk_family == PF_INET6)
>                 ipv6_stub->udpv6_encap_enable();
>         else
>   #endif
>                 udp_encap_enable();
> 
> does not call udp_encap_enable for IPv6 sockets. Likely because
> existing callers like vxlan always pass v6only = 1. Due to dual stack,
> PF_INET6 should enable both static keys.

Thanks for your time.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2019-11-14 13:18 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-08  9:48 [PATCH net-next 0/2] Bareudp Tunnel Module Martin Varghese
2019-10-08  9:48 ` [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc Martin Varghese
2019-10-08 14:06   ` Jonathan Corbet
2019-10-08 14:57   ` Randy Dunlap
2019-10-08 16:04   ` Stephen Hemminger
2019-10-08 16:05   ` Stephen Hemminger
2019-10-08 16:28   ` Willem de Bruijn
2019-10-09 12:48     ` Martin Varghese
2019-10-09 14:58       ` Willem de Bruijn
2019-10-09 15:21         ` Willem de Bruijn
2019-10-09 15:42         ` Jiri Benc
2019-10-09 16:15           ` Willem de Bruijn
2019-10-18 20:03           ` Tom Herbert
2019-10-21 17:18             ` Jiri Benc
2019-10-17 13:20     ` Martin Varghese
2019-10-17 20:48       ` Willem de Bruijn
2019-10-18  8:20         ` Martin Varghese
2019-10-18 14:59           ` Willem de Bruijn
2019-10-23  2:40             ` Martin Varghese
2019-11-07 13:38             ` Martin Varghese
2019-11-07 15:53               ` Willem de Bruijn
2019-11-07 16:12                 ` Martin Varghese
2019-11-07 16:35                   ` Willem de Bruijn
2019-11-07 17:31                     ` Jiri Benc
2019-11-07 18:59                       ` Willem de Bruijn
2019-11-07 19:05                         ` Jiri Benc
2019-11-07 19:13                           ` Willem de Bruijn
2019-11-11 16:02                     ` Martin Varghese
2019-11-11 21:45                       ` Willem de Bruijn
2019-11-14 13:17                         ` Martin Varghese
2019-10-08  9:49 ` [PATCH net-next 2/2] Special handling for IP & MPLS Martin Varghese
2019-10-08 14:58   ` Randy Dunlap
2019-10-08 16:09   ` Willem de Bruijn
2019-10-09 13:38     ` Martin Varghese
2019-10-09 15:06       ` Willem de Bruijn
2019-10-09 15:19         ` Willem de Bruijn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.