* [RFC net-next 0/3] IP imposition of per-nh MPLS encap
@ 2015-06-01 16:46 Robert Shearman
  2015-06-01 16:46 ` [RFC net-next 1/3] net: infra for per-nexthop encap data Robert Shearman
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Robert Shearman @ 2015-06-01 16:46 UTC (permalink / raw)
  To: netdev; +Cc: Eric W. Biederman, roopa, Thomas Graf, Robert Shearman

In order to be able to function as a Label Edge Router in an MPLS
network, it is necessary to be able to take IP packets and impose an
MPLS encap and forward them out. The traditional approach of setting
up an interface for each "tunnel" endpoint doesn't scale for the
common MPLS use-cases where each IP route tends to be assigned a
different label as encap.

The solution suggested here for further discussion is to provide the
facility to define encap data on a per-nexthop basis using a new
netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
forwarding code, but interpreted by the virtual interface assigned to
the nexthop.

A new ipmpls interface type is defined to show the use of this
facility to allow IP packets to be imposed with an MPLS
encap. However, the facility is designed to be general enough to be
used by any encapsulation/tunneling mechanism that has similar
requirements of high-scale, high-variation-of-encap.
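
For illustration, a route using this facility could be installed from
user space roughly as below. This is only a sketch and not part of the
patches: it assumes uapi headers from a kernel with this series applied
(for RTA_ENCAP), an ipmpls device at ifindex 5, and a label-stack
payload in the same format as RTA_NEWDST (label in the top 20 bits,
bottom-of-stack bit set on the last entry, TC/TTL zero).

#include <string.h>
#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

static void add_attr(struct nlmsghdr *n, int type, const void *data, int len)
{
	struct rtattr *rta;

	rta = (struct rtattr *)((char *)n + NLMSG_ALIGN(n->nlmsg_len));
	rta->rta_type = type;
	rta->rta_len = RTA_LENGTH(len);
	memcpy(RTA_DATA(rta), data, len);
	n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(rta->rta_len);
}

int main(void)
{
	struct {
		struct nlmsghdr n;
		struct rtmsg r;
		char buf[256];
	} req = {
		.n.nlmsg_len	= NLMSG_LENGTH(sizeof(struct rtmsg)),
		.n.nlmsg_type	= RTM_NEWROUTE,
		.n.nlmsg_flags	= NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK,
		.r.rtm_family	= AF_INET,
		.r.rtm_dst_len	= 24,
		.r.rtm_table	= RT_TABLE_MAIN,
		.r.rtm_protocol	= RTPROT_STATIC,
		.r.rtm_scope	= RT_SCOPE_LINK,
		.r.rtm_type	= RTN_UNICAST,
	};
	struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
	uint32_t dst = inet_addr("10.1.1.0");	/* 10.1.1.0/24 */
	uint32_t oif = 5;			/* assumed ipmpls ifindex */
	/* single label 100, bottom-of-stack bit set, TC/TTL zero */
	uint32_t label = htonl((100u << 12) | (1u << 8));
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	add_attr(&req.n, RTA_DST, &dst, sizeof(dst));
	add_attr(&req.n, RTA_OIF, &oif, sizeof(oif));
	add_attr(&req.n, RTA_ENCAP, &label, sizeof(label));

	return sendto(fd, &req, req.n.nlmsg_len, 0,
		      (struct sockaddr *)&sa, sizeof(sa)) < 0;
}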

RFC because:
 - IPv6 side not implemented
 - struct rtable shouldn't be bloated by pointer+uint
 - Hasn't been thoroughly tested yet

Robert Shearman (3):
  net: infra for per-nexthop encap data
  ipv4: storing and retrieval of per-nexthop encap
  mpls: new ipmpls device for encapsulating IP packets as mpls

 include/linux/rtnetlink.h      |   7 +
 include/net/dst.h              |  11 ++
 include/net/dst_ops.h          |   2 +
 include/net/ip_fib.h           |   2 +
 include/net/route.h            |   3 +
 include/net/rtnetlink.h        |  11 ++
 include/uapi/linux/if_arp.h    |   1 +
 include/uapi/linux/rtnetlink.h |   1 +
 net/core/rtnetlink.c           |  36 ++++++
 net/ipv4/fib_frontend.c        |   3 +
 net/ipv4/fib_lookup.h          |   2 +
 net/ipv4/fib_semantics.c       | 179 +++++++++++++++++++++++++-
 net/ipv4/route.c               |  24 ++++
 net/mpls/Kconfig               |   5 +
 net/mpls/Makefile              |   1 +
 net/mpls/af_mpls.c             |   2 +
 net/mpls/ipmpls.c              | 284 +++++++++++++++++++++++++++++++++++++++++
 17 files changed, 572 insertions(+), 2 deletions(-)
 create mode 100644 net/mpls/ipmpls.c

-- 
2.1.4


* [RFC net-next 1/3] net: infra for per-nexthop encap data
  2015-06-01 16:46 [RFC net-next 0/3] IP imposition of per-nh MPLS encap Robert Shearman
@ 2015-06-01 16:46 ` Robert Shearman
  2015-06-02 18:15   ` Eric W. Biederman
  2015-06-01 16:46 ` [RFC net-next 2/3] ipv4: storing and retrieval of per-nexthop encap Robert Shearman
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Robert Shearman @ 2015-06-01 16:46 UTC (permalink / raw)
  To: netdev; +Cc: Eric W. Biederman, roopa, Thomas Graf, Robert Shearman

Adding a new interface to apply encap onto packets is a mechanism that
works well today: the encap is set up separately from the routes that
use it, so routing protocols and other user-space apps don't need to do
anything special to add routes out of a new type of interface. However,
the overhead of creating an interface is high, especially in terms of
memory, so the traditional method doesn't work well for large numbers
of routes applying encap where there is a low degree of sharing of the
encap.

The solution is to introduce a way of defining encap on a per-nexthop
basis (i.e. per-route if only one nexthop) through the addition of a
new netlink attribute, RTA_ENCAP. The semantics of this attribute are
that the data is interpreted according to the output interface type
(RTA_OIF) and is opaque to the normal forwarding path. The output
interface doesn't have to be defined per-nexthop, but instead
represents the way of encapsulating the packet. There could be as few
as one per namespace, but more could be created, particularly if they
are used to define parameters which are shared by a large number of
routes. However, the split of what goes in the encap data and what
might be specified via interface attributes is entirely up to the
encap-type implementation.

New rtnetlink operations are defined to assist with the management of
this data:
- parse_encap for parsing the attribute given through rtnl and either
  sizing the in-memory version (if the encap ptr is NULL) or filling in
  the in-memory version. This operation allows the interface to reject
  invalid encap specified by user-space, and the sizing allows the
  kernel to use a different in-memory representation from the netlink
  API (which might be optimised for extensibility rather than speed of
  packet forwarding).
- fill_encap for taking the in-memory version of the encap and filling
  in an RTA_ENCAP attribute in a netlink message.
- match_encap for comparing an in-memory version of encap with an
  RTA_ENCAP version, returning 0 if matching or 1 if different.

A new dst operation is also defined to allow encap-type interfaces to
retrieve the encap data from their xmit functions and use it for
encapsulating the packet and for further forwarding.
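
To make the contract concrete, a minimal encap-type implementation
might look like the following. This is purely illustrative and not part
of the patches: it uses a hypothetical encap consisting of a single u32
id; the real MPLS user of these hooks is in patch 3.

static int foo_parse_encap(const struct net_device *dev,
			   const struct nlattr *nla, void *encap)
{
	if (nla_len(nla) != sizeof(u32))
		return -EINVAL;

	if (encap)		/* second pass: fill the in-memory copy */
		*(u32 *)encap = nla_get_u32(nla);

	/* first pass (encap == NULL): just report the size needed */
	return sizeof(u32);
}

static int foo_fill_encap(const struct net_device *dev, struct sk_buff *skb,
			  int encap_len, const void *encap)
{
	return nla_put_u32(skb, RTA_ENCAP, *(const u32 *)encap);
}

static int foo_match_encap(const struct net_device *dev,
			   const struct nlattr *nla, int encap_len,
			   const void *encap)
{
	return nla_len(nla) != sizeof(u32) ||
	       nla_get_u32(nla) != *(const u32 *)encap;
}

At transmit time the device's ndo_start_xmit() would then recover the
stored bytes with dst_get_encap(skb, &encap), which returns their
length.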

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
---
 include/linux/rtnetlink.h      |  7 +++++++
 include/net/dst.h              | 11 +++++++++++
 include/net/dst_ops.h          |  2 ++
 include/net/rtnetlink.h        | 11 +++++++++++
 include/uapi/linux/rtnetlink.h |  1 +
 net/core/rtnetlink.c           | 36 ++++++++++++++++++++++++++++++++++++
 6 files changed, 68 insertions(+)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index a2324fb45cf4..470d822ddd61 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -22,6 +22,13 @@ struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct net_device *dev,
 void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev,
 		       gfp_t flags);
 
+int rtnl_parse_encap(const struct net_device *dev, const struct nlattr *nla,
+		     void *encap);
+int rtnl_fill_encap(const struct net_device *dev, struct sk_buff *skb,
+		    int encap_len, const void *encap);
+int rtnl_match_encap(const struct net_device *dev, const struct nlattr *nla,
+		     int encap_len, const void *encap);
+
 
 /* RTNL is used as a global lock for all changes to network configuration  */
 extern void rtnl_lock(void);
diff --git a/include/net/dst.h b/include/net/dst.h
index 2bc73f8a00a9..df0e6ec18eca 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -506,4 +506,15 @@ static inline struct xfrm_state *dst_xfrm(const struct dst_entry *dst)
 }
 #endif
 
+/* Get encap data for destination */
+static inline int dst_get_encap(struct sk_buff *skb, const void **encap)
+{
+	const struct dst_entry *dst = skb_dst(skb);
+
+	if (!dst || !dst->ops->get_encap)
+		return 0;
+
+	return dst->ops->get_encap(dst, encap);
+}
+
 #endif /* _NET_DST_H */
diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index d64253914a6a..97f48cf8ef7d 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -32,6 +32,8 @@ struct dst_ops {
 	struct neighbour *	(*neigh_lookup)(const struct dst_entry *dst,
 						struct sk_buff *skb,
 						const void *daddr);
+	int			(*get_encap)(const struct dst_entry *dst,
+					     const void **encap);
 
 	struct kmem_cache	*kmem_cachep;
 
diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 343d922d15c2..3121ade24957 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -95,6 +95,17 @@ struct rtnl_link_ops {
 						   const struct net_device *dev,
 						   const struct net_device *slave_dev);
 	struct net		*(*get_link_net)(const struct net_device *dev);
+	int			(*parse_encap)(const struct net_device *dev,
+					       const struct nlattr *nla,
+					       void *encap);
+	int			(*fill_encap)(const struct net_device *dev,
+					      struct sk_buff *skb,
+					      int encap_len,
+					      const void *encap);
+	int			(*match_encap)(const struct net_device *dev,
+					       const struct nlattr *nla,
+					       int encap_len,
+					       const void *encap);
 };
 
 int __rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 17fb02f488da..ed4c797503f2 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -308,6 +308,7 @@ enum rtattr_type_t {
 	RTA_VIA,
 	RTA_NEWDST,
 	RTA_PREF,
+	RTA_ENCAP,
 	__RTA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 077b6d280371..3b4e40a82799 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1441,6 +1441,42 @@ static int validate_linkmsg(struct net_device *dev, struct nlattr *tb[])
 	return 0;
 }
 
+int rtnl_parse_encap(const struct net_device *dev, const struct nlattr *nla,
+		     void *encap)
+{
+	const struct rtnl_link_ops *ops = dev->rtnl_link_ops;
+
+	if (!ops->parse_encap)
+		return -EINVAL;
+
+	return ops->parse_encap(dev, nla, encap);
+}
+EXPORT_SYMBOL(rtnl_parse_encap);
+
+int rtnl_fill_encap(const struct net_device *dev, struct sk_buff *skb,
+		    int encap_len, const void *encap)
+{
+	const struct rtnl_link_ops *ops = dev->rtnl_link_ops;
+
+	if (!ops->fill_encap)
+		return -EINVAL;
+
+	return ops->fill_encap(dev, skb, encap_len, encap);
+}
+EXPORT_SYMBOL(rtnl_fill_encap);
+
+int rtnl_match_encap(const struct net_device *dev, const struct nlattr *nla,
+		     int encap_len, const void *encap)
+{
+	const struct rtnl_link_ops *ops = dev->rtnl_link_ops;
+
+	if (!ops->match_encap)
+		return -EINVAL;
+
+	return ops->match_encap(dev, nla, encap_len, encap);
+}
+EXPORT_SYMBOL(rtnl_match_encap);
+
 static int do_setvfinfo(struct net_device *dev, struct nlattr *attr)
 {
 	int rem, err = -EINVAL;
-- 
2.1.4


* [RFC net-next 2/3] ipv4: storing and retrieval of per-nexthop encap
  2015-06-01 16:46 [RFC net-next 0/3] IP imposition of per-nh MPLS encap Robert Shearman
  2015-06-01 16:46 ` [RFC net-next 1/3] net: infra for per-nexthop encap data Robert Shearman
@ 2015-06-01 16:46 ` Robert Shearman
  2015-06-02 16:01   ` roopa
  2015-06-01 16:46 ` [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls Robert Shearman
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Robert Shearman @ 2015-06-01 16:46 UTC (permalink / raw)
  To: netdev; +Cc: Eric W. Biederman, roopa, Thomas Graf, Robert Shearman

Parse RTA_ENCAP attribute for one path and multipath routes. The encap
length is stored in a newly added field to fib_nh, nh_encap_len,
although this is added to a padding hole in the structure so that it
doesn't increase the size at all. The encap data itself is stored at
the end of the array of nexthops. Whilst this means that retrieval
isn't optimal, especially if there are multiple nexthops, this avoids
the memory cost of an extra pointer, as well as any potential change
to the cache or instruction layout that could cause a performance
impact.
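
Roughly, the single allocation made by fib_create_info() then looks
like this (illustrative only, two nexthops shown, sizes not to scale):

 +-----------------+-----------+-----------+---------------+---------------+
 | struct fib_info | fib_nh[0] | fib_nh[1] | nh[0] encap   | nh[1] encap   |
 |                 |           |           | nh_encap_len0 | nh_encap_len1 |
 +-----------------+-----------+-----------+---------------+---------------+

__fib_get_nh_encap() starts at fi->fib_nh + fi->fib_nhs and steps
forward by each preceding nexthop's nh_encap_len to locate a given
nexthop's blob.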

Currently, the dst structure allocated to represent the destination of
the packet and used for retrieving the encap by the encap-type
interface has been grown through the addition of the rt_encap_len and
rt_encap fields. This isn't desirable and could be fixed by defining a
new destination type with operations copied from the normal case,
other than the addition of the get_encap operation.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
---
 include/net/ip_fib.h     |   2 +
 include/net/route.h      |   3 +
 net/ipv4/fib_frontend.c  |   3 +
 net/ipv4/fib_lookup.h    |   2 +
 net/ipv4/fib_semantics.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++-
 net/ipv4/route.c         |  24 +++++++
 6 files changed, 211 insertions(+), 2 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 54271ed0ed45..a06cec5eb3aa 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -44,6 +44,7 @@ struct fib_config {
 	u32			fc_flow;
 	u32			fc_nlflags;
 	struct nl_info		fc_nlinfo;
+	struct nlattr *fc_encap;
  };
 
 struct fib_info;
@@ -75,6 +76,7 @@ struct fib_nh {
 	struct fib_info		*nh_parent;
 	unsigned int		nh_flags;
 	unsigned char		nh_scope;
+	unsigned char		nh_encap_len;
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	int			nh_weight;
 	int			nh_power;
diff --git a/include/net/route.h b/include/net/route.h
index fe22d03afb6a..e8b58914c4c1 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -64,6 +64,9 @@ struct rtable {
 	/* Miscellaneous cached information */
 	u32			rt_pmtu;
 
+	unsigned int		rt_encap_len;
+	void			*rt_encap;
+
 	struct list_head	rt_uncached;
 	struct uncached_list	*rt_uncached_list;
 };
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 872494e6e6eb..aa538ab7e3b9 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -656,6 +656,9 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
 		case RTA_TABLE:
 			cfg->fc_table = nla_get_u32(attr);
 			break;
+		case RTA_ENCAP:
+			cfg->fc_encap = attr;
+			break;
 		}
 	}
 
diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index c6211ed60b03..003318c51ae8 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -34,6 +34,8 @@ int fib_dump_info(struct sk_buff *skb, u32 pid, u32 seq, int event, u32 tb_id,
 		  unsigned int);
 void rtmsg_fib(int event, __be32 key, struct fib_alias *fa, int dst_len,
 	       u32 tb_id, const struct nl_info *info, unsigned int nlm_flags);
+const void *fib_get_nh_encap(const struct fib_info *fi,
+			     const struct fib_nh *nh);
 
 static inline void fib_result_assign(struct fib_result *res,
 				     struct fib_info *fi)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 28ec3c1823bf..db466b636241 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -257,6 +257,9 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
 	const struct fib_nh *onh = ofi->fib_nh;
 
 	for_nexthops(fi) {
+		const void *onh_encap = fib_get_nh_encap(ofi, onh);
+		const void *nh_encap = fib_get_nh_encap(fi, nh);
+
 		if (nh->nh_oif != onh->nh_oif ||
 		    nh->nh_gw  != onh->nh_gw ||
 		    nh->nh_scope != onh->nh_scope ||
@@ -266,7 +269,10 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		    nh->nh_tclassid != onh->nh_tclassid ||
 #endif
-		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
+		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD) ||
+		    nh->nh_encap_len != onh->nh_encap_len ||
+		    memcmp(nh_encap, onh_encap, nh->nh_encap_len)
+			)
 			return -1;
 		onh++;
 	} endfor_nexthops(fi);
@@ -374,6 +380,11 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi)
 		/* may contain flow and gateway attribute */
 		nhsize += 2 * nla_total_size(4);
 
+		for_nexthops(fi) {
+			if (nh->nh_encap_len)
+				nhsize += nla_total_size(nh->nh_encap_len);
+		} endfor_nexthops(fi);
+
 		/* all nexthops are packed in a nested attribute */
 		payload += nla_total_size(fi->fib_nhs * nhsize);
 	}
@@ -434,6 +445,83 @@ static int fib_detect_death(struct fib_info *fi, int order,
 	return 1;
 }
 
+static int fib_total_encap(struct fib_config *cfg)
+{
+	struct net *net = cfg->fc_nlinfo.nl_net;
+	int total_encap_len = 0;
+
+	if (cfg->fc_mp) {
+		int remaining = cfg->fc_mp_len;
+		struct rtnexthop *rtnh = cfg->fc_mp;
+
+		while (rtnh_ok(rtnh, remaining)) {
+			struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
+			int attrlen;
+
+			attrlen = rtnh_attrlen(rtnh);
+			nla = nla_find(attrs, attrlen, RTA_ENCAP);
+			if (nla) {
+				struct net_device *dev;
+				int len;
+
+				dev = __dev_get_by_index(net,
+							 rtnh->rtnh_ifindex);
+				if (!dev)
+					return -EINVAL;
+
+				/* Determine space required */
+				len = rtnl_parse_encap(dev, nla, NULL);
+				if (len < 0)
+					return len;
+
+				total_encap_len += len;
+			}
+
+			rtnh = rtnh_next(rtnh, &remaining);
+		}
+	} else {
+		if (cfg->fc_encap) {
+			struct net_device *dev;
+			int len;
+
+			dev = __dev_get_by_index(net, cfg->fc_oif);
+			if (!dev)
+				return -EINVAL;
+
+			/* Determine space required */
+			len = rtnl_parse_encap(dev, cfg->fc_encap, NULL);
+			if (len < 0)
+				return len;
+
+			total_encap_len += len;
+		}
+	}
+
+	return total_encap_len;
+}
+
+static void *__fib_get_nh_encap(const struct fib_info *fi,
+				const struct fib_nh *the_nh)
+{
+	char *cur_encap_ptr = (char *)(fi->fib_nh + fi->fib_nhs);
+
+	for_nexthops(fi) {
+		if (nh == the_nh)
+			return cur_encap_ptr;
+		cur_encap_ptr += nh->nh_encap_len;
+	} endfor_nexthops(fi);
+
+	return NULL;
+}
+
+const void *fib_get_nh_encap(const struct fib_info *fi, const struct fib_nh *nh)
+{
+	if (!nh->nh_encap_len)
+		return NULL;
+
+	return __fib_get_nh_encap(fi, nh);
+}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 
 static int fib_count_nexthops(struct rtnexthop *rtnh, int remaining)
@@ -475,6 +563,26 @@ static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh,
 			if (nexthop_nh->nh_tclassid)
 				fi->fib_net->ipv4.fib_num_tclassid_users++;
 #endif
+			nla = nla_find(attrs, attrlen, RTA_ENCAP);
+			if (nla) {
+				struct net *net = cfg->fc_nlinfo.nl_net;
+				struct net_device *dev;
+				void *nh_encap;
+				int len;
+
+				dev = __dev_get_by_index(net,
+							 nexthop_nh->nh_oif);
+				if (!dev)
+					return -EINVAL;
+
+				nh_encap = __fib_get_nh_encap(fi, nexthop_nh);
+
+				/* Fill in nh encap */
+				len = rtnl_parse_encap(dev, nla, nh_encap);
+				if (len < 0)
+					return len;
+				nexthop_nh->nh_encap_len = len;
+			}
 		}
 
 		rtnh = rtnh_next(rtnh, &remaining);
@@ -495,6 +603,17 @@ int fib_nh_match(struct fib_config *cfg, struct fib_info *fi)
 	if (cfg->fc_priority && cfg->fc_priority != fi->fib_priority)
 		return 1;
 
+	if (cfg->fc_encap) {
+		const void *nh_encap = fib_get_nh_encap(fi, fi->fib_nh);
+
+		if (!fi->fib_nh->nh_oif ||
+		    rtnl_match_encap(fi->fib_nh->nh_dev,
+				     cfg->fc_encap,
+				     fi->fib_nh->nh_encap_len,
+				     nh_encap))
+			return 1;
+	}
+
 	if (cfg->fc_oif || cfg->fc_gw) {
 		if ((!cfg->fc_oif || cfg->fc_oif == fi->fib_nh->nh_oif) &&
 		    (!cfg->fc_gw  || cfg->fc_gw == fi->fib_nh->nh_gw))
@@ -530,6 +649,17 @@ int fib_nh_match(struct fib_config *cfg, struct fib_info *fi)
 			if (nla && nla_get_u32(nla) != nh->nh_tclassid)
 				return 1;
 #endif
+			nla = nla_find(attrs, attrlen, RTA_ENCAP);
+			if (nla) {
+				const void *nh_encap = fib_get_nh_encap(fi, nh);
+
+				if (!nh->nh_oif ||
+				    rtnl_match_encap(nh->nh_dev,
+						     cfg->fc_encap,
+						     nh->nh_encap_len,
+						     nh_encap))
+					return 1;
+			}
 		}
 
 		rtnh = rtnh_next(rtnh, &remaining);
@@ -760,6 +890,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 	struct fib_info *ofi;
 	int nhs = 1;
 	struct net *net = cfg->fc_nlinfo.nl_net;
+	int encap_len;
 
 	if (cfg->fc_type > RTN_MAX)
 		goto err_inval;
@@ -798,7 +929,14 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 			goto failure;
 	}
 
-	fi = kzalloc(sizeof(*fi)+nhs*sizeof(struct fib_nh), GFP_KERNEL);
+	encap_len = fib_total_encap(cfg);
+	if (encap_len < 0) {
+		err = encap_len;
+		goto failure;
+	}
+
+	fi = kzalloc(sizeof(*fi) + nhs * sizeof(struct fib_nh) + encap_len,
+		     GFP_KERNEL);
 	if (!fi)
 		goto failure;
 	fib_info_cnt++;
@@ -886,6 +1024,26 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 		nh->nh_weight = 1;
 #endif
+		if (cfg->fc_encap) {
+			struct net_device *dev;
+			void *nh_encap;
+			int len;
+
+			err = -EINVAL;
+			dev = __dev_get_by_index(net, nh->nh_oif);
+			if (!dev)
+				goto failure;
+
+			nh_encap = __fib_get_nh_encap(fi, nh);
+
+			/* Fill in nh encap */
+			len = rtnl_parse_encap(dev, cfg->fc_encap, nh_encap);
+			if (len < 0 || len > sizeof(nh->nh_encap_len) * 8) {
+				err = len;
+				goto failure;
+			}
+			nh->nh_encap_len = len;
+		}
 	}
 
 	if (fib_props[cfg->fc_type].error) {
@@ -1023,6 +1181,8 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 	    nla_put_in_addr(skb, RTA_PREFSRC, fi->fib_prefsrc))
 		goto nla_put_failure;
 	if (fi->fib_nhs == 1) {
+		const void *nh_encap;
+
 		if (fi->fib_nh->nh_gw &&
 		    nla_put_in_addr(skb, RTA_GATEWAY, fi->fib_nh->nh_gw))
 			goto nla_put_failure;
@@ -1034,6 +1194,12 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 		    nla_put_u32(skb, RTA_FLOW, fi->fib_nh[0].nh_tclassid))
 			goto nla_put_failure;
 #endif
+
+		nh_encap = fib_get_nh_encap(fi, &fi->fib_nh[0]);
+		if (nh_encap && rtnl_fill_encap(fi->fib_nh[0].nh_dev, skb,
+						fi->fib_nh[0].nh_encap_len,
+						nh_encap))
+			goto nla_put_failure;
 	}
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	if (fi->fib_nhs > 1) {
@@ -1045,6 +1211,8 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 			goto nla_put_failure;
 
 		for_nexthops(fi) {
+			const void *nh_encap;
+
 			rtnh = nla_reserve_nohdr(skb, sizeof(*rtnh));
 			if (!rtnh)
 				goto nla_put_failure;
@@ -1061,6 +1229,13 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 			    nla_put_u32(skb, RTA_FLOW, nh->nh_tclassid))
 				goto nla_put_failure;
 #endif
+
+			nh_encap = fib_get_nh_encap(fi, nh);
+			if (nh_encap && rtnl_fill_encap(nh->nh_dev, skb,
+							nh->nh_encap_len,
+							nh_encap))
+				goto nla_put_failure;
+
 			/* length of rtnetlink header + attributes */
 			rtnh->rtnh_len = nlmsg_get_pos(skb) - (void *) rtnh;
 		} endfor_nexthops(fi);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f6055984c307..d52fa3d168a5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -110,6 +110,8 @@
 #endif
 #include <net/secure_seq.h>
 
+#include "fib_lookup.h"
+
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
 
@@ -138,6 +140,8 @@ static void		 ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 					   struct sk_buff *skb, u32 mtu);
 static void		 ip_do_redirect(struct dst_entry *dst, struct sock *sk,
 					struct sk_buff *skb);
+static int		ipv4_dst_get_encap(const struct dst_entry *dst,
+					   const void **encap);
 static void		ipv4_dst_destroy(struct dst_entry *dst);
 
 static u32 *ipv4_cow_metrics(struct dst_entry *dst, unsigned long old)
@@ -163,6 +167,7 @@ static struct dst_ops ipv4_dst_ops = {
 	.redirect =		ip_do_redirect,
 	.local_out =		__ip_local_out,
 	.neigh_lookup =		ipv4_neigh_lookup,
+	.get_encap =		ipv4_dst_get_encap,
 };
 
 #define ECN_OR_COST(class)	TC_PRIO_##class
@@ -1145,6 +1150,15 @@ static void ipv4_link_failure(struct sk_buff *skb)
 		dst_set_expires(&rt->dst, 0);
 }
 
+static int ipv4_dst_get_encap(const struct dst_entry *dst,
+			      const void **encap)
+{
+	const struct rtable *rt = (const struct rtable *)dst;
+
+	*encap = rt->rt_encap;
+	return rt->rt_encap_len;
+}
+
 static int ip_rt_bug(struct sock *sk, struct sk_buff *skb)
 {
 	pr_debug("%s: %pI4 -> %pI4, %s\n",
@@ -1394,6 +1408,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 
 	if (fi) {
 		struct fib_nh *nh = &FIB_RES_NH(*res);
+		const void *nh_encap;
 
 		if (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) {
 			rt->rt_gateway = nh->nh_gw;
@@ -1403,6 +1418,15 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
 #endif
+
+		nh_encap = fib_get_nh_encap(fi, nh);
+		if (unlikely(nh_encap)) {
+			rt->rt_encap = kmemdup(nh_encap, nh->nh_encap_len,
+					       GFP_KERNEL);
+			if (rt->rt_encap)
+				rt->rt_encap_len = nh->nh_encap_len;
+		}
+
 		if (unlikely(fnhe))
 			cached = rt_bind_exception(rt, fnhe, daddr);
 		else if (!(rt->dst.flags & DST_NOCACHE))
-- 
2.1.4


* [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-01 16:46 [RFC net-next 0/3] IP imposition of per-nh MPLS encap Robert Shearman
  2015-06-01 16:46 ` [RFC net-next 1/3] net: infra for per-nexthop encap data Robert Shearman
  2015-06-01 16:46 ` [RFC net-next 2/3] ipv4: storing and retrieval of per-nexthop encap Robert Shearman
@ 2015-06-01 16:46 ` Robert Shearman
  2015-06-02 16:15   ` roopa
  2015-06-02 18:26   ` Eric W. Biederman
  2015-06-02  0:06 ` [RFC net-next 0/3] IP imposition of per-nh MPLS encap Thomas Graf
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 32+ messages in thread
From: Robert Shearman @ 2015-06-01 16:46 UTC (permalink / raw)
  To: netdev; +Cc: Eric W. Biederman, roopa, Thomas Graf, Robert Shearman

Allow creating an mpls device for the purposes of encapsulating IP
packets with:

  ip link add type ipmpls

This device defines its per-nexthop encapsulation data as a stack of
labels, in the same format as for RTA_NEWDST. It uses the encap data
which will have been stored in the IP route to encapsulate the packet
with that stack of labels, with the last label corresponding to a
local label that defines how the packet will be sent out. The device
sends packets over loopback to the local MPLS forwarding logic which
performs all of the work.
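
For example, a two-label encap for this device would be encoded in
RTA_ENCAP as below. This is a sketch rather than part of the patches:
per nla_get_labels() the TC and TTL fields must be zero and only the
last entry carries the bottom-of-stack bit, with the TTL filled in from
the IP header at transmit time.

#include <stdint.h>
#include <arpa/inet.h>

int main(void)
{
	/* Entry layout, as for RTA_NEWDST: label bits 31-12, TC 11-9,
	 * bottom-of-stack bit 8, TTL 7-0. */
	uint32_t encap[2] = {
		htonl(200u << 12),			/* outer label 200 */
		htonl((100u << 12) | (1u << 8)),	/* label 100, S bit */
	};

	/* These 8 bytes form the RTA_ENCAP payload on a nexthop whose
	 * RTA_OIF is the ipmpls device; MAX_NEW_LABELS (2) caps the
	 * stack depth. */
	(void)encap;
	return 0;
}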

Stats are implemented, although any error in sending via the real
interface will be handled by the main mpls forwarding code and so will
not be accounted to this interface.

This implementation is based on an alternative earlier implementation
by Eric W. Biederman.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
---
 include/uapi/linux/if_arp.h |   1 +
 net/mpls/Kconfig            |   5 +
 net/mpls/Makefile           |   1 +
 net/mpls/af_mpls.c          |   2 +
 net/mpls/ipmpls.c           | 284 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 293 insertions(+)
 create mode 100644 net/mpls/ipmpls.c

diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h
index 4d024d75d64b..17d669fd1781 100644
--- a/include/uapi/linux/if_arp.h
+++ b/include/uapi/linux/if_arp.h
@@ -88,6 +88,7 @@
 #define ARPHRD_IEEE80211_RADIOTAP 803	/* IEEE 802.11 + radiotap header */
 #define ARPHRD_IEEE802154	  804
 #define ARPHRD_IEEE802154_MONITOR 805	/* IEEE 802.15.4 network monitor */
+#define ARPHRD_MPLS	806		/* IP and IPv6 over MPLS tunnels */
 
 #define ARPHRD_PHONET	820		/* PhoNet media type		*/
 #define ARPHRD_PHONET_PIPE 821		/* PhoNet pipe header		*/
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
index 17bde799c854..5264da94733a 100644
--- a/net/mpls/Kconfig
+++ b/net/mpls/Kconfig
@@ -27,4 +27,9 @@ config MPLS_ROUTING
 	help
 	 Add support for forwarding of mpls packets.
 
+config MPLS_IPTUNNEL
+	tristate "MPLS: IP over MPLS tunnel support"
+	help
+	 A network device that encapsulates ip packets as mpls
+
 endif # MPLS
diff --git a/net/mpls/Makefile b/net/mpls/Makefile
index 65bbe68c72e6..3a93c14b23c5 100644
--- a/net/mpls/Makefile
+++ b/net/mpls/Makefile
@@ -3,5 +3,6 @@
 #
 obj-$(CONFIG_NET_MPLS_GSO) += mpls_gso.o
 obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o
+obj-$(CONFIG_MPLS_IPTUNNEL) += ipmpls.o
 
 mpls_router-y := af_mpls.o
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 7b3f732269e4..68bdfbdddfaf 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -615,6 +615,7 @@ int nla_put_labels(struct sk_buff *skb, int attrtype,
 
 	return 0;
 }
+EXPORT_SYMBOL(nla_put_labels);
 
 int nla_get_labels(const struct nlattr *nla,
 		   u32 max_labels, u32 *labels, u32 label[])
@@ -660,6 +661,7 @@ int nla_get_labels(const struct nlattr *nla,
 	*labels = nla_labels;
 	return 0;
 }
+EXPORT_SYMBOL(nla_get_labels);
 
 static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
 			       struct mpls_route_config *cfg)
diff --git a/net/mpls/ipmpls.c b/net/mpls/ipmpls.c
new file mode 100644
index 000000000000..cf6894ae0c61
--- /dev/null
+++ b/net/mpls/ipmpls.c
@@ -0,0 +1,284 @@
+#include <linux/types.h>
+#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <linux/if_arp.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/module.h>
+#include <linux/mpls.h>
+#include "internal.h"
+
+static LIST_HEAD(ipmpls_dev_list);
+
+#define MAX_NEW_LABELS 2
+
+struct ipmpls_dev_priv {
+	struct net_device *out_dev;
+	struct list_head list;
+	struct net_device *dev;
+};
+
+static netdev_tx_t ipmpls_dev_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ipmpls_dev_priv *priv = netdev_priv(dev);
+	struct net_device *out_dev = priv->out_dev;
+	struct mpls_shim_hdr *hdr;
+	bool bottom_of_stack = true;
+	int len = skb->len;
+	const void *encap;
+	int num_labels;
+	unsigned ttl;
+	const u32 *labels;
+	int ret;
+	int i;
+
+	num_labels = dst_get_encap(skb, &encap) / 4;
+	if (!num_labels)
+		goto drop;
+
+	labels = encap;
+
+	/* Obtain the ttl */
+	if (skb->protocol == htons(ETH_P_IP)) {
+		ttl = ip_hdr(skb)->ttl;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		ttl = ipv6_hdr(skb)->hop_limit;
+	} else if (skb->protocol == htons(ETH_P_MPLS_UC)) {
+		ttl = mpls_entry_decode(mpls_hdr(skb)).ttl;
+		bottom_of_stack = false;
+	} else {
+		goto drop;
+	}
+
+	/* Now that the encap has been retrieved, there's no longer
+	 * any need to keep the dst around so clear it out.
+	 */
+	skb_dst_drop(skb);
+	skb_orphan(skb);
+
+	skb->inner_protocol = skb->protocol;
+	skb->inner_network_header = skb->network_header;
+
+	skb_push(skb, num_labels * sizeof(*hdr));
+	skb_reset_network_header(skb);
+	hdr = mpls_hdr(skb);
+
+	for (i = num_labels - 1; i >= 0; i--) {
+		hdr[i] = mpls_entry_encode(labels[i], ttl, 0, bottom_of_stack);
+		bottom_of_stack = false;
+	}
+
+	skb->dev = out_dev;
+	skb->protocol = htons(ETH_P_MPLS_UC);
+
+	ret = dev_hard_header(skb, out_dev, ETH_P_MPLS_UC,
+			      out_dev->dev_addr, NULL, len);
+	if (ret >= 0)
+		ret = dev_queue_xmit(skb);
+	if (ret)
+		goto drop;
+
+	dev->stats.tx_packets++;
+	dev->stats.tx_bytes += len;
+
+	return 0;
+
+drop:
+	dev->stats.tx_dropped++;
+	kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int ipmpls_dev_init(struct net_device *dev)
+{
+	struct ipmpls_dev_priv *priv = netdev_priv(dev);
+
+	list_add_tail(&priv->list, &ipmpls_dev_list);
+
+	return 0;
+}
+
+static void ipmpls_dev_uninit(struct net_device *dev)
+{
+	struct ipmpls_dev_priv *priv = netdev_priv(dev);
+
+	list_del_init(&priv->list);
+}
+
+static void ipmpls_dev_free(struct net_device *dev)
+{
+	free_netdev(dev);
+}
+
+static const struct net_device_ops ipmpls_netdev_ops = {
+	.ndo_init		= ipmpls_dev_init,
+	.ndo_start_xmit		= ipmpls_dev_xmit,
+	.ndo_uninit		= ipmpls_dev_uninit,
+};
+
+#define IPMPLS_FEATURES (NETIF_F_SG |			\
+			 NETIF_F_FRAGLIST |		\
+			 NETIF_F_HIGHDMA |		\
+			 NETIF_F_VLAN_CHALLENGED)
+
+static void ipmpls_dev_setup(struct net_device *dev)
+{
+	dev->netdev_ops		= &ipmpls_netdev_ops;
+
+	dev->type		= ARPHRD_MPLS;
+	dev->flags		= IFF_NOARP;
+	netif_keep_dst(dev);
+	dev->addr_len		= 0;
+	dev->features		|= NETIF_F_LLTX;
+	dev->features		|= IPMPLS_FEATURES;
+	dev->hw_features	|= IPMPLS_FEATURES;
+	dev->vlan_features	= 0;
+
+	dev->destructor = ipmpls_dev_free;
+}
+
+static int ipmpls_dev_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	return 0;
+}
+
+static int ipmpls_dev_newlink(struct net *src_net, struct net_device *dev,
+			      struct nlattr *tb[], struct nlattr *data[])
+{
+	struct ipmpls_dev_priv *priv = netdev_priv(dev);
+
+	priv->out_dev = src_net->loopback_dev;
+	priv->dev = dev;
+
+	dev->hard_header_len =
+		priv->out_dev->hard_header_len +
+		sizeof(struct mpls_shim_hdr) * MAX_NEW_LABELS;
+
+	return register_netdevice(dev);
+}
+
+static void ipmpls_dev_dellink(struct net_device *dev, struct list_head *head)
+{
+	unregister_netdevice_queue(dev, head);
+}
+
+static int ipmpls_dev_parse_encap(const struct net_device *dev,
+				  const struct nlattr *nla,
+				  void *encap)
+{
+	u32 labels;
+
+	if (nla_len(nla) / 4 > MAX_NEW_LABELS)
+		return -EINVAL;
+
+	if (encap && nla_get_labels(nla, MAX_NEW_LABELS, &labels, encap))
+		return -EINVAL;
+
+	/* Stored encap size is the same as the rtnl encap len */
+	return nla_len(nla);
+}
+
+static int ipmpls_dev_fill_encap(const struct net_device *dev,
+				 struct sk_buff *skb, int encap_len,
+				 const void *encap)
+{
+	return nla_put_labels(skb, RTA_ENCAP, encap_len / 4, encap);
+}
+
+static int ipmpls_dev_match_encap(const struct net_device *dev,
+				  const struct nlattr *nla, int encap_len,
+				  const void *encap)
+{
+	unsigned nla_labels;
+	struct mpls_shim_hdr *nla_label;
+	const u32 *stored_labels = encap;
+	int i;
+
+	/* Stored encap size is the same as the rtnl encap len */
+	if (nla_len(nla) != encap_len)
+		return 1;
+
+	nla_labels = nla_len(nla) / 4;
+	nla_label = nla_data(nla);
+
+	for (i = 0; i < nla_labels; i++) {
+		struct mpls_entry_decoded dec;
+
+		dec = mpls_entry_decode(nla_label + i);
+
+		if (stored_labels[i] != dec.label)
+			return 1;
+	}
+
+	return 0;
+}
+
+static struct rtnl_link_ops ipmpls_ops = {
+	.kind		= "ipmpls",
+	.priv_size	= sizeof(struct ipmpls_dev_priv),
+	.setup		= ipmpls_dev_setup,
+	.validate	= ipmpls_dev_validate,
+	.newlink	= ipmpls_dev_newlink,
+	.dellink	= ipmpls_dev_dellink,
+	.parse_encap	= ipmpls_dev_parse_encap,
+	.fill_encap	= ipmpls_dev_fill_encap,
+	.match_encap	= ipmpls_dev_match_encap,
+};
+
+static int ipmpls_dev_notify(struct notifier_block *this, unsigned long event,
+			     void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+	if (event == NETDEV_UNREGISTER) {
+		struct ipmpls_dev_priv *priv, *priv2;
+		LIST_HEAD(list_kill);
+
+		/* Ignore netns device moves */
+		if (dev->reg_state != NETREG_UNREGISTERING)
+			goto done;
+
+		list_for_each_entry_safe(priv, priv2, &ipmpls_dev_list, list) {
+			if (priv->out_dev != dev)
+				continue;
+
+			ipmpls_dev_dellink(priv->dev, &list_kill);
+		}
+		unregister_netdevice_many(&list_kill);
+	}
+done:
+	return NOTIFY_OK;
+}
+
+static struct notifier_block ipmpls_dev_notifier = {
+	.notifier_call = ipmpls_dev_notify,
+};
+
+static int __init ipmpls_init(void)
+{
+	int err;
+
+	err = register_netdevice_notifier(&ipmpls_dev_notifier);
+	if (err)
+		goto out;
+
+	err = rtnl_link_register(&ipmpls_ops);
+	if (err)
+		goto out_unregister_notifier;
+out:
+	return err;
+out_unregister_notifier:
+	unregister_netdevice_notifier(&ipmpls_dev_notifier);
+	goto out;
+}
+module_init(ipmpls_init);
+
+static void __exit ipmpls_exit(void)
+{
+	rtnl_link_unregister(&ipmpls_ops);
+	unregister_netdevice_notifier(&ipmpls_dev_notifier);
+}
+module_exit(ipmpls_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_RTNL_LINK("ipmpls");
-- 
2.1.4


* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-01 16:46 [RFC net-next 0/3] IP imposition of per-nh MPLS encap Robert Shearman
                   ` (2 preceding siblings ...)
  2015-06-01 16:46 ` [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls Robert Shearman
@ 2015-06-02  0:06 ` Thomas Graf
  2015-06-02 13:28   ` Robert Shearman
  2015-06-02 15:31 ` roopa
  2015-06-02 18:11 ` Eric W. Biederman
  5 siblings, 1 reply; 32+ messages in thread
From: Thomas Graf @ 2015-06-02  0:06 UTC (permalink / raw)
  To: Robert Shearman; +Cc: netdev, Eric W. Biederman, roopa

On 06/01/15 at 05:46pm, Robert Shearman wrote:
> In order to be able to function as a Label Edge Router in an MPLS
> network, it is necessary to be able to take IP packets and impose an
> MPLS encap and forward them out. The traditional approach of setting
> up an interface for each "tunnel" endpoint doesn't scale for the
> common MPLS use-cases where each IP route tends to be assigned a
> different label as encap.
> 
> The solution suggested here for further discussion is to provide the
> facility to define encap data on a per-nexthop basis using a new
> netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
> forwarding code, but interpreted by the virtual interface assigned to
> the nexthop.

RTA_ENCAP is currently a binary blob specific to each encapsulation
type interface. I guess this should be converted to a set of nested
Netlink attributes for each type of encap to make it extendible in
the future.

What is your plan regarding the receive side and on the matching of
encap fields? Storing the receive parameters is what led me to
storing it in skb_shared_info.


* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02  0:06 ` [RFC net-next 0/3] IP imposition of per-nh MPLS encap Thomas Graf
@ 2015-06-02 13:28   ` Robert Shearman
  2015-06-02 21:43     ` Thomas Graf
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Shearman @ 2015-06-02 13:28 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, Eric W. Biederman, roopa

On 02/06/15 01:06, Thomas Graf wrote:
> On 06/01/15 at 05:46pm, Robert Shearman wrote:
>> In order to be able to function as a Label Edge Router in an MPLS
>> network, it is necessary to be able to take IP packets and impose an
>> MPLS encap and forward them out. The traditional approach of setting
>> up an interface for each "tunnel" endpoint doesn't scale for the
>> common MPLS use-cases where each IP route tends to be assigned a
>> different label as encap.
>>
>> The solution suggested here for further discussion is to provide the
>> facility to define encap data on a per-nexthop basis using a new
>> netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
>> forwarding code, but interpreted by the virtual interface assigned to
>> the nexthop.
>
> RTA_ENCAP is currently a binary blob specific to each encapsulation
> type interface. I guess this should be converted to a set of nested
> Netlink attributes for each type of encap to make it extendible in
> the future.

Nesting attributes inside the RTA_ENCAP blob should be supported by the 
patch series today. Something like this:

+enum rta_tunnel_t {
+	RTA_TUN_UNSPEC,
+	RTA_TUN_ID,
+	RTA_TUN_DST,
+	RTA_TUN_SRC,
+	RTA_TUN_TTL,
+	RTA_TUN_TOS,
+	RTA_TUN_SPORT,
+	RTA_TUN_DPORT,
+	RTA_TUN_FLAGS,
+	RTA_TUN_MAX,
+};
+
+static const struct nla_policy tunnel_policy[RTA_TUN_MAX + 1] = {
+	[RTA_TUN_ID]		= { .type = NLA_U64 },
+	[RTA_TUN_DST]		= { .type = NLA_U32 },
+	[RTA_TUN_SRC]		= { .type = NLA_U32 },
+	[RTA_TUN_TTL]		= { .type = NLA_U8 },
+	[RTA_TUN_TOS]		= { .type = NLA_U8 },
+	[RTA_TUN_SPORT]		= { .type = NLA_U16 },
+	[RTA_TUN_DPORT]		= { .type = NLA_U16 },
+	[RTA_TUN_FLAGS]		= { .type = NLA_U16 },
+};
+
+static int vxlan_parse_encap(const struct net_device *dev,
+			     const struct nlattr *nla,
+			     void *encap)
+{
+	if (encap) {
+		struct ip_tunnel_info *tun_info = encap;
+		struct nlattr *tb[RTA_TUN_MAX+1];
+		int err;
+
+		err = nla_parse_nested(tb, RTA_TUN_MAX, nla, tunnel_policy);
+		if (err < 0)
+			return err;
+
+		if (tb[RTA_TUN_ID])
+			tun_info->key.tun_id = nla_get_u64(tb[RTA_TUN_ID]);
+
+		if (tb[RTA_TUN_DST])
+			tun_info->key.ipv4_dst = nla_get_be32(tb[RTA_TUN_DST]);
+
+		if (tb[RTA_TUN_SRC])
+			tun_info->key.ipv4_src = nla_get_be32(tb[RTA_TUN_SRC]);
+
+		if (tb[RTA_TUN_TTL])
+			tun_info->key.ipv4_ttl = nla_get_u8(tb[RTA_TUN_TTL]);
+
+		if (tb[RTA_TUN_TOS])
+			tun_info->key.ipv4_tos = nla_get_u8(tb[RTA_TUN_TOS]);
+
+		if (tb[RTA_TUN_SPORT])
+			tun_info->key.tp_src = nla_get_be16(tb[RTA_TUN_SPORT]);
+
+		if (tb[RTA_TUN_DPORT])
+			tun_info->key.tp_dst = nla_get_be16(tb[RTA_TUN_DPORT]);
+
+		if (tb[RTA_TUN_FLAGS])
+			tun_info->key.tun_flags = nla_get_u16(tb[RTA_TUN_FLAGS]);
+
+		tun_info->options = NULL;
+		tun_info->options_len = 0;
+	}
+
+	return sizeof(struct ip_tunnel_info);
+}
+
+static int vxlan_fill_encap(const struct net_device *dev,
+			    struct sk_buff *skb, int encap_len,
+			    const void *encap)
+{
+	const struct ip_tunnel_info *tun_info = encap;
+	struct nlattr *encap_attr;
+
+	encap_attr = nla_nest_start(skb, RTA_ENCAP);
+	if (!encap_attr)
+		return -ENOMEM;
+
+	if (nla_put_u64(skb, RTA_TUN_ID, tun_info->key.tun_id) ||
+	    nla_put_be32(skb, RTA_TUN_DST, tun_info->key.ipv4_dst) ||
+	    nla_put_be32(skb, RTA_TUN_SRC, tun_info->key.ipv4_src) ||
+	    nla_put_u8(skb, RTA_TUN_TOS, tun_info->key.ipv4_tos) ||
+	    nla_put_u8(skb, RTA_TUN_TTL, tun_info->key.ipv4_ttl) ||
+	    nla_put_u16(skb, RTA_TUN_SPORT, tun_info->key.tp_src) ||
+	    nla_put_u16(skb, RTA_TUN_DPORT, tun_info->key.tp_dst) ||
+	    nla_put_u16(skb, RTA_TUN_FLAGS, tun_info->key.tun_flags))
+		return -ENOMEM;
+
+	nla_nest_end(skb, encap_attr);
+
+	return 0;
+}
+
+static int vxlan_match_encap(const struct net_device *dev,
+			     const struct nlattr *nla, int encap_len,
+			     const void *encap)
+{
+	const struct ip_tunnel_info *tun_info = encap;
+	struct nlattr *tb[RTA_TUN_MAX+1];
+	int err;
+
+	err = nla_parse_nested(tb, RTA_TUN_MAX, nla, tunnel_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[RTA_TUN_ID] &&
+	    tun_info->key.tun_id != nla_get_u64(tb[RTA_TUN_ID]))
+		return 1;
+
+	if (tb[RTA_TUN_DST] &&
+	    tun_info->key.ipv4_dst != nla_get_be32(tb[RTA_TUN_DST]))
+		return 1;
+
+	if (tb[RTA_TUN_SRC] &&
+	    tun_info->key.ipv4_src != nla_get_be32(tb[RTA_TUN_SRC]))
+		return 1;
+
+	if (tb[RTA_TUN_TTL] &&
+	    tun_info->key.ipv4_ttl != nla_get_u8(tb[RTA_TUN_TTL]))
+		return 1;
+
+	if (tb[RTA_TUN_TOS] &&
+	    tun_info->key.ipv4_tos != nla_get_u8(tb[RTA_TUN_TOS]))
+		return 1;
+
+	if (tb[RTA_TUN_SPORT] &&
+	    tun_info->key.tp_src != nla_get_be16(tb[RTA_TUN_SPORT]))
+		return 1;
+
+	if (tb[RTA_TUN_DPORT] &&
+	    tun_info->key.tp_dst != nla_get_be16(tb[RTA_TUN_DPORT]))
+		return 1;
+
+	if (tb[RTA_TUN_FLAGS] &&
+	    tun_info->key.tun_flags != nla_get_u16(tb[RTA_TUN_FLAGS]))
+		return 1;
+
+	return 0;
+}
+
  static struct rtnl_link_ops vxlan_link_ops __read_mostly = {
  	.kind		= "vxlan",
  	.maxtype	= IFLA_VXLAN_MAX,
@@ -2893,6 +3093,9 @@ static struct rtnl_link_ops vxlan_link_ops __read_mostly = {
  	.get_size	= vxlan_get_size,
  	.fill_info	= vxlan_fill_info,
  	.get_link_net	= vxlan_get_link_net,
+	.parse_encap	= vxlan_parse_encap,
+	.fill_encap	= vxlan_fill_encap,
+	.match_encap	= vxlan_match_encap,
  };


> What is your plan regarding the receive side and on the matching of
> encap fields? Storing the receive parameters is what lead me to
> storing it in skb_shared_info.

No plan for the receive side and it wouldn't easily fit in with my 
approach, so you'll need to implement that separately.

Thanks,
Rob


* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-01 16:46 [RFC net-next 0/3] IP imposition of per-nh MPLS encap Robert Shearman
                   ` (3 preceding siblings ...)
  2015-06-02  0:06 ` [RFC net-next 0/3] IP imposition of per-nh MPLS encap Thomas Graf
@ 2015-06-02 15:31 ` roopa
  2015-06-02 18:30   ` Eric W. Biederman
  2015-06-02 18:11 ` Eric W. Biederman
  5 siblings, 1 reply; 32+ messages in thread
From: roopa @ 2015-06-02 15:31 UTC (permalink / raw)
  To: Robert Shearman; +Cc: netdev, Eric W. Biederman, Thomas Graf, Vivek Venkatraman

On 6/1/15, 9:46 AM, Robert Shearman wrote:
> In order to be able to function as a Label Edge Router in an MPLS
> network, it is necessary to be able to take IP packets and impose an
> MPLS encap and forward them out. The traditional approach of setting
> up an interface for each "tunnel" endpoint doesn't scale for the
> common MPLS use-cases where each IP route tends to be assigned a
> different label as encap.
>
> The solution suggested here for further discussion is to provide the
> facility to define encap data on a per-nexthop basis using a new
> netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
> forwarding code, but interpreted by the virtual interface assigned to
> the nexthop.
>
> A new ipmpls interface type is defined to show the use of this
> facility to allow IP packets to be imposed with an MPLS
> encap. However, the facility is designed to be general enough to be
> used by any encapsulation/tunneling mechanism that has similar
> requirements of high-scale, high-variation-of-encap.
>
> RFC because:
>   - IPv6 side not implemented
>   - struct rtable shouldn't be bloated by pointer+uint
>   - Hasn't been thoroughly tested yet
>
> Robert Shearman (3):
>    net: infra for per-nexthop encap data
>    ipv4: storing and retrieval of per-nexthop encap
>    mpls: new ipmpls device for encapsulating IP packets as mpls
>
>
Glad to see these patches!
I have a similar series I have been working on... but with no netdevice.
It uses a set of ops similar to iptun_encaps; I store the encap data in
fib_nh, and in ip_route_output_slow I point dst.output to the output
function provided by one of the encap ops.

I see the advantages of using a netdevice... and I see this aligns with
the patches from Thomas.
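
Roughly the shape of that approach (a sketch only, with hypothetical
names, not actual code from my series):

struct fib_encap_ops {
	int	(*build_state)(struct nlattr *nla, void **statep);
	int	(*output)(struct sock *sk, struct sk_buff *skb);
	int	(*fill_info)(struct sk_buff *skb, const void *state);
};

/* registered per encap type, looked up when the route is built */
static const struct fib_encap_ops *encap_ops[FIB_ENCAP_MAX + 1];

/* in ip_route_output_slow(), once the nexthop is known: */
	if (nh->nh_encap_state)
		rt->dst.output = encap_ops[nh->nh_encap_type]->output;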


* Re: [RFC net-next 2/3] ipv4: storing and retrieval of per-nexthop encap
  2015-06-01 16:46 ` [RFC net-next 2/3] ipv4: storing and retrieval of per-nexthop encap Robert Shearman
@ 2015-06-02 16:01   ` roopa
  2015-06-02 16:35     ` Robert Shearman
  0 siblings, 1 reply; 32+ messages in thread
From: roopa @ 2015-06-02 16:01 UTC (permalink / raw)
  To: Robert Shearman
  Cc: netdev, Eric W. Biederman, Thomas Graf, Dinesh Dutt, Vivek Venkatraman

On 6/1/15, 9:46 AM, Robert Shearman wrote:
> Parse RTA_ENCAP attribute for one path and multipath routes. The encap
> length is stored in a newly added field to fib_nh, nh_encap_len,
> although this is added to a padding hole in the structure so that it
> doesn't increase the size at all. The encap data itself is stored at
> the end of the array of nexthops. Whilst this means that retrieval
> isn't optimal, especially if there are multiple nexthops, this avoids
> the memory cost of an extra pointer, as well as any potential change
> to the cache or instruction layout that could cause a performance
> impact.
>
> Currently, the dst structure allocated to represent the destination of
> the packet and used for retrieving the encap by the encap-type
> interface has been grown through the addition of the rt_encap_len and
> rt_encap fields. This isn't desirable and could be fixed by defining a
> new destination type with operations copied from the normal case,
> other than the addition of the get_encap operation.
>
> Signed-off-by: Robert Shearman <rshearma@brocade.com>
> ---
>   include/net/ip_fib.h     |   2 +
>   include/net/route.h      |   3 +
>   net/ipv4/fib_frontend.c  |   3 +
>   net/ipv4/fib_lookup.h    |   2 +
>   net/ipv4/fib_semantics.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++-
>   net/ipv4/route.c         |  24 +++++++
>   6 files changed, 211 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 54271ed0ed45..a06cec5eb3aa 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -44,6 +44,7 @@ struct fib_config {
>   	u32			fc_flow;
>   	u32			fc_nlflags;
>   	struct nl_info		fc_nlinfo;
> +	struct nlattr *fc_encap;
>    };
>   
>   struct fib_info;
> @@ -75,6 +76,7 @@ struct fib_nh {
>   	struct fib_info		*nh_parent;
>   	unsigned int		nh_flags;
>   	unsigned char		nh_scope;
> +	unsigned char		nh_encap_len;
>   #ifdef CONFIG_IP_ROUTE_MULTIPATH
>   	int			nh_weight;
>   	int			nh_power;
> diff --git a/include/net/route.h b/include/net/route.h
> index fe22d03afb6a..e8b58914c4c1 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -64,6 +64,9 @@ struct rtable {
>   	/* Miscellaneous cached information */
>   	u32			rt_pmtu;
>   
> +	unsigned int		rt_encap_len;
> +	void			*rt_encap;
> +
>   	struct list_head	rt_uncached;
>   	struct uncached_list	*rt_uncached_list;
>   };
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 872494e6e6eb..aa538ab7e3b9 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -656,6 +656,9 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
>   		case RTA_TABLE:
>   			cfg->fc_table = nla_get_u32(attr);
>   			break;
> +		case RTA_ENCAP:
> +			cfg->fc_encap = attr;
> +			break;
>   		}
>   	}
>   
> diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
> index c6211ed60b03..003318c51ae8 100644
> --- a/net/ipv4/fib_lookup.h
> +++ b/net/ipv4/fib_lookup.h
> @@ -34,6 +34,8 @@ int fib_dump_info(struct sk_buff *skb, u32 pid, u32 seq, int event, u32 tb_id,
>   		  unsigned int);
>   void rtmsg_fib(int event, __be32 key, struct fib_alias *fa, int dst_len,
>   	       u32 tb_id, const struct nl_info *info, unsigned int nlm_flags);
> +const void *fib_get_nh_encap(const struct fib_info *fi,
> +			     const struct fib_nh *nh);
>   
>   static inline void fib_result_assign(struct fib_result *res,
>   				     struct fib_info *fi)
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 28ec3c1823bf..db466b636241 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -257,6 +257,9 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
>   	const struct fib_nh *onh = ofi->fib_nh;
>   
>   	for_nexthops(fi) {
> +		const void *onh_encap = fib_get_nh_encap(ofi, onh);
> +		const void *nh_encap = fib_get_nh_encap(fi, nh);
> +
>   		if (nh->nh_oif != onh->nh_oif ||
>   		    nh->nh_gw  != onh->nh_gw ||
>   		    nh->nh_scope != onh->nh_scope ||
> @@ -266,7 +269,10 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
>   #ifdef CONFIG_IP_ROUTE_CLASSID
>   		    nh->nh_tclassid != onh->nh_tclassid ||
>   #endif
> -		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
> +		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD) ||
> +		    nh->nh_encap_len != onh->nh_encap_len ||
> +		    memcmp(nh_encap, onh_encap, nh->nh_encap_len)
> +			)
>   			return -1;
>   		onh++;
>   	} endfor_nexthops(fi);
> @@ -374,6 +380,11 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi)
>   		/* may contain flow and gateway attribute */
>   		nhsize += 2 * nla_total_size(4);
>   
> +		for_nexthops(fi) {
> +			if (nh->nh_encap_len)
> +				nhsize += nla_total_size(nh->nh_encap_len);
> +		} endfor_nexthops(fi);
> +
>   		/* all nexthops are packed in a nested attribute */
>   		payload += nla_total_size(fi->fib_nhs * nhsize);
>   	}
> @@ -434,6 +445,83 @@ static int fib_detect_death(struct fib_info *fi, int order,
>   	return 1;
>   }
>   
> +static int fib_total_encap(struct fib_config *cfg)
> +{
> +	struct net *net = cfg->fc_nlinfo.nl_net;
> +	int total_encap_len = 0;
> +
> +	if (cfg->fc_mp) {
> +		int remaining = cfg->fc_mp_len;
> +		struct rtnexthop *rtnh = cfg->fc_mp;
> +
> +		while (rtnh_ok(rtnh, remaining)) {
> +			struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
> +			int attrlen;
> +
> +			attrlen = rtnh_attrlen(rtnh);
> +			nla = nla_find(attrs, attrlen, RTA_ENCAP);
> +			if (nla) {
> +				struct net_device *dev;
> +				int len;
> +
> +				dev = __dev_get_by_index(net,
> +							 rtnh->rtnh_ifindex);
> +				if (!dev)
> +					return -EINVAL;
> +
> +				/* Determine space required */
> +				len = rtnl_parse_encap(dev, nla, NULL);
> +				if (len < 0)
> +					return len;
> +
> +				total_encap_len += len;
> +			}
> +
> +			rtnh = rtnh_next(rtnh, &remaining);
> +		}
> +	} else {
> +		if (cfg->fc_encap) {
> +			struct net_device *dev;
> +			int len;
> +
> +			dev = __dev_get_by_index(net, cfg->fc_oif);
> +			if (!dev)
> +				return -EINVAL;
> +
> +			/* Determine space required */
> +			len = rtnl_parse_encap(dev, cfg->fc_encap, NULL);
> +			if (len < 0)
> +				return len;
> +
> +			total_encap_len += len;
> +		}
> +	}
> +
> +	return total_encap_len;
> +}
We could avoid parsing and finding this device twice if fib_nh just
held a pointer to the encap_info (or tunnel info), couldn't we? And the
encap_info/tun_info could be refcounted and shared between nexthops. In
my implementation I have just a pointer to the parsed encap state in
fib_nh.
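
Something along these lines, i.e. a refcounted blob that both fib_nh
and the rtable could point at (sketch only, hypothetical names):

struct nh_encap_info {
	atomic_t	refcnt;
	unsigned int	len;
	u8		data[0];	/* len bytes of encap-type-specific data */
};

static inline struct nh_encap_info *nh_encap_get(struct nh_encap_info *e)
{
	if (e)
		atomic_inc(&e->refcnt);
	return e;
}

static inline void nh_encap_put(struct nh_encap_info *e)
{
	if (e && atomic_dec_and_test(&e->refcnt))
		kfree(e);
}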

> +
> +static void *__fib_get_nh_encap(const struct fib_info *fi,
> +				const struct fib_nh *the_nh)
> +{
> +	char *cur_encap_ptr = (char *)(fi->fib_nh + fi->fib_nhs);
> +
> +	for_nexthops(fi) {
> +		if (nh == the_nh)
> +			return cur_encap_ptr;
> +		cur_encap_ptr += nh->nh_encap_len;
> +	} endfor_nexthops(fi);
> +
> +	return NULL;
> +}
> +
> +const void *fib_get_nh_encap(const struct fib_info *fi, const struct fib_nh *nh)
> +{
> +	if (!nh->nh_encap_len)
> +		return NULL;
> +
> +	return __fib_get_nh_encap(fi, nh);
> +}
> +
>   #ifdef CONFIG_IP_ROUTE_MULTIPATH
>   
>   static int fib_count_nexthops(struct rtnexthop *rtnh, int remaining)
> @@ -475,6 +563,26 @@ static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh,
>   			if (nexthop_nh->nh_tclassid)
>   				fi->fib_net->ipv4.fib_num_tclassid_users++;
>   #endif
> +			nla = nla_find(attrs, attrlen, RTA_ENCAP);
> +			if (nla) {
> +				struct net *net = cfg->fc_nlinfo.nl_net;
> +				struct net_device *dev;
> +				void *nh_encap;
> +				int len;
> +
> +				dev = __dev_get_by_index(net,
> +							 nexthop_nh->nh_oif);
> +				if (!dev)
> +					return -EINVAL;
> +
> +				nh_encap = __fib_get_nh_encap(fi, nexthop_nh);
> +
> +				/* Fill in nh encap */
> +				len = rtnl_parse_encap(dev, nla, nh_encap);
> +				if (len < 0)
> +					return len;
> +				nexthop_nh->nh_encap_len = len;
> +			}
>   		}
>   
>   		rtnh = rtnh_next(rtnh, &remaining);
> @@ -495,6 +603,17 @@ int fib_nh_match(struct fib_config *cfg, struct fib_info *fi)
>   	if (cfg->fc_priority && cfg->fc_priority != fi->fib_priority)
>   		return 1;
>   
> +	if (cfg->fc_encap) {
> +		const void *nh_encap = fib_get_nh_encap(fi, fi->fib_nh);
> +
> +		if (!fi->fib_nh->nh_oif ||
> +		    rtnl_match_encap(fi->fib_nh->nh_dev,
> +				     cfg->fc_encap,
> +				     fi->fib_nh->nh_encap_len,
> +				     nh_encap))
> +			return 1;
> +	}
> +
>   	if (cfg->fc_oif || cfg->fc_gw) {
>   		if ((!cfg->fc_oif || cfg->fc_oif == fi->fib_nh->nh_oif) &&
>   		    (!cfg->fc_gw  || cfg->fc_gw == fi->fib_nh->nh_gw))
> @@ -530,6 +649,17 @@ int fib_nh_match(struct fib_config *cfg, struct fib_info *fi)
>   			if (nla && nla_get_u32(nla) != nh->nh_tclassid)
>   				return 1;
>   #endif
> +			nla = nla_find(attrs, attrlen, RTA_ENCAP);
> +			if (nla) {
> +				const void *nh_encap = fib_get_nh_encap(fi, nh);
> +
> +				if (!nh->nh_oif ||
> +				    rtnl_match_encap(nh->nh_dev,
> +						     cfg->fc_encap,
> +						     nh->nh_encap_len,
> +						     nh_encap))
> +					return 1;
> +			}
>   		}
>   
>   		rtnh = rtnh_next(rtnh, &remaining);
> @@ -760,6 +890,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>   	struct fib_info *ofi;
>   	int nhs = 1;
>   	struct net *net = cfg->fc_nlinfo.nl_net;
> +	int encap_len;
>   
>   	if (cfg->fc_type > RTN_MAX)
>   		goto err_inval;
> @@ -798,7 +929,14 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>   			goto failure;
>   	}
>   
> -	fi = kzalloc(sizeof(*fi)+nhs*sizeof(struct fib_nh), GFP_KERNEL);
> +	encap_len = fib_total_encap(cfg);
> +	if (encap_len < 0) {
> +		err = encap_len;
> +		goto failure;
> +	}
> +
> +	fi = kzalloc(sizeof(*fi) + nhs * sizeof(struct fib_nh) + encap_len,
> +		     GFP_KERNEL);
>   	if (!fi)
>   		goto failure;
>   	fib_info_cnt++;
> @@ -886,6 +1024,26 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>   #ifdef CONFIG_IP_ROUTE_MULTIPATH
>   		nh->nh_weight = 1;
>   #endif
> +		if (cfg->fc_encap) {
> +			struct net_device *dev;
> +			void *nh_encap;
> +			int len;
> +
> +			err = -EINVAL;
> +			dev = __dev_get_by_index(net, nh->nh_oif);
> +			if (!dev)
> +				goto failure;
> +
> +			nh_encap = __fib_get_nh_encap(fi, nh);
> +
> +			/* Fill in nh encap */
> +			len = rtnl_parse_encap(dev, cfg->fc_encap, nh_encap);
> +			if (len < 0 || len > sizeof(nh->nh_encap_len) * 8) {
> +				err = len;
> +				goto failure;
> +			}
> +			nh->nh_encap_len = len;
> +		}
>   	}
>   
>   	if (fib_props[cfg->fc_type].error) {
> @@ -1023,6 +1181,8 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
>   	    nla_put_in_addr(skb, RTA_PREFSRC, fi->fib_prefsrc))
>   		goto nla_put_failure;
>   	if (fi->fib_nhs == 1) {
> +		const void *nh_encap;
> +
>   		if (fi->fib_nh->nh_gw &&
>   		    nla_put_in_addr(skb, RTA_GATEWAY, fi->fib_nh->nh_gw))
>   			goto nla_put_failure;
> @@ -1034,6 +1194,12 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
>   		    nla_put_u32(skb, RTA_FLOW, fi->fib_nh[0].nh_tclassid))
>   			goto nla_put_failure;
>   #endif
> +
> +		nh_encap = fib_get_nh_encap(fi, &fi->fib_nh[0]);
> +		if (nh_encap && rtnl_fill_encap(fi->fib_nh[0].nh_dev, skb,
> +						fi->fib_nh[0].nh_encap_len,
> +						nh_encap))
> +			goto nla_put_failure;
>   	}
>   #ifdef CONFIG_IP_ROUTE_MULTIPATH
>   	if (fi->fib_nhs > 1) {
> @@ -1045,6 +1211,8 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
>   			goto nla_put_failure;
>   
>   		for_nexthops(fi) {
> +			const void *nh_encap;
> +
>   			rtnh = nla_reserve_nohdr(skb, sizeof(*rtnh));
>   			if (!rtnh)
>   				goto nla_put_failure;
> @@ -1061,6 +1229,13 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
>   			    nla_put_u32(skb, RTA_FLOW, nh->nh_tclassid))
>   				goto nla_put_failure;
>   #endif
> +
> +			nh_encap = fib_get_nh_encap(fi, nh);
> +			if (nh_encap && rtnl_fill_encap(nh->nh_dev, skb,
> +							nh->nh_encap_len,
> +							nh_encap))
> +				goto nla_put_failure;
> +
>   			/* length of rtnetlink header + attributes */
>   			rtnh->rtnh_len = nlmsg_get_pos(skb) - (void *) rtnh;
>   		} endfor_nexthops(fi);
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f6055984c307..d52fa3d168a5 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -110,6 +110,8 @@
>   #endif
>   #include <net/secure_seq.h>
>   
> +#include "fib_lookup.h"
> +
>   #define RT_FL_TOS(oldflp4) \
>   	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
>   
> @@ -138,6 +140,8 @@ static void		 ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
>   					   struct sk_buff *skb, u32 mtu);
>   static void		 ip_do_redirect(struct dst_entry *dst, struct sock *sk,
>   					struct sk_buff *skb);
> +static int		ipv4_dst_get_encap(const struct dst_entry *dst,
> +					   const void **encap);
>   static void		ipv4_dst_destroy(struct dst_entry *dst);
>   
>   static u32 *ipv4_cow_metrics(struct dst_entry *dst, unsigned long old)
> @@ -163,6 +167,7 @@ static struct dst_ops ipv4_dst_ops = {
>   	.redirect =		ip_do_redirect,
>   	.local_out =		__ip_local_out,
>   	.neigh_lookup =		ipv4_neigh_lookup,
> +	.get_encap =		ipv4_dst_get_encap,
>   };
>   
>   #define ECN_OR_COST(class)	TC_PRIO_##class
> @@ -1145,6 +1150,15 @@ static void ipv4_link_failure(struct sk_buff *skb)
>   		dst_set_expires(&rt->dst, 0);
>   }
>   
> +static int ipv4_dst_get_encap(const struct dst_entry *dst,
> +			      const void **encap)
> +{
> +	const struct rtable *rt = (const struct rtable *)dst;
> +
> +	*encap = rt->rt_encap;
> +	return rt->rt_encap_len;
> +}
> +
>   static int ip_rt_bug(struct sock *sk, struct sk_buff *skb)
>   {
>   	pr_debug("%s: %pI4 -> %pI4, %s\n",
> @@ -1394,6 +1408,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
>   
>   	if (fi) {
>   		struct fib_nh *nh = &FIB_RES_NH(*res);
> +		const void *nh_encap;
>   
>   		if (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) {
>   			rt->rt_gateway = nh->nh_gw;
> @@ -1403,6 +1418,15 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
>   #ifdef CONFIG_IP_ROUTE_CLASSID
>   		rt->dst.tclassid = nh->nh_tclassid;
>   #endif
> +
> +		nh_encap = fib_get_nh_encap(fi, nh);
> +		if (unlikely(nh_encap)) {
> +			rt->rt_encap = kmemdup(nh_encap, nh->nh_encap_len,
> +					       GFP_KERNEL);
> +			if (rt->rt_encap)
> +				rt->rt_encap_len = nh->nh_encap_len;
> +		}
> +

And..., you could make the rtable point to the same encap info.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-01 16:46 ` [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls Robert Shearman
@ 2015-06-02 16:15   ` roopa
  2015-06-02 16:33     ` Robert Shearman
  2015-06-02 18:26   ` Eric W. Biederman
  1 sibling, 1 reply; 32+ messages in thread
From: roopa @ 2015-06-02 16:15 UTC (permalink / raw)
  To: Robert Shearman
  Cc: netdev, Eric W. Biederman, Thomas Graf, Dinesh Dutt, Vivek Venkatraman

On 6/1/15, 9:46 AM, Robert Shearman wrote:
> Allow creating an mpls device for the purposes of encapsulating IP
> packets with:
>
>    ip link add type ipmpls
>
> This device defines its per-nexthop encapsulation data as a stack of
> labels, in the same format as for RTA_NEWDST. It uses the encap data
> which will have been stored in the IP route to encapsulate the packet
> with that stack of labels, with the last label corresponding to a
> local label that defines how the packet will be sent out. The device
> sends packets over loopback to the local MPLS forwarding logic which
> performs all of the work.
>
>
Maybe a silly question, but when you loop the packet back, what does the 
local MPLS forwarding logic
lookup with ? It probably assumes there is a mpls route with that label 
and nexthop.
Will this need any internal labels (thinking same label stack different 
tunnel device etc) ?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-02 16:15   ` roopa
@ 2015-06-02 16:33     ` Robert Shearman
  2015-06-02 18:57       ` roopa
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Shearman @ 2015-06-02 16:33 UTC (permalink / raw)
  To: roopa
  Cc: netdev, Eric W. Biederman, Thomas Graf, Dinesh Dutt, Vivek Venkatraman

On 02/06/15 17:15, roopa wrote:
> On 6/1/15, 9:46 AM, Robert Shearman wrote:
>> Allow creating an mpls device for the purposes of encapsulating IP
>> packets with:
>>
>>    ip link add type ipmpls
>>
>> This device defines its per-nexthop encapsulation data as a stack of
>> labels, in the same format as for RTA_NEWDST. It uses the encap data
>> which will have been stored in the IP route to encapsulate the packet
>> with that stack of labels, with the last label corresponding to a
>> local label that defines how the packet will be sent out. The device
>> sends packets over loopback to the local MPLS forwarding logic which
>> performs all of the work.
>>
>>
> Maybe a silly question, but when you loop the packet back, what does the
> local MPLS forwarding logic
> lookup with ? It probably assumes there is a mpls route with that label
> and nexthop.
> Will this need any internal labels (thinking same label stack different
> tunnel device etc) ?

Yes, it requires that local/internal labels have been allocated and 
label routes installed in the label table for them.

It is entirely possible to put the outgoing interface into the encap 
data to avoid having to allocate extra labels, but I did it this way in 
order to support PIC Core for MPLS-VPN routes.

Note: I have two extra patches which avoid using the loopback device 
(which causes the TTL to end up being one less than it should on 
output), but I haven't posted them here because they were dependent on 
other mpls changes in my tree.

Thanks,
Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 2/3] ipv4: storing and retrieval of per-nexthop encap
  2015-06-02 16:01   ` roopa
@ 2015-06-02 16:35     ` Robert Shearman
  0 siblings, 0 replies; 32+ messages in thread
From: Robert Shearman @ 2015-06-02 16:35 UTC (permalink / raw)
  To: roopa
  Cc: netdev, Eric W. Biederman, Thomas Graf, Dinesh Dutt, Vivek Venkatraman

On 02/06/15 17:01, roopa wrote:
> On 6/1/15, 9:46 AM, Robert Shearman wrote:
>> Parse RTA_ENCAP attribute for one path and multipath routes. The encap
>> length is stored in a newly added field to fib_nh, nh_encap_len,
>> although this is added to a padding hole in the structure so that it
>> doesn't increase the size at all. The encap data itself is stored at
>> the end of the array of nexthops. Whilst this means that retrieval
>> isn't optimal, especially if there are multiple nexthops, this avoids
>> the memory cost of an extra pointer, as well as any potential change
>> to the cache or instruction layout that could cause a performance
>> impact.
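(Aside, purely for illustration and not from the patch: given the layout
described above, one plausible shape for the getter is to walk the
nexthops and skip over the preceding blobs; all names below are invented.)

/* Sketch, assuming <net/ip_fib.h>: the encap blobs are taken to sit
 * immediately after the fib_nh array, in nexthop order, each
 * nh_encap_len bytes long.
 */
static const void *example_get_nh_encap(const struct fib_info *fi,
					const struct fib_nh *target)
{
	const u8 *blob = (const u8 *)&fi->fib_nh[fi->fib_nhs];
	int i;

	for (i = 0; i < fi->fib_nhs; i++) {
		const struct fib_nh *nh = &fi->fib_nh[i];

		if (nh == target)
			return nh->nh_encap_len ? blob : NULL;
		blob += nh->nh_encap_len;
	}
	return NULL;
}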
>>
>> Currently, the dst structure allocated to represent the destination of
>> the packet and used for retrieving the encap by the encap-type
>> interface has been grown through the addition of the rt_encap_len and
>> rt_encap fields. This isn't desirable and could be fixed by defining a
>> new destination type with operations copied from the normal case,
>> other than the addition of the get_encap operation.
>>
>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
...
>> @@ -434,6 +445,83 @@ static int fib_detect_death(struct fib_info *fi,
>> int order,
>>       return 1;
>>   }
>> +static int fib_total_encap(struct fib_config *cfg)
>> +{
>> +    struct net *net = cfg->fc_nlinfo.nl_net;
>> +    int total_encap_len = 0;
>> +
>> +    if (cfg->fc_mp) {
>> +        int remaining = cfg->fc_mp_len;
>> +        struct rtnexthop *rtnh = cfg->fc_mp;
>> +
>> +        while (rtnh_ok(rtnh, remaining)) {
>> +            struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
>> +            int attrlen;
>> +
>> +            attrlen = rtnh_attrlen(rtnh);
>> +            nla = nla_find(attrs, attrlen, RTA_ENCAP);
>> +            if (nla) {
>> +                struct net_device *dev;
>> +                int len;
>> +
>> +                dev = __dev_get_by_index(net,
>> +                             rtnh->rtnh_ifindex);
>> +                if (!dev)
>> +                    return -EINVAL;
>> +
>> +                /* Determine space required */
>> +                len = rtnl_parse_encap(dev, nla, NULL);
>> +                if (len < 0)
>> +                    return len;
>> +
>> +                total_encap_len += len;
>> +            }
>> +
>> +            rtnh = rtnh_next(rtnh, &remaining);
>> +        }
>> +    } else {
>> +        if (cfg->fc_encap) {
>> +            struct net_device *dev;
>> +            int len;
>> +
>> +            dev = __dev_get_by_index(net, cfg->fc_oif);
>> +            if (!dev)
>> +                return -EINVAL;
>> +
>> +            /* Determine space required */
>> +            len = rtnl_parse_encap(dev, cfg->fc_encap, NULL);
>> +            if (len < 0)
>> +                return len;
>> +
>> +            total_encap_len += len;
>> +        }
>> +    }
>> +
>> +    return total_encap_len;
>> +}
> we could avoid parsing and finding this device twice, if fib_nh just
> held a pointer to the encap_info
> (or tunnel info) ?. And the encap_info/tun_info could be refcounted and
> shared between
> nexthops ?. In my implementation i have just a pointer to parsed encap
> state
> in fib_nh

Right - I took the approach here to avoid any extra memory use if encap 
isn't in use for the nexthop/route, and to avoid any potential 
performance penalty caused by extra cache misses. It would certainly 
make things simpler if those weren't concerns. I'd appreciate input from 
the community on this.
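(To make the trade-off concrete, a sketch of the pointer-based alternative
being discussed; the names are invented and this is not code from either
series.)

struct example_encap_info {
	atomic_t	refcnt;
	int		len;
	u8		data[0];	/* parsed, device-interpreted encap */
};

static struct example_encap_info *example_encap_get(struct example_encap_info *e)
{
	if (e)
		atomic_inc(&e->refcnt);
	return e;
}

static void example_encap_put(struct example_encap_info *e)
{
	if (e && atomic_dec_and_test(&e->refcnt))
		kfree(e);
}

/* fib_nh would then carry a "struct example_encap_info *nh_encap" pointer
 * instead of an offset/length pair, at the cost of one pointer per nexthop,
 * and the parsed state could be shared between nexthops.
 */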

>> @@ -1403,6 +1418,15 @@ static void rt_set_nexthop(struct rtable *rt,
>> __be32 daddr,
>>   #ifdef CONFIG_IP_ROUTE_CLASSID
>>           rt->dst.tclassid = nh->nh_tclassid;
>>   #endif
>> +
>> +        nh_encap = fib_get_nh_encap(fi, nh);
>> +        if (unlikely(nh_encap)) {
>> +            rt->rt_encap = kmemdup(nh_encap, nh->nh_encap_len,
>> +                           GFP_KERNEL);
>> +            if (rt->rt_encap)
>> +                rt->rt_encap_len = nh->nh_encap_len;
>> +        }
>> +
>
> And..., you could make the rtable point to the same encap info.

Ack.
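(A sketch of what that sharing could look like, building on the refcounted
state sketched above; the field names are invented.)

	/* In rt_set_nexthop(), share instead of kmemdup'ing: */
	rt->rt_encap = example_encap_get(nh->nh_encap);	/* nh_encap: hypothetical pointer */

	/* ...balanced in ipv4_dst_destroy(): */
	example_encap_put(rt->rt_encap);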

Thanks,
Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-01 16:46 [RFC net-next 0/3] IP imposition of per-nh MPLS encap Robert Shearman
                   ` (4 preceding siblings ...)
  2015-06-02 15:31 ` roopa
@ 2015-06-02 18:11 ` Eric W. Biederman
  2015-06-02 20:57   ` Robert Shearman
  5 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-06-02 18:11 UTC (permalink / raw)
  To: Robert Shearman; +Cc: netdev, roopa, Thomas Graf

Robert Shearman <rshearma@brocade.com> writes:

> In order to be able to function as a Label Edge Router in an MPLS
> network, it is necessary to be able to take IP packets and impose an
> MPLS encap and forward them out. The traditional approach of setting
> up an interface for each "tunnel" endpoint doesn't scale for the
> common MPLS use-cases where each IP route tends to be assigned a
> different label as encap.
>
> The solution suggested here for further discussion is to provide the
> facility to define encap data on a per-nexthop basis using a new
> netlink attribue, RTA_ENCAP, which would be opaque to the IPv4/IPv6
> forwarding code, but interpreted by the virtual interface assigned to
> the nexthop.
>
> A new ipmpls interface type is defined to show the use of this
> facility to allow IP packets to be imposed with an MPLS
> encap. However, the facility is designed to be general enough to be
> used by any encapsulation/tunneling mechanism that has similar
> requirements of high-scale, high-variation-of-encap.

I am still digging into the details but adding a new network device to
make this possible is very undesirable.

It is a pain point.  Those network devices get to be a major source of
memory consumption when there are 4K network namespaces in existence.

It is conceptually wrong.  The network device will never be used as an
ordinary network device.  All the network device gives you is the
ability to avoid creating an enumeration of different kinds of
encapsulation.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 1/3] net: infra for per-nexthop encap data
  2015-06-01 16:46 ` [RFC net-next 1/3] net: infra for per-nexthop encap data Robert Shearman
@ 2015-06-02 18:15   ` Eric W. Biederman
  0 siblings, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2015-06-02 18:15 UTC (permalink / raw)
  To: Robert Shearman; +Cc: netdev, roopa, Thomas Graf

Robert Shearman <rshearma@brocade.com> writes:

> Having to add a new interface to apply encap onto a packet is a
> mechanism that works well today, allowing the setup of the encap to be
> done separately from the routes out of them, meaning that routing
> protocols and other user-space apps don't need to do anything special
> to add routes out of a new type of interface. However, the overhead of
> creating an interface is high, especially in terms of
> memory. Therefore, the traditional method won't work very well for
> large numbers of routes applying encap where there is a low degree of
> sharing of the encap.
>
> The solution is to introduce a way of defining encap on a per-nexthop
> basis (i.e. per-route if only one nexthop) through the addition of a
> new netlink attribute, RTA_ENCAP. The semantics of this attribute is
> that the data is interpreted according to the output interface type
> (RTA_OIF) and is opaque to the normal forwarding path. The output
> interface doesn't have to be defined per-nexthop, but instead
> represents the way of encapsulating the packet. There could be as few
> as one per namespace, but more could be created, particularly if they
> are used to define parameters which are shared by a large number of
> routes. However, the split of what goes in the encap data and what
> might be specified via interface attributes is entirely up to the
> encap-type implementation.
>
> New rtnetlink operations are defined to assist with the management of
> this data:
> - parse_encap for parsing the attribute given through rtnl and either
>   sizing the in-memory version (if encap ptr is NULL) or filling in the
>   in-memory version.  RTA_ENCAP work for IPv4. This operation allows
>   the interface to reject invalid encap specified by user-space and the
>   sizing allows the kernel to have a different in memory implementation
>   to the netlink API (which might be optimised for extensibility rather
>   than speed of packet forwarding).
> - fill_encap for taking the in-memory version of the encap and filling
>   in an RTA_ENCAP attribute in a netlink message.
> - match_encap for comparing an in-memory version of encap with an
>   RTA_ENCAP version, returning 0 if matching or 1 if different.
>
> A new dst operation is also defined to allow encap-type interfaces to
> retrieve the encap data from their xmit functions and use it for
> encapsulating the packet and for further forwarding.

This bit of infrastructure should be more like rtnl_register, where we
register an encap type and the operations to go with it.

Just like rtnl_register we can have a small array with the operations for
each supported encapsulation.
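(A minimal sketch of such a registration-style table, with invented names
and an assumed numeric encap type as the key; not code from the series.)

struct example_encap_ops {
	int (*parse)(const struct nlattr *nla, void *state); /* NULL state = size only */
	int (*fill)(struct sk_buff *skb, int len, const void *state);
	int (*match)(const struct nlattr *nla, int len, const void *state);
};

#define EXAMPLE_ENCAP_MAX	7
static const struct example_encap_ops *example_encap_ops_tbl[EXAMPLE_ENCAP_MAX + 1];

int example_encap_register(unsigned int type, const struct example_encap_ops *ops)
{
	if (type > EXAMPLE_ENCAP_MAX)
		return -ERANGE;

	rtnl_lock();
	if (example_encap_ops_tbl[type]) {
		rtnl_unlock();
		return -EEXIST;
	}
	example_encap_ops_tbl[type] = ops;
	rtnl_unlock();
	return 0;
}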

Eric

> Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
> Signed-off-by: Robert Shearman <rshearma@brocade.com>
> ---
>  include/linux/rtnetlink.h      |  7 +++++++
>  include/net/dst.h              | 11 +++++++++++
>  include/net/dst_ops.h          |  2 ++
>  include/net/rtnetlink.h        | 11 +++++++++++
>  include/uapi/linux/rtnetlink.h |  1 +
>  net/core/rtnetlink.c           | 36 ++++++++++++++++++++++++++++++++++++
>  6 files changed, 68 insertions(+)
>
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index a2324fb45cf4..470d822ddd61 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -22,6 +22,13 @@ struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct net_device *dev,
>  void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev,
>  		       gfp_t flags);
>  
> +int rtnl_parse_encap(const struct net_device *dev, const struct nlattr *nla,
> +		     void *encap);
> +int rtnl_fill_encap(const struct net_device *dev, struct sk_buff *skb,
> +		    int encap_len, const void *encap);
> +int rtnl_match_encap(const struct net_device *dev, const struct nlattr *nla,
> +		     int encap_len, const void *encap);
> +
>  
>  /* RTNL is used as a global lock for all changes to network configuration  */
>  extern void rtnl_lock(void);
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 2bc73f8a00a9..df0e6ec18eca 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -506,4 +506,15 @@ static inline struct xfrm_state *dst_xfrm(const struct dst_entry *dst)
>  }
>  #endif
>  
> +/* Get encap data for destination */
> +static inline int dst_get_encap(struct sk_buff *skb, const void **encap)
> +{
> +	const struct dst_entry *dst = skb_dst(skb);
> +
> +	if (!dst || !dst->ops->get_encap)
> +		return 0;
> +
> +	return dst->ops->get_encap(dst, encap);
> +}
> +
>  #endif /* _NET_DST_H */
> diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
> index d64253914a6a..97f48cf8ef7d 100644
> --- a/include/net/dst_ops.h
> +++ b/include/net/dst_ops.h
> @@ -32,6 +32,8 @@ struct dst_ops {
>  	struct neighbour *	(*neigh_lookup)(const struct dst_entry *dst,
>  						struct sk_buff *skb,
>  						const void *daddr);
> +	int			(*get_encap)(const struct dst_entry *dst,
> +					     const void **encap);
>  
>  	struct kmem_cache	*kmem_cachep;
>  
> diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
> index 343d922d15c2..3121ade24957 100644
> --- a/include/net/rtnetlink.h
> +++ b/include/net/rtnetlink.h
> @@ -95,6 +95,17 @@ struct rtnl_link_ops {
>  						   const struct net_device *dev,
>  						   const struct net_device *slave_dev);
>  	struct net		*(*get_link_net)(const struct net_device *dev);
> +	int			(*parse_encap)(const struct net_device *dev,
> +					       const struct nlattr *nla,
> +					       void *encap);
> +	int			(*fill_encap)(const struct net_device *dev,
> +					      struct sk_buff *skb,
> +					      int encap_len,
> +					      const void *encap);
> +	int			(*match_encap)(const struct net_device *dev,
> +					       const struct nlattr *nla,
> +					       int encap_len,
> +					       const void *encap);
>  };
>  
>  int __rtnl_link_register(struct rtnl_link_ops *ops);
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 17fb02f488da..ed4c797503f2 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -308,6 +308,7 @@ enum rtattr_type_t {
>  	RTA_VIA,
>  	RTA_NEWDST,
>  	RTA_PREF,
> +	RTA_ENCAP,
>  	__RTA_MAX
>  };
>  
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 077b6d280371..3b4e40a82799 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1441,6 +1441,42 @@ static int validate_linkmsg(struct net_device *dev, struct nlattr *tb[])
>  	return 0;
>  }
>  
> +int rtnl_parse_encap(const struct net_device *dev, const struct nlattr *nla,
> +		     void *encap)
> +{
> +	const struct rtnl_link_ops *ops = dev->rtnl_link_ops;
> +
> +	if (!ops->parse_encap)
> +		return -EINVAL;
> +
> +	return ops->parse_encap(dev, nla, encap);
> +}
> +EXPORT_SYMBOL(rtnl_parse_encap);
> +
> +int rtnl_fill_encap(const struct net_device *dev, struct sk_buff *skb,
> +		    int encap_len, const void *encap)
> +{
> +	const struct rtnl_link_ops *ops = dev->rtnl_link_ops;
> +
> +	if (!ops->fill_encap)
> +		return -EINVAL;
> +
> +	return ops->fill_encap(dev, skb, encap_len, encap);
> +}
> +EXPORT_SYMBOL(rtnl_fill_encap);
> +
> +int rtnl_match_encap(const struct net_device *dev, const struct nlattr *nla,
> +		     int encap_len, const void *encap)
> +{
> +	const struct rtnl_link_ops *ops = dev->rtnl_link_ops;
> +
> +	if (!ops->match_encap)
> +		return -EINVAL;
> +
> +	return ops->match_encap(dev, nla, encap_len, encap);
> +}
> +EXPORT_SYMBOL(rtnl_match_encap);
> +
>  static int do_setvfinfo(struct net_device *dev, struct nlattr *attr)
>  {
>  	int rem, err = -EINVAL;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-01 16:46 ` [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls Robert Shearman
  2015-06-02 16:15   ` roopa
@ 2015-06-02 18:26   ` Eric W. Biederman
  2015-06-02 21:37     ` Thomas Graf
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-06-02 18:26 UTC (permalink / raw)
  To: Robert Shearman; +Cc: netdev, roopa, Thomas Graf

Robert Shearman <rshearma@brocade.com> writes:

> Allow creating an mpls device for the purposes of encapsulating IP
> packets with:
>
>   ip link add type ipmpls
>
> This device defines its per-nexthop encapsulation data as a stack of
> labels, in the same format as for RTA_NEWDST. It uses the encap data
> which will have been stored in the IP route to encapsulate the packet
> with that stack of labels, with the last label corresponding to a
> local label that defines how the packet will be sent out. The device
> sends packets over loopback to the local MPLS forwarding logic which
> performs all of the work.
>
> Stats are implemented, although any error in the sending via the real
> interface will be handled by the main mpls forwarding code and so not
> accounted by the interface.

Eeek stats!  Lots of unnecessary overhead.  If stats were ok we could
have simply reduced the cost of struct net_device to the point where it
would not matter.

This is really a bad hack for not getting in and being able to set
dst_output the way the xfrm infrastructure does.

What we really want here is xfrm-lite.  By lite I mean the tunnel
selection criteria is simple enough that it fits into the normal
routing table instead of having to do weird flow based magic that
is rarely needed.

I believe what we want are the xfrm stacking of dst entries.
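(For illustration, roughly what that stacking looks like on the output
path, using the dst_entry child/output fields as they exist in this era;
a sketch only, not working code.)

static int example_stacked_output(struct sock *sk, struct sk_buff *skb)
{
	struct dst_entry *top = skb_dst(skb);
	struct dst_entry *lower = dst_clone(top->child);	/* set up when the stack was built */

	/* ... push whatever encapsulation 'top' describes onto skb ... */

	skb_dst_set(skb, lower);		/* step down to the real route */
	return lower->output(sk, skb);
}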

Eric


> This implementation is based on an alternative earlier implementation
> by Eric W. Biederman.
>
> Signed-off-by: Robert Shearman <rshearma@brocade.com>
> ---
>  include/uapi/linux/if_arp.h |   1 +
>  net/mpls/Kconfig            |   5 +
>  net/mpls/Makefile           |   1 +
>  net/mpls/af_mpls.c          |   2 +
>  net/mpls/ipmpls.c           | 284 ++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 293 insertions(+)
>  create mode 100644 net/mpls/ipmpls.c
>
> diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h
> index 4d024d75d64b..17d669fd1781 100644
> --- a/include/uapi/linux/if_arp.h
> +++ b/include/uapi/linux/if_arp.h
> @@ -88,6 +88,7 @@
>  #define ARPHRD_IEEE80211_RADIOTAP 803	/* IEEE 802.11 + radiotap header */
>  #define ARPHRD_IEEE802154	  804
>  #define ARPHRD_IEEE802154_MONITOR 805	/* IEEE 802.15.4 network monitor */
> +#define ARPHRD_MPLS	806		/* IP and IPv6 over MPLS tunnels */
>  
>  #define ARPHRD_PHONET	820		/* PhoNet media type		*/
>  #define ARPHRD_PHONET_PIPE 821		/* PhoNet pipe header		*/
> diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
> index 17bde799c854..5264da94733a 100644
> --- a/net/mpls/Kconfig
> +++ b/net/mpls/Kconfig
> @@ -27,4 +27,9 @@ config MPLS_ROUTING
>  	help
>  	 Add support for forwarding of mpls packets.
>  
> +config MPLS_IPTUNNEL
> +	tristate "MPLS: IP over MPLS tunnel support"
> +	help
> +	 A network device that encapsulates ip packets as mpls
> +
>  endif # MPLS
> diff --git a/net/mpls/Makefile b/net/mpls/Makefile
> index 65bbe68c72e6..3a93c14b23c5 100644
> --- a/net/mpls/Makefile
> +++ b/net/mpls/Makefile
> @@ -3,5 +3,6 @@
>  #
>  obj-$(CONFIG_NET_MPLS_GSO) += mpls_gso.o
>  obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o
> +obj-$(CONFIG_MPLS_IPTUNNEL) += ipmpls.o
>  
>  mpls_router-y := af_mpls.o
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 7b3f732269e4..68bdfbdddfaf 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -615,6 +615,7 @@ int nla_put_labels(struct sk_buff *skb, int attrtype,
>  
>  	return 0;
>  }
> +EXPORT_SYMBOL(nla_put_labels);
>  
>  int nla_get_labels(const struct nlattr *nla,
>  		   u32 max_labels, u32 *labels, u32 label[])
> @@ -660,6 +661,7 @@ int nla_get_labels(const struct nlattr *nla,
>  	*labels = nla_labels;
>  	return 0;
>  }
> +EXPORT_SYMBOL(nla_get_labels);
>  
>  static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
>  			       struct mpls_route_config *cfg)
> diff --git a/net/mpls/ipmpls.c b/net/mpls/ipmpls.c
> new file mode 100644
> index 000000000000..cf6894ae0c61
> --- /dev/null
> +++ b/net/mpls/ipmpls.c
> @@ -0,0 +1,284 @@
> +#include <linux/types.h>
> +#include <linux/netdevice.h>
> +#include <linux/if_vlan.h>
> +#include <linux/if_arp.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <linux/module.h>
> +#include <linux/mpls.h>
> +#include "internal.h"
> +
> +static LIST_HEAD(ipmpls_dev_list);
> +
> +#define MAX_NEW_LABELS 2
> +
> +struct ipmpls_dev_priv {
> +	struct net_device *out_dev;
> +	struct list_head list;
> +	struct net_device *dev;
> +};
> +
> +static netdev_tx_t ipmpls_dev_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct ipmpls_dev_priv *priv = netdev_priv(dev);
> +	struct net_device *out_dev = priv->out_dev;
> +	struct mpls_shim_hdr *hdr;
> +	bool bottom_of_stack = true;
> +	int len = skb->len;
> +	const void *encap;
> +	int num_labels;
> +	unsigned ttl;
> +	const u32 *labels;
> +	int ret;
> +	int i;
> +
> +	num_labels = dst_get_encap(skb, &encap) / 4;
> +	if (!num_labels)
> +		goto drop;
> +
> +	labels = encap;
> +
> +	/* Obtain the ttl */
> +	if (skb->protocol == htons(ETH_P_IP)) {
> +		ttl = ip_hdr(skb)->ttl;
> +	} else if (skb->protocol == htons(ETH_P_IPV6)) {
> +		ttl = ipv6_hdr(skb)->hop_limit;
> +	} else if (skb->protocol == htons(ETH_P_MPLS_UC)) {
> +		ttl = mpls_entry_decode(mpls_hdr(skb)).ttl;
> +		bottom_of_stack = false;
> +	} else {
> +		goto drop;
> +	}
> +
> +	/* Now that the encap has been retrieved, there's no longer
> +	 * any need to keep the dst around so clear it out.
> +	 */
> +	skb_dst_drop(skb);
> +	skb_orphan(skb);
> +
> +	skb->inner_protocol = skb->protocol;
> +	skb->inner_network_header = skb->network_header;
> +
> +	skb_push(skb, num_labels * sizeof(*hdr));
> +	skb_reset_network_header(skb);
> +	hdr = mpls_hdr(skb);
> +
> +	for (i = num_labels - 1; i >= 0; i--) {
> +		hdr[i] = mpls_entry_encode(labels[i], ttl, 0, bottom_of_stack);
> +		bottom_of_stack = false;
> +	}
> +
> +	skb->dev = out_dev;
> +	skb->protocol = htons(ETH_P_MPLS_UC);
> +
> +	ret = dev_hard_header(skb, out_dev, ETH_P_MPLS_UC,
> +			      out_dev->dev_addr, NULL, len);
> +	if (ret >= 0)
> +		ret = dev_queue_xmit(skb);
> +	if (ret)
> +		goto drop;
> +
> +	dev->stats.tx_packets++;
> +	dev->stats.tx_bytes += len;
> +
> +	return 0;
> +
> +drop:
> +	dev->stats.tx_dropped++;
> +	kfree_skb(skb);
> +	return NETDEV_TX_OK;
> +}
> +
> +static int ipmpls_dev_init(struct net_device *dev)
> +{
> +	struct ipmpls_dev_priv *priv = netdev_priv(dev);
> +
> +	list_add_tail(&priv->list, &ipmpls_dev_list);
> +
> +	return 0;
> +}
> +
> +static void ipmpls_dev_uninit(struct net_device *dev)
> +{
> +	struct ipmpls_dev_priv *priv = netdev_priv(dev);
> +
> +	list_del_init(&priv->list);
> +}
> +
> +static void ipmpls_dev_free(struct net_device *dev)
> +{
> +	free_netdev(dev);
> +}
> +
> +static const struct net_device_ops ipmpls_netdev_ops = {
> +	.ndo_init		= ipmpls_dev_init,
> +	.ndo_start_xmit		= ipmpls_dev_xmit,
> +	.ndo_uninit		= ipmpls_dev_uninit,
> +};
> +
> +#define IPMPLS_FEATURES (NETIF_F_SG |			\
> +			 NETIF_F_FRAGLIST |		\
> +			 NETIF_F_HIGHDMA |		\
> +			 NETIF_F_VLAN_CHALLENGED)
> +
> +static void ipmpls_dev_setup(struct net_device *dev)
> +{
> +	dev->netdev_ops		= &ipmpls_netdev_ops;
> +
> +	dev->type		= ARPHRD_MPLS;
> +	dev->flags		= IFF_NOARP;
> +	netif_keep_dst(dev);
> +	dev->addr_len		= 0;
> +	dev->features		|= NETIF_F_LLTX;
> +	dev->features		|= IPMPLS_FEATURES;
> +	dev->hw_features	|= IPMPLS_FEATURES;
> +	dev->vlan_features	= 0;
> +
> +	dev->destructor = ipmpls_dev_free;
> +}
> +
> +static int ipmpls_dev_validate(struct nlattr *tb[], struct nlattr *data[])
> +{
> +	return 0;
> +}
> +
> +static int ipmpls_dev_newlink(struct net *src_net, struct net_device *dev,
> +			      struct nlattr *tb[], struct nlattr *data[])
> +{
> +	struct ipmpls_dev_priv *priv = netdev_priv(dev);
> +
> +	priv->out_dev = src_net->loopback_dev;
> +	priv->dev = dev;
> +
> +	dev->hard_header_len =
> +		priv->out_dev->hard_header_len +
> +		sizeof(struct mpls_shim_hdr) * MAX_NEW_LABELS;
> +
> +	return register_netdevice(dev);
> +}
> +
> +static void ipmpls_dev_dellink(struct net_device *dev, struct list_head *head)
> +{
> +	unregister_netdevice_queue(dev, head);
> +}
> +
> +static int ipmpls_dev_parse_encap(const struct net_device *dev,
> +				  const struct nlattr *nla,
> +				  void *encap)
> +{
> +	u32 labels;
> +
> +	if (nla_len(nla) / 4 > MAX_NEW_LABELS)
> +		return -EINVAL;
> +
> +	if (encap && nla_get_labels(nla, MAX_NEW_LABELS, &labels, encap))
> +		return -EINVAL;
> +
> +	/* Stored encap size is the same as the rtnl encap len */
> +	return nla_len(nla);
> +}
> +
> +static int ipmpls_dev_fill_encap(const struct net_device *dev,
> +				 struct sk_buff *skb, int encap_len,
> +				 const void *encap)
> +{
> +	return nla_put_labels(skb, RTA_ENCAP, encap_len / 4, encap);
> +}
> +
> +static int ipmpls_dev_match_encap(const struct net_device *dev,
> +				  const struct nlattr *nla, int encap_len,
> +				  const void *encap)
> +{
> +	unsigned nla_labels;
> +	struct mpls_shim_hdr *nla_label;
> +	const u32 *stored_labels = encap;
> +	int i;
> +
> +	/* Stored encap size is the same as the rtnl encap len */
> +	if (nla_len(nla) != encap_len)
> +		return 1;
> +
> +	nla_labels = nla_len(nla) / 4;
> +	nla_label = nla_data(nla);
> +
> +	for (i = 0; i < nla_labels; i++) {
> +		struct mpls_entry_decoded dec;
> +
> +		dec = mpls_entry_decode(nla_label + i);
> +
> +		if (stored_labels[i] != dec.label)
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static struct rtnl_link_ops ipmpls_ops = {
> +	.kind		= "ipmpls",
> +	.priv_size	= sizeof(struct ipmpls_dev_priv),
> +	.setup		= ipmpls_dev_setup,
> +	.validate	= ipmpls_dev_validate,
> +	.newlink	= ipmpls_dev_newlink,
> +	.dellink	= ipmpls_dev_dellink,
> +	.parse_encap	= ipmpls_dev_parse_encap,
> +	.fill_encap	= ipmpls_dev_fill_encap,
> +	.match_encap	= ipmpls_dev_match_encap,
> +};
> +
> +static int ipmpls_dev_notify(struct notifier_block *this, unsigned long event,
> +			     void *ptr)
> +{
> +	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +
> +	if (event == NETDEV_UNREGISTER) {
> +		struct ipmpls_dev_priv *priv, *priv2;
> +		LIST_HEAD(list_kill);
> +
> +		/* Ignore netns device moves */
> +		if (dev->reg_state != NETREG_UNREGISTERING)
> +			goto done;
> +
> +		list_for_each_entry_safe(priv, priv2, &ipmpls_dev_list, list) {
> +			if (priv->out_dev != dev)
> +				continue;
> +
> +			ipmpls_dev_dellink(priv->dev, &list_kill);
> +		}
> +		unregister_netdevice_many(&list_kill);
> +	}
> +done:
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block ipmpls_dev_notifier = {
> +	.notifier_call = ipmpls_dev_notify,
> +};
> +
> +static int __init ipmpls_init(void)
> +{
> +	int err;
> +
> +	err = register_netdevice_notifier(&ipmpls_dev_notifier);
> +	if (err)
> +		goto out;
> +
> +	err = rtnl_link_register(&ipmpls_ops);
> +	if (err)
> +		goto out_unregister_notifier;
> +out:
> +	return err;
> +out_unregister_notifier:
> +	unregister_netdevice_notifier(&ipmpls_dev_notifier);
> +	goto out;
> +}
> +module_init(ipmpls_init);
> +
> +static void __exit ipmpls_exit(void)
> +{
> +	rtnl_link_unregister(&ipmpls_ops);
> +	unregister_netdevice_notifier(&ipmpls_dev_notifier);
> +}
> +module_exit(ipmpls_exit);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_ALIAS_RTNL_LINK("ipmpls");

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02 15:31 ` roopa
@ 2015-06-02 18:30   ` Eric W. Biederman
  2015-06-02 18:39     ` roopa
  0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-06-02 18:30 UTC (permalink / raw)
  To: roopa; +Cc: Robert Shearman, netdev, Thomas Graf, Vivek Venkatraman

roopa <roopa@cumulusnetworks.com> writes:

> On 6/1/15, 9:46 AM, Robert Shearman wrote:
>> In order to be able to function as a Label Edge Router in an MPLS
>> network, it is necessary to be able to take IP packets and impose an
>> MPLS encap and forward them out. The traditional approach of setting
>> up an interface for each "tunnel" endpoint doesn't scale for the
>> common MPLS use-cases where each IP route tends to be assigned a
>> different label as encap.
>>
>> The solution suggested here for further discussion is to provide the
>> facility to define encap data on a per-nexthop basis using a new
>> netlink attribue, RTA_ENCAP, which would be opaque to the IPv4/IPv6
>> forwarding code, but interpreted by the virtual interface assigned to
>> the nexthop.
>>
>> A new ipmpls interface type is defined to show the use of this
>> facility to allow IP packets to be imposed with an MPLS
>> encap. However, the facility is designed to be general enough to be
>> used by any encapsulation/tunneling mechanism that has similar
>> requirements of high-scale, high-variation-of-encap.
>>
>> RFC because:
>>   - IPv6 side not implemented
>>   - struct rtable shouldn't be bloated by pointer+uint
>>   - Hasn't been thoroughly tested yet
>>
>> Robert Shearman (3):
>>    net: infra for per-nexthop encap data
>>    ipv4: storing and retrieval of per-nexthop encap
>>    mpls: new ipmpls device for encapsulating IP packets as mpls
>>
>>
> Glad to see these patches!.
> I have a similar series i have been working on...but no netdevice.
> A set of ops similar to iptun_encaps and I store encap data in fib_nh
> and in ip_route_output_slow i point the dst.output to the output func provided
> by one of the encap ops.
>
> I see the advantages of using a netdevice...and i see this align with patches
> from thomas.

roopa I think I would prefer your patches.  I think using a netdevice
the way Robert is proposing is quite possibly a mess, from a scalability
standpoint.
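(For illustration, a sketch of that alternative as quoted above, with
invented names: an ops array along the lines of iptun_encaps, parsed
encap state hanging off fib_nh, and the route's output hook switched
over when encap is present.)

#define EXAMPLE_NH_ENCAP_MAX	7

struct example_nh_encap_ops {
	int (*build)(const struct nlattr *nla, void **state);
	int (*output)(struct sock *sk, struct sk_buff *skb);
};

static const struct example_nh_encap_ops *example_nh_encap_ops[EXAMPLE_NH_ENCAP_MAX + 1];

/* Called from the route setup path once the nexthop is known: */
static void example_attach_encap_output(struct rtable *rt, const struct fib_nh *nh)
{
	const struct example_nh_encap_ops *ops;

	if (!nh->nh_encap_state)				/* hypothetical field */
		return;
	ops = example_nh_encap_ops[nh->nh_encap_type];		/* hypothetical field */
	if (ops)
		rt->dst.output = ops->output;
}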

Do you mean ip_route_input_slow?  There is no ip_route_output_slow.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02 18:30   ` Eric W. Biederman
@ 2015-06-02 18:39     ` roopa
  0 siblings, 0 replies; 32+ messages in thread
From: roopa @ 2015-06-02 18:39 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Robert Shearman, netdev, Thomas Graf, Vivek Venkatraman

On 6/2/15, 11:30 AM, Eric W. Biederman wrote:
> roopa <roopa@cumulusnetworks.com> writes:
>
>> On 6/1/15, 9:46 AM, Robert Shearman wrote:
>>> In order to be able to function as a Label Edge Router in an MPLS
>>> network, it is necessary to be able to take IP packets and impose an
>>> MPLS encap and forward them out. The traditional approach of setting
>>> up an interface for each "tunnel" endpoint doesn't scale for the
>>> common MPLS use-cases where each IP route tends to be assigned a
>>> different label as encap.
>>>
>>> The solution suggested here for further discussion is to provide the
>>> facility to define encap data on a per-nexthop basis using a new
>>> netlink attribue, RTA_ENCAP, which would be opaque to the IPv4/IPv6
>>> forwarding code, but interpreted by the virtual interface assigned to
>>> the nexthop.
>>>
>>> A new ipmpls interface type is defined to show the use of this
>>> facility to allow IP packets to be imposed with an MPLS
>>> encap. However, the facility is designed to be general enough to be
>>> used by any encapsulation/tunneling mechanism that has similar
>>> requirements of high-scale, high-variation-of-encap.
>>>
>>> RFC because:
>>>    - IPv6 side not implemented
>>>    - struct rtable shouldn't be bloated by pointer+uint
>>>    - Hasn't been thoroughly tested yet
>>>
>>> Robert Shearman (3):
>>>     net: infra for per-nexthop encap data
>>>     ipv4: storing and retrieval of per-nexthop encap
>>>     mpls: new ipmpls device for encapsulating IP packets as mpls
>>>
>>>
>> Glad to see these patches!.
>> I have a similar series i have been working on...but no netdevice.
>> A set of ops similar to iptun_encaps and I store encap data in fib_nh
>> and in ip_route_output_slow i point the dst.output to the output func provided
>> by one of the encap ops.
>>
>> I see the advantages of using a netdevice...and i see this align with patches
>> from thomas.
> roopa I think I would prefer your patches.  I think using a netdevice
> the way Robert is proposing is quite possibly a mess, from a scalability
> standpoint.
>
> Do you mean ip_route_input_slow?  There is no ip_route_output_slow.
yes, correct, sorry. I mean ip_route_input_slow. They need work but i 
will try to get them out today to add more context to the discussion.

thanks,
Roopa

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-02 16:33     ` Robert Shearman
@ 2015-06-02 18:57       ` roopa
  2015-06-02 21:06         ` Robert Shearman
  0 siblings, 1 reply; 32+ messages in thread
From: roopa @ 2015-06-02 18:57 UTC (permalink / raw)
  To: Robert Shearman
  Cc: netdev, Eric W. Biederman, Thomas Graf, Dinesh Dutt, Vivek Venkatraman

On 6/2/15, 9:33 AM, Robert Shearman wrote:
> On 02/06/15 17:15, roopa wrote:
>> On 6/1/15, 9:46 AM, Robert Shearman wrote:
>>> Allow creating an mpls device for the purposes of encapsulating IP
>>> packets with:
>>>
>>>    ip link add type ipmpls
>>>
>>> This device defines its per-nexthop encapsulation data as a stack of
>>> labels, in the same format as for RTA_NEWDST. It uses the encap data
>>> which will have been stored in the IP route to encapsulate the packet
>>> with that stack of labels, with the last label corresponding to a
>>> local label that defines how the packet will be sent out. The device
>>> sends packets over loopback to the local MPLS forwarding logic which
>>> performs all of the work.
>>>
>>>
>> Maybe a silly question, but when you loop the packet back, what does the
>> local MPLS forwarding logic
>> lookup with ? It probably assumes there is a mpls route with that label
>> and nexthop.
>> Will this need any internal labels (thinking same label stack different
>> tunnel device etc) ?
>
> Yes, it requires that local/internal labels have been allocated and 
> label routes installed in the label table for them.
This is our only concern.
>
> It is entirely possible to put the outgoing interface into the encap 
> data to avoid having to allocate extra labels, 
> but I did it this way in order to support PIC Core for MPLS-VPN routes.

hmm... is a netdevice a must in this case? Can you please elaborate on
this?

>
> Note: I have two extra patches which avoid using the loopback device 
> (which causes the TTL to end up being one less than it should on 
> output), but I haven't posted them here because they were dependent on 
> other mpls changes in my tree.

ok, thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02 18:11 ` Eric W. Biederman
@ 2015-06-02 20:57   ` Robert Shearman
  2015-06-02 21:10     ` Eric W. Biederman
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Shearman @ 2015-06-02 20:57 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, roopa, Thomas Graf

On 02/06/15 19:11, Eric W. Biederman wrote:
> Robert Shearman <rshearma@brocade.com> writes:
>
>> In order to be able to function as a Label Edge Router in an MPLS
>> network, it is necessary to be able to take IP packets and impose an
>> MPLS encap and forward them out. The traditional approach of setting
>> up an interface for each "tunnel" endpoint doesn't scale for the
>> common MPLS use-cases where each IP route tends to be assigned a
>> different label as encap.
>>
>> The solution suggested here for further discussion is to provide the
>> facility to define encap data on a per-nexthop basis using a new
>> netlink attribue, RTA_ENCAP, which would be opaque to the IPv4/IPv6
>> forwarding code, but interpreted by the virtual interface assigned to
>> the nexthop.
>>
>> A new ipmpls interface type is defined to show the use of this
>> facility to allow IP packets to be imposed with an MPLS
>> encap. However, the facility is designed to be general enough to be
>> used by any encapsulation/tunneling mechanism that has similar
>> requirements of high-scale, high-variation-of-encap.
>
> I am still digging into the details but adding a new network device to
> make this possible is very undesirable.
>
> It is a pain point.  Those network devices get to be a major source of
> memory consumption when there are 4K network namespaces in existence.
>
> It is conceptually wrong.  The network device will never be used as an
> ordinary network device.  All the network device gives you is the
> ability to avoid creating an enumeration of different kinds of
> encapsulation.

This isn't true. The network device also gives some of the things you 
take for granted. Things like fragmentation through specifying the mtu 
on the shared tunnel device, being able to specify rules using the 
shared tunnel output device, IP stats, and the ability to specify a
different destination namespace.

Thanks,
Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-02 18:57       ` roopa
@ 2015-06-02 21:06         ` Robert Shearman
  2015-06-03 18:43           ` Vivek Venkatraman
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Shearman @ 2015-06-02 21:06 UTC (permalink / raw)
  To: roopa
  Cc: netdev, Eric W. Biederman, Thomas Graf, Dinesh Dutt, Vivek Venkatraman

On 02/06/15 19:57, roopa wrote:
> On 6/2/15, 9:33 AM, Robert Shearman wrote:
>> On 02/06/15 17:15, roopa wrote:
>>> On 6/1/15, 9:46 AM, Robert Shearman wrote:
>>>> Allow creating an mpls device for the purposes of encapsulating IP
>>>> packets with:
>>>>
>>>>    ip link add type ipmpls
>>>>
>>>> This device defines its per-nexthop encapsulation data as a stack of
>>>> labels, in the same format as for RTA_NEWDST. It uses the encap data
>>>> which will have been stored in the IP route to encapsulate the packet
>>>> with that stack of labels, with the last label corresponding to a
>>>> local label that defines how the packet will be sent out. The device
>>>> sends packets over loopback to the local MPLS forwarding logic which
>>>> performs all of the work.
>>>>
>>>>
>>> Maybe a silly question, but when you loop the packet back, what does the
>>> local MPLS forwarding logic
>>> lookup with ? It probably assumes there is a mpls route with that label
>>> and nexthop.
>>> Will this need any internal labels (thinking same label stack different
>>> tunnel device etc) ?
>>
>> Yes, it requires that local/internal labels have been allocated and
>> label routes installed in the label table for them.
> This is our only concern.
>>
>> It is entirely possible to put the outgoing interface into the encap
>> data to avoid having to allocate extra labels, but I did it this way
>> in order to support PIC Core for MPLS-VPN routes.
>
> hmm... is a netdevice a must in this case? Can you please elaborate on
> this?

Yes, the ipmpls device would still be used to perform the encapsulation, 
transitioning from the IP forwarding path to the MPLS forwarding path.

Thanks,
Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02 20:57   ` Robert Shearman
@ 2015-06-02 21:10     ` Eric W. Biederman
  2015-06-02 22:15       ` Robert Shearman
  0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-06-02 21:10 UTC (permalink / raw)
  To: Robert Shearman; +Cc: netdev, roopa, Thomas Graf

Robert Shearman <rshearma@brocade.com> writes:

> On 02/06/15 19:11, Eric W. Biederman wrote:
>> Robert Shearman <rshearma@brocade.com> writes:
>>
>>> In order to be able to function as a Label Edge Router in an MPLS
>>> network, it is necessary to be able to take IP packets and impose an
>>> MPLS encap and forward them out. The traditional approach of setting
>>> up an interface for each "tunnel" endpoint doesn't scale for the
>>> common MPLS use-cases where each IP route tends to be assigned a
>>> different label as encap.
>>>
>>> The solution suggested here for further discussion is to provide the
>>> facility to define encap data on a per-nexthop basis using a new
>>> netlink attribue, RTA_ENCAP, which would be opaque to the IPv4/IPv6
>>> forwarding code, but interpreted by the virtual interface assigned to
>>> the nexthop.
>>>
>>> A new ipmpls interface type is defined to show the use of this
>>> facility to allow IP packets to be imposed with an MPLS
>>> encap. However, the facility is designed to be general enough to be
>>> used by any encapsulation/tunneling mechanism that has similar
>>> requirements of high-scale, high-variation-of-encap.
>>
>> I am still digging into the details but adding a new network device to
>> make this possible is very undesirable.
>>
>> It is a pain point.  Those network devices get to be a major source of
>> memory consumption when there are 4K network namespaces in existence.
>>
>> It is conceptually wrong.  The network device will never be used as an
>> ordinary network device.  All the network device gives you is the
>> ability to avoid creating an enumeration of different kinds of
>> encapsulation.
>
> This isn't true. The network device also gives some of the things you
> take for granted. Things like fragmentation through specifying the mtu
> on the shared tunnel device, being able to specify rules using the
> shared tunnel output device, IP stats, and the ability to specify a
> different destination namespace.

Granted you get a few more things.  It is still conceptually wrong as
the network device will never be used as an ordinary network device.

Fragmentation is already silly because we are talking about multiple
tunnels with different properties.  You need per-route mtu to handle
that case.
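(For what it's worth, per-route mtu is already expressible through the
route metrics such as RTAX_MTU, so a light-weight tunnel output path
could check it along these lines; sketch only, invented name.)

static bool example_fits_route_mtu(const struct sk_buff *skb, unsigned int encap_len)
{
	unsigned int mtu = dst_mtu(skb_dst(skb));	/* honours RTAX_MTU, else falls back to the device */

	return skb->len + encap_len <= mtu;
}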

Further I am not saying you don't need an output device (which is what
is needed to specify a different destination namespace) I am saying that
having a funny mpls device is wrong as far as I can see.  Certainly it
is a lot of bloody unnecessary overhead.

If we are going to design for maximum scaling (and 1 million+ routes
sounds like maximum scaling), we should see how far we can go without
dragging in the horrible heaviness of additional network devices.  35K a
piece last I measured it.  Just a small handful of them are already
scaling issues for network namespaces.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-02 18:26   ` Eric W. Biederman
@ 2015-06-02 21:37     ` Thomas Graf
  2015-06-02 22:48       ` Eric W. Biederman
  2015-06-02 23:23       ` Eric W. Biederman
  0 siblings, 2 replies; 32+ messages in thread
From: Thomas Graf @ 2015-06-02 21:37 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Robert Shearman, netdev, roopa

On 06/02/15 at 01:26pm, Eric W. Biederman wrote:
> What we really want here is xfrm-lite.  By lite I mean the tunnel
> selection criteria is simple enough that it fits into the normal
> routing table instead of having to do weird flow based magic that
> is rarely needed.
> 
> I believe what we want are the xfrm stacking of dst entries.

I assume you are referring to reusing the selector and stacked
dst. I considered that for the transmit side.

Can you elaborate on this some more? How would this look like
for the specific case of VXLAN? Any thoughts on the receive
side? You also mention that you dislike the net_device approach.
What do you suggest instead? The encapsulation is often postponed
to after the packet is fully constructed. Where should it get
hooked into?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02 13:28   ` Robert Shearman
@ 2015-06-02 21:43     ` Thomas Graf
  2015-06-03 13:30       ` Robert Shearman
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Graf @ 2015-06-02 21:43 UTC (permalink / raw)
  To: Robert Shearman; +Cc: netdev, Eric W. Biederman, roopa

On 06/02/15 at 02:28pm, Robert Shearman wrote:
> Nesting attributes inside the RTA_ENCAP blob should be supported by the
> patch series today. Something like this:

Sure. I'm not seeing such a construct for the MPLS case yet.

I'm happy to rebase my patches on top of your nexthop implementation.
It is definitely superior. Are you maintaining a git tree somewhere?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02 21:10     ` Eric W. Biederman
@ 2015-06-02 22:15       ` Robert Shearman
  2015-06-02 22:58         ` Eric W. Biederman
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Shearman @ 2015-06-02 22:15 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, roopa, Thomas Graf

On 02/06/15 22:10, Eric W. Biederman wrote:
> Robert Shearman <rshearma@brocade.com> writes:
>
>> On 02/06/15 19:11, Eric W. Biederman wrote:
>>> Robert Shearman <rshearma@brocade.com> writes:
>>>
>>>> In order to be able to function as a Label Edge Router in an MPLS
>>>> network, it is necessary to be able to take IP packets and impose an
>>>> MPLS encap and forward them out. The traditional approach of setting
>>>> up an interface for each "tunnel" endpoint doesn't scale for the
>>>> common MPLS use-cases where each IP route tends to be assigned a
>>>> different label as encap.
>>>>
>>>> The solution suggested here for further discussion is to provide the
>>>> facility to define encap data on a per-nexthop basis using a new
>>>> netlink attribue, RTA_ENCAP, which would be opaque to the IPv4/IPv6
>>>> forwarding code, but interpreted by the virtual interface assigned to
>>>> the nexthop.
>>>>
>>>> A new ipmpls interface type is defined to show the use of this
>>>> facility to allow IP packets to be imposed with an MPLS
>>>> encap. However, the facility is designed to be general enough to be
>>>> used by any encapsulation/tunneling mechanism that has similar
>>>> requirements of high-scale, high-variation-of-encap.
>>>
>>> I am still digging into the details but adding a new network device to
>>> make this possible is very undesirable.
>>>
>>> It is a pain point.  Those network devices get to be a major source of
>>> memory consumption when there are 4K network namespaces in existence.
>>>
>>> It is conceptually wrong.  The network device will never be used as an
>>> ordinary network device.  All the network device gives you is the
>>> ability to avoid creating an enumeration of different kinds of
>>> encapsulation.
>>
>> This isn't true. The network device also gives some of the things you
>> take for granted. Things like fragmentation through specifying the mtu
>> on the shared tunnel device, being able to specify rules using the
>> shared tunnel output device, IP stats, and the ability to specify a
>> different destination namespace.
>
> Granted you get a few more things.  It is still conceptually wrong as
>> the network device will never be used as an ordinary network device.
>
> Fragmentation is already silly because we are talking about multiple
> tunnels with different properties.  You need per-route mtu to handle
> that case.

It's unlikely you'll have a huge variation in the mtus across routes, 
unless you're running in an ISP environment. In the example uses we've 
got in hand, it's highly likely there'll only be a handful of different
mtus, if that.

> Further I am not saying you don't need an output device (which is what
> is needed to specify a different destination namespace) I am saying that
> having a funny mpls device is wrong as far as I can see.  Certainly it
> is a lot of bloody unnecessary overhead.
>
>> If we are going to design for maximum scaling (and 1 million+ routes
>> sounds like maximum scaling), we should see how far we can go without
> dragging in the horrible heaviness of additional network devices.  35K a
> piece last I measured it.  Just a small handful of them are already
> scaling issues for network namespaces.

For the ipmpls interface I've implemented here, you only need one per 
namespace. You could argue the same for the veth interfaces which would 
be much more commonly used in network namespaces.

BTW, maybe I've missed something, or maybe netdevs have gone on a diet, 
but I count the cost of creating a basic interface at ~2700 bytes on x86_64:
sizeof(struct net_device) /* 2112 */ + 1 * sizeof(struct netdev_queue)
/* 384 */ + 1 * sizeof(struct netdev_rx_queue) /* 128 */ + sizeof(struct
netdev_hw_addr) /* 80 */ + sizeof(int) * nr_poss_cpus /* 4 * n */,
i.e. 2704 bytes plus 4 per possible CPU.

Thanks,
Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-02 21:37     ` Thomas Graf
@ 2015-06-02 22:48       ` Eric W. Biederman
  2015-06-02 23:23       ` Eric W. Biederman
  1 sibling, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2015-06-02 22:48 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Robert Shearman, netdev, roopa

Thomas Graf <tgraf@suug.ch> writes:

> On 06/02/15 at 01:26pm, Eric W. Biederman wrote:
>> What we really want here is xfrm-lite.  By lite I mean the tunnel
>> selection criteria is simple enough that it fits into the normal
>> routing table instead of having to do weird flow based magic that
>> is rarely needed.
>> 
>> I believe what we want are the xfrm stacking of dst entries.
>
> I assume you are referring to reusing the selector and stacked
> dst. I considered that for the transmit side.
>
> Can you elaborate on this some more? How would this look like
> for the specific case of VXLAN? Any thoughts on the receive
> side? You also mention that you dislike the net_device approach.
> What do you suggest instead? The encapsulation is often postponed
> to after the packet is fully constructed. Where should it get
> hooked into?

Things I think xfrm does correct today:
- Transmitting things when an appropriate dst has been found.

Things I think xfrm could do better:
- Finding the dst entry.  Having to perform a separate lookup in a
  second set of tables looks slow, and not much maintained.

So if we focus on the normal routing case where lookup works today (aka
no source port or destination port based routing or any of the other
weird things), so that we can use a standard fib lookup, I think I can
explain what I imagine things would look like.

To be clear I am focusing on the very light weight tunnels and I am not
certain vxlan applies.  It may be more reasonable to simply have a
single ethernet-looking device that speaks vxlan behind the scenes.

If I look at vxlan as a set of ipv4 host routes (no arp, no unknown host
support) it looks like the kind of light-weight tunnel that we are
dealing with for mpls.

On the reception side packets that match the magic udp socket have their
tunneling bits stripped off and pushed up to the ip layer.  Roughly
equivalent to the current af_mpls code.
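(A sketch of that receive half using the existing udp_tunnel helpers; the
local names are invented and the details indicative rather than exact.)

/* would strip the vxlan/inner-eth headers, hand the inner packet up to
 * IP and return 0 for "consumed" (body omitted here)
 */
static int example_vxlan_encap_rcv(struct sock *sk, struct sk_buff *skb);

static int example_vxlan_open(struct net *net, struct socket **sockp)
{
	struct udp_port_cfg udp_conf = {
		.family		= AF_INET,
		.local_udp_port	= htons(4789),
	};
	struct udp_tunnel_sock_cfg tunnel_cfg = {
		.encap_type	= 1,
		.encap_rcv	= example_vxlan_encap_rcv,
	};
	int err;

	err = udp_sock_create(net, &udp_conf, sockp);
	if (err)
		return err;
	setup_udp_tunnel_sock(net, *sockp, &tunnel_cfg);
	return 0;
}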

On the transmit side there would be a host route for each remote host.
In the fib we would store a pointer to a data structure that holds a
precomputed header to be prepended to the packet (inner ethernet, vxlan,
outer udp, outer ip).  That data pointer would become dst->xfrm when the
route lookup happens and we generate a route/dst entry.  There would
also be an output function in the fib, and that output function would
become dst->output.  I would be more specific but I forget the
details of the fib_trie data structures.
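
To make that concrete, here is a rough sketch of what such a per-route
encap blob could look like (all names here, fib_encap and its fields, are
made up for illustration; nothing like this exists in the tree today):

#include <linux/skbuff.h>
#include <net/dst.h>

/* Hypothetical per-nexthop encap blob hung off a host route. */
struct fib_encap {
	unsigned int	hdr_len;	/* total precomputed header length */
	unsigned int	udp_offset;	/* where the outer udp header sits */
	struct dst_entry *inner_dst;	/* route towards the tunnel endpoint */
	int		(*output)(struct sock *sk, struct sk_buff *skb);
					/* becomes dst->output after lookup */
	u8		hdr[];		/* inner eth + vxlan + outer udp + ip,
					 * precomputed when the route is added */
};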

The output function in the dst entry in the ipv4 route would
know how to interpret the pointer in the ipv4 routing table, append
the precomputed headers, update the precomputed udp header's source port
with the flow hash of the inner packet, and have an inner dst
so that it would essentially call ip_finish_output2 again and send
the packet to its destination.
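
As a sketch of that output path, reusing the hypothetical struct fib_encap
from above (purely illustrative; a real implementation would have to get
checksums, stats and the exact dst_output entry point right):

#include <linux/string.h>
#include <linux/udp.h>

static int encap_dst_output(struct sock *sk, struct sk_buff *skb)
{
	/* For illustration, reuse the dst->xfrm slot to carry the encap
	 * pointer, as suggested above. */
	struct fib_encap *e = (struct fib_encap *)skb_dst(skb)->xfrm;
	struct udphdr *uh;

	if (skb_cow_head(skb, e->hdr_len))
		goto drop;

	/* Prepend the precomputed inner eth + vxlan + outer udp/ip headers. */
	memcpy(skb_push(skb, e->hdr_len), e->hdr, e->hdr_len);

	/* Only the outer udp source port varies: derive it from the flow
	 * hash, constrained to the ephemeral range. */
	uh = (struct udphdr *)(skb->data + e->udp_offset);
	uh->source = htons((skb_get_hash(skb) & 0x3fff) | 0xc000);

	/* Hand the packet to the route towards the tunnel endpoint, which
	 * ends up back in ip_finish_output2. */
	skb_dst_set(skb, dst_clone(e->inner_dst));
	return dst_output(skb);

drop:
	kfree_skb(skb);
	return -ENOMEM;
}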

There is some wiggle room but that is how I imagine things working, and
that is what I think we want for the mpls case.  Adding two pointers to
the fib could be interesting.  One pointer can be a union with the
output network device, the other pointer I am not certain about.

And of course we get fun cases where we have tunnels running through
other tunnels.  So there likely needs to be a bit of indirection going
on.

The problem I think needs to be solved is how to make tunnels very
lightweight and cheap, so they can scale to 1 million+.  Enough so that the
kernel can hold a full routing table full of tunnels.

It looks like xfrm is almost there but its data structures appear to be
excessively complicated and inscrutable, and they require an extra lookup.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02 22:15       ` Robert Shearman
@ 2015-06-02 22:58         ` Eric W. Biederman
  2015-06-04 15:12           ` Nicolas Dichtel
  0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-06-02 22:58 UTC (permalink / raw)
  To: Robert Shearman; +Cc: netdev, roopa, Thomas Graf

Robert Shearman <rshearma@brocade.com> writes:

> On 02/06/15 22:10, Eric W. Biederman wrote:
>> Robert Shearman <rshearma@brocade.com> writes:
>>
>>> On 02/06/15 19:11, Eric W. Biederman wrote:
>>>> Robert Shearman <rshearma@brocade.com> writes:
>>>>
>>>>> In order to be able to function as a Label Edge Router in an MPLS
>>>>> network, it is necessary to be able to take IP packets and impose an
>>>>> MPLS encap and forward them out. The traditional approach of setting
>>>>> up an interface for each "tunnel" endpoint doesn't scale for the
>>>>> common MPLS use-cases where each IP route tends to be assigned a
>>>>> different label as encap.
>>>>>
>>>>> The solution suggested here for further discussion is to provide the
>>>>> facility to define encap data on a per-nexthop basis using a new
>>>>> netlink attribue, RTA_ENCAP, which would be opaque to the IPv4/IPv6
>>>>> forwarding code, but interpreted by the virtual interface assigned to
>>>>> the nexthop.
>>>>>
>>>>> A new ipmpls interface type is defined to show the use of this
>>>>> facility to allow IP packets to be imposed with an MPLS
>>>>> encap. However, the facility is designed to be general enough to be
>>>>> used by any encapsulation/tunneling mechanism that has similar
>>>>> requirements of high-scale, high-variation-of-encap.
>>>>
>>>> I am still digging into the details but adding a new network device to
>>>> make this possible if very undesirable.
>>>>
>>>> It is a pain point.  Those network devices get to be a major source of
>>>> memory consumption when there are 4K network namespaces in existence.
>>>>
>>>> It is conceptually wrong.  The network device will never be used as an
>>>> ordinary network device.  All the network device gives you is the
>>>> ability to avoid creating an enumeration of different kinds of
>>>> encapsulation.
>>>
>>> This isn't true. The network device also gives some of the things you
>>> take for granted. Things like fragmentation through specifying the mtu
>>> on the shared tunnel device, being able to specify rules using the
>>> shared tunnel output device, IP stats, and the ability specify a
>>> different destination namespace.
>>
>> Granted you get a few more things.  It is still conceptually wrong as
>> the network device will never be used as an ordinary network device.
>>
>> Fragmentation is already silly because we are talking about multiple
>> tunnels with different properties.  You need per-route mtu to handle
>> that case.
>
> It's unlikely you'll have a huge variation in the mtus across routes,
> unless you're running in an ISP environment. In the example uses we've
> got in hand, it's highly likely there'll only be a handful of different
> mtus, if that.

Did we ever implement an mpls mtu per netdevice? (I think so.)
Anyway the tunnel mtu is easy enough to calculate in context (base mtu -
tunnel overhead).  So by default we should not need to do much.
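
For the mpls case that in-context calculation is trivial, since each label
stack entry is 4 bytes.  Something like this (illustrative only, assuming
the route knows how many labels it pushes):

static unsigned int mpls_route_mtu(unsigned int base_mtu, unsigned int labels)
{
	/* each MPLS label stack entry adds 4 bytes of overhead */
	return base_mtu - labels * 4;
}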

>> Further I am not saying you don't need an output device (which is what
>> is needed to specify a different destination namespace) I am saying that
>> having a funny mpls device is wrong as far as I can see.  Certainly it
>> is a lot of bloody unnecessary overhead.
>>
>> If we are going to design for maximum scaling (and 1 million+ routes)
>> sounds like maximum scaling we should see how far we can go without
>> dragging in the horrible heaviness of additional network devices.  35K a
>> piece last I measured it.  Just a small handful of them are already
>> scaling issues for network namespaces.
>
> For the ipmpls interface I've implemented here, you only need one per
> namespace. You could argue the same for the veth interfaces which
> would be much more commonly used in network namespaces.

But if I can avoid the extra 143M (35 KiB * 4096 namespaces) I would like to.

On the drawing board is getting cross namespace routes so with a little
luck I will only need loopback devices in most of my network namespaces
when the dust settles.

Outputting to network devices in another network namespace is
fundamentally simple but I haven't taken the time to figure out which
assumptions I may have to purge to make it work reliably.

> BTW, maybe I've missed something, or maybe netdevs have gone on a
> diet, but I count the cost of creating a basic interface at ~2700
> bytes on x86_64:
> sizeof(struct net_device) /* 2112 */ + 1 * sizeof(struct netdev_queue)
> /* 384 */ + 1 * sizeof(struct netdev_rx_queue) /* 128 */ +
> sizeof(struct netdev_hw_addr) /* 80 */ + sizeof(int) * nr_poss_cpus /*
> 4 * n */)

It is a non-trivial thing to measure.  You really have to create a lot
of them and see how much memory is consumed.  But between the per-cpu
stats, the sysctl attributes, the sysfs attributes and everything else,
an actual working netdevice in an all-yes-config kernel was consuming
something like 35K not too long ago.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-02 21:37     ` Thomas Graf
  2015-06-02 22:48       ` Eric W. Biederman
@ 2015-06-02 23:23       ` Eric W. Biederman
  2015-06-03  9:50         ` Thomas Graf
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-06-02 23:23 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Robert Shearman, netdev, roopa

Thomas Graf <tgraf@suug.ch> writes:

> On 06/02/15 at 01:26pm, Eric W. Biederman wrote:
>> What we really want here is xfrm-lite.  By lite I mean the tunnel
>> selection criteria is simple enough that it fits into the normal
>> routing table instead of having to do weird flow based magic that
>> is rarely needed.
>> 
>> I believe what we want are the xfrm stacking of dst entries.
>
> I assume you are referring to reusing the selector and stacked
> dst. I considered that for the transmit side.
>
> Can you elaborate on this some more? How would this look like
> for the specific case of VXLAN? Any thoughts on the receive
> side? You also mention that you dislike the net_device approach.
> What do you suggest instead? The encapsulation is often postponed
> to after the packet is fully constructed. Where should it get
> hooked into?

Thomas I may have misunderstood what you are trying to do.

Is what you were aiming for roughly the existing RTA_FLOW so you can
transmit packets out one network device and have enough information to
know which of a set of tunnels of a given type you want the packets to go
into?

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-02 23:23       ` Eric W. Biederman
@ 2015-06-03  9:50         ` Thomas Graf
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Graf @ 2015-06-03  9:50 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Robert Shearman, netdev, roopa

On 06/02/15 at 06:23pm, Eric W. Biederman wrote:
> Thomas I may have misunderstood what you are trying to do.
> 
> Is what you were aiming for roughly the existing RTA_FLOW so you can
> transmit packets out one network device and have enough information to
> know which of a set of tunnels of a given type you want the packets go
> into?

The aim is to extend the existing flow forwarding decisions
with the ability to attach encapsulation instructions to the
packet and allow flow forwarding and filtering decisions based
on encapsulation information such as outer & encap header fields.
On top of that, since we support various L2-in-something encaps,
it must also be usable by bridges including OVS and Linux bridge.

So for a pure routing solution this would look like:

        ip route add 20.1.1.1/8 \
        via tunnel 10.1.1.1 id 20 dev vxlan0

Receive:

        ip route add 20.1.1.2/32 tunnel id 20 dev veth0
or:
        ip rule add from all tunnel-id 20 lookup 20


On 06/02/15 at 05:48pm, Eric W. Biederman wrote:
> Things I think xfrm does correct today:
> - Transmitting things when an appropriate dst has been found.
> 
> Things I think xfrm could do better:
> - Finding the dst entry.  Having to perform a separate lookup in a
>   second set of tables looks slow, and not much maintained.
> 
> So if we focus on the normal routing case where lookup works today (aka
> no source port or destination port based routing or any of the other
> weird things so we can use a standard fib lookup I think I can explain
> what I imagine things would look like.

Right. That's what I expect the routing transmit path for flow based
tunnels to look like. No modification to the FIB lookup logic.

> To be clear I am focusing on the very light weight tunnels and I am not
> certain vxlan applies.  It may be more reasonable to simply have a
> single ethernet looking device that does speaks vxlan behind the scenes.
> 
> If I look at vxlan as a set of ipv4 host routes (no arp, no unknown host
> support) it looks like the kind of light-weight tunnel that we are
> dealing with for mpls.
> 
> On the reception side packets that match the magic udp socket have their
> tunneling bits stripped off and pushed up to the ip layer.  Roughly
> equivalent to the current af_mpls code.

That's the easy part. Where do you match on the VNI? How do you handle
BUM traffic? The whole point here is to get rid of the requirement
to maintain a VXLAN net_device for every VNI, or more generally, a
virtual tunnel device for every virtual network. As we know, it is
a non-scalable solution.

> On the transmit side there would be a host route for each remote host.
> In the fib we would store a pointer to a data structure that holds a
> precomputed header to be prepended to the packet (inner ethernet, vxlan,
> outer udp, outer ip).

So we need a FIB entry for each inner header L2 address pair? This
would duplicate the neighbour cache in each namespace. I don't think
this will scale, see a couple of paragraphs below.

I looked at getting rid of the VXLAN (or other encap) net_device but
this would require storing all parameters including all the
checksumming parameters, flags, ports, ... for every single route. This
would blow up the size of a route considerably. What is proposed instead
is that the parameters which are likely per flow are put in the route
while the parameters which are likely shared remain in the net_device.
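
As a purely illustrative split (made-up struct names, not actual vxlan
code), the per-route part would only carry what genuinely varies per
destination, while everything else stays on the shared device:

#include <linux/types.h>

/* Hypothetical: what would live in the per-route encap blob... */
struct vxlan_route_encap {
	__be32	remote_ip;	/* tunnel endpoint */
	__be32	vni;		/* virtual network id for this route */
};

/* ...versus what stays on the single shared net_device. */
struct vxlan_shared_cfg {
	__be16	dst_port;	/* vxlan udp destination port */
	u16	flags;		/* checksumming, gbp, rco, ... */
	u8	tos;
	u8	ttl;
};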

> That data pointer would become dst->xfrm when the
> route lookup happens and we generate a route/dst entry.  There would
> also be an output function in the fib and that output function would
> be compue dst->output.  I would be more specific but I forget the
> details of the fib_trie data structures.

I assume you would propose something like a chained dst output so we
call the L2 dst output first which then in turn calls the vxlan dst
output to perform the encap and hooks it back into L3 for the outer
header? How would this work for bridges?

> The output function function in the dst entry in the ipv4 route would
> know how to interpret the pointer in the ipv4 routing table, append
> the precomputed headers, update the precomputed udp header's source port
> with the flow hash of the the inner packet, and have an inner dst
> so that would essentially call ip_finish_output2 again and sending
> the packet to it's destination.

What I don't understand is what exactly this buys us. I understand
that you want to get rid of the net_device per netns in a VRF == netns
architecture. Let's think further:

Thinking outside of the actual implementation for a bit. I really
don't want to keep a full copy of the entire underlay L2/L3 state
in each namespace. I also don't want to keep a map of overlay ip to
tunnel endpoint in each namespace. I want to keep as little as
possible in the guest namespace, in particular if we are talking 4K
namespaces with up to 1M tunnel endpoints (dude, what kind of cluster
are you running? ;-)

My current thinking is to maintain a single namespace to perform
the FIB lookup which maps outer IPs to the tunnel endpoint and which
also contains the neighbour cache for the underlay. This requires a
single tunnel net_device or more generally, one shared net_device
per shared set of parameters. The namespacing of the routes occurs
through multiple routing tables or by using the mark to distinguish
between guest namespaces. My plan there is to extend veth with the
capability to set a mark value to all packets and thus extend the
namespaces into shared data structures as we typically already
support mark in all common networking data structures.
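
A minimal sketch of that veth extension (the per-device mark and its
netlink plumbing are assumed here, they don't exist today): the xmit path
would stamp the configured mark on every skb, and the shared namespace
would then use fwmark-based ip rules to pick the per-guest routing table.

/* Hypothetical helper called from veth's xmit path; 'mark' would be a
 * new netlink-configurable per-device attribute. */
static void veth_stamp_mark(struct sk_buff *skb, u32 mark)
{
	if (mark)
		skb->mark = mark;
}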

> There is some wiggle room but that is how I imagine things working, and
> that is what I think we want for the mpls case.  Adding two pointers to
> the fib could be interesting.  One pointer can be a union with the
> output network device, the other pointer I am not certain about.
> 
> And of course we get fun cases where we have tunnels running through
> other tunnels.  So there likely needs to be a bit of indirection going
> on.
> 
> The problem I think needs to be solved is how to make tunnels very light
> weight and cheap, so they can scale to 1 million+.  Enough so that the
> kernel can hold a full routing table full of tunnels.

ACK. Although I don't want to hold 4K * full routing tables ;-)

> It looks like xfrm is almost there but its data structures appear to be
> excessively complicated and inscrutable, and they require an extra lookup.

I'm still not fully understanding why you want to keep the encap
information in a separate table. Or are you just talking about the use
of the dst field to attach the encap information to the packet?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02 21:43     ` Thomas Graf
@ 2015-06-03 13:30       ` Robert Shearman
  0 siblings, 0 replies; 32+ messages in thread
From: Robert Shearman @ 2015-06-03 13:30 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, Eric W. Biederman, roopa

On 02/06/15 22:43, Thomas Graf wrote:
> On 06/02/15 at 02:28pm, Robert Shearman wrote:
>> Nesting attributes inside the RTA_ENCAP blob should be supported by the
>> patch series today. Something like this:
>
> Sure. I'm not seeing such a construct for the MPLS case yet.

Right, that is something that should probably be done.

> I'm happy to rebase my patches on top of your nexthop implementation.
> It is definitely superior. Are you maintaining a git tree somewhere?

I wasn't, but I am now:

https://github.com/rshearman/linux/tree/mpls-encap
https://github.com/rshearman/iproute2/tree/mpls-encap

Thanks,
Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-02 21:06         ` Robert Shearman
@ 2015-06-03 18:43           ` Vivek Venkatraman
  2015-06-04 18:46             ` Robert Shearman
  0 siblings, 1 reply; 32+ messages in thread
From: Vivek Venkatraman @ 2015-06-03 18:43 UTC (permalink / raw)
  To: Robert Shearman
  Cc: roopa, netdev, Eric W. Biederman, Thomas Graf, Dinesh Dutt

On Tue, Jun 2, 2015 at 2:06 PM, Robert Shearman <rshearma@brocade.com> wrote:
> On 02/06/15 19:57, roopa wrote:
>>
>> On 6/2/15, 9:33 AM, Robert Shearman wrote:
>>>
>>> On 02/06/15 17:15, roopa wrote:
>>>>
>>>> On 6/1/15, 9:46 AM, Robert Shearman wrote:
>>>>>
>>>>> Allow creating an mpls device for the purposes of encapsulating IP
>>>>> packets with:
>>>>>
>>>>>    ip link add type ipmpls
>>>>>
>>>>> This device defines its per-nexthop encapsulation data as a stack of
>>>>> labels, in the same format as for RTA_NEWST. It uses the encap data
>>>>> which will have been stored in the IP route to encapsulate the packet
>>>>> with that stack of labels, with the last label corresponding to a
>>>>> local label that defines how the packet will be sent out. The device
>>>>> sends packets over loopback to the local MPLS forwarding logic which
>>>>> performs all of the work.
>>>>>
>>>>>
>>>> Maybe a silly question, but when you loop the packet back, what does the
>>>> local MPLS forwarding logic
>>>> lookup with ? It probably assumes there is a mpls route with that label
>>>> and nexthop.
>>>> Will this need any internal labels (thinking same label stack different
>>>> tunnel device etc) ?
>>>
>>>
>>> Yes, it requires that local/internal labels have been allocated and
>>> label routes installed in the label table for them.
>>
>> This is our only concern.
>>>
>>>
>>> It is entirely possible to put the outgoing interface into the encap
>>> data to avoid having to allocate extra labels, but I did it this way
>>> in order to support PIC Core for MPLS-VPN routes.
>>
>>
>> hmm..., is a netdevice must in this case.., can you please elaborate on
>> this ?.
>
>
> Yes, the ipmpls device would still be used to perform the encapsulation,
> transitioning from the IP forwarding path to the MPLS forwarding path.
>

Transitioning from IP forwarding to MPLS forwarding as you have here
will certainly facilitate PIC core when another path exists to the
edge. But it cannot deal with PIC edge, right? Additionally, this
approach would mean that the user's (iproute2) view would be rather
strange - while the actual forwarding requires labels L1 and L2
(bottom) to be pushed when forwarding to a destination, it would look
as if labels L3 and L2 are pushed and then L3 is swapped with L1.

A different way to achieve PIC (core and edge) without transitioning
from IP forwarding to MPLS forwarding may be to introduce the concept
of an alternate nexthop with something (e.g., link status) determining
which nexthop is used.

> Thanks,
> Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
  2015-06-02 22:58         ` Eric W. Biederman
@ 2015-06-04 15:12           ` Nicolas Dichtel
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Dichtel @ 2015-06-04 15:12 UTC (permalink / raw)
  To: Eric W. Biederman, Robert Shearman; +Cc: netdev, roopa, Thomas Graf

Le 03/06/2015 00:58, Eric W. Biederman a écrit :
[snip]
> On the drawing board is getting cross namespace routes so with a little
> luck I will only need loopback devices in most of my network namespaces
> when the dust settles.
+1

I've started to look at this, but I don't have enough time right now. But this
feature will be a good optimization from a scalability and memory consumption
point of view.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-03 18:43           ` Vivek Venkatraman
@ 2015-06-04 18:46             ` Robert Shearman
  2015-06-04 21:38               ` Vivek Venkatraman
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Shearman @ 2015-06-04 18:46 UTC (permalink / raw)
  To: Vivek Venkatraman
  Cc: roopa, netdev, Eric W. Biederman, Thomas Graf, Dinesh Dutt

On 03/06/15 19:43, Vivek Venkatraman wrote:
> On Tue, Jun 2, 2015 at 2:06 PM, Robert Shearman <rshearma@brocade.com> wrote:
>> On 02/06/15 19:57, roopa wrote:
>>>
>>> On 6/2/15, 9:33 AM, Robert Shearman wrote:
>>>>
>>>> On 02/06/15 17:15, roopa wrote:
>>>>>
>>>>> On 6/1/15, 9:46 AM, Robert Shearman wrote:
>>>>>>
>>>>>> Allow creating an mpls device for the purposes of encapsulating IP
>>>>>> packets with:
>>>>>>
>>>>>>     ip link add type ipmpls
>>>>>>
>>>>>> This device defines its per-nexthop encapsulation data as a stack of
>>>>>> labels, in the same format as for RTA_NEWST. It uses the encap data
>>>>>> which will have been stored in the IP route to encapsulate the packet
>>>>>> with that stack of labels, with the last label corresponding to a
>>>>>> local label that defines how the packet will be sent out. The device
>>>>>> sends packets over loopback to the local MPLS forwarding logic which
>>>>>> performs all of the work.
>>>>>>
>>>>>>
>>>>> Maybe a silly question, but when you loop the packet back, what does the
>>>>> local MPLS forwarding logic
>>>>> lookup with ? It probably assumes there is a mpls route with that label
>>>>> and nexthop.
>>>>> Will this need any internal labels (thinking same label stack different
>>>>> tunnel device etc) ?
>>>>
>>>>
>>>> Yes, it requires that local/internal labels have been allocated and
>>>> label routes installed in the label table for them.
>>>
>>> This is our only concern.
>>>>
>>>>
>>>> It is entirely possible to put the outgoing interface into the encap
>>>> data to avoid having to allocate extra labels, but I did it this way
>>>> in order to support PIC Core for MPLS-VPN routes.
>>>
>>>
>>> hmm..., is a netdevice must in this case.., can you please elaborate on
>>> this ?.
>>
>>
>> Yes, the ipmpls device would still be used to perform the encapsulation,
>> transitioning from the IP forwarding path to the MPLS forwarding path.
>>
>
> Transitioning from IP forwarding to MPLS forwarding as you have here
> will certainly facilitate PIC core when another path exists to the
> edge. But it cannot deal with PIC edge, right?

Right, it won't allow PIC edge to work as is, but it could be a step 
towards implementing PIC edge.

> Additionally, this
> approach would mean that the user's (iproute2) view would be rather
> strange - while the actual forwarding requires labels L1 and L2
> (bottom) to be pushed when forwarding to a destination, it would look
> as if labels L3 and L2 are pushed and then L3 is swapped with L1.

Right, but a level of indirection is required somehow. The natural level 
of indirection is the L3 nexthop, but that is more complicated and I 
don't know if that sort of change would be welcome.

> A different way to achieve PIC (core and edge) without transitioning
> from IP forwarding to MPLS forwarding may be to introduce the concept
> of an alternate nexthop with something (e.g., link status) determining
> which nexthop is used.

I'm not sure I understand. Could you elaborate on this?

Thanks,
Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
  2015-06-04 18:46             ` Robert Shearman
@ 2015-06-04 21:38               ` Vivek Venkatraman
  0 siblings, 0 replies; 32+ messages in thread
From: Vivek Venkatraman @ 2015-06-04 21:38 UTC (permalink / raw)
  To: Robert Shearman
  Cc: roopa, netdev, Eric W. Biederman, Thomas Graf, Dinesh Dutt

On Thu, Jun 4, 2015 at 11:46 AM, Robert Shearman <rshearma@brocade.com> wrote:
> On 03/06/15 19:43, Vivek Venkatraman wrote:
>>
>> On Tue, Jun 2, 2015 at 2:06 PM, Robert Shearman <rshearma@brocade.com>
>> wrote:
>>>
>>> On 02/06/15 19:57, roopa wrote:
>>>>
>>>>
>>>> On 6/2/15, 9:33 AM, Robert Shearman wrote:
>>>>>
>>>>>
>>>>> On 02/06/15 17:15, roopa wrote:
>>>>>>
>>>>>>
>>>>>> On 6/1/15, 9:46 AM, Robert Shearman wrote:
>>>>>>>
>>>>>>>
>>>>>>> Allow creating an mpls device for the purposes of encapsulating IP
>>>>>>> packets with:
>>>>>>>
>>>>>>>     ip link add type ipmpls
>>>>>>>
>>>>>>> This device defines its per-nexthop encapsulation data as a stack of
>>>>>>> labels, in the same format as for RTA_NEWST. It uses the encap data
>>>>>>> which will have been stored in the IP route to encapsulate the packet
>>>>>>> with that stack of labels, with the last label corresponding to a
>>>>>>> local label that defines how the packet will be sent out. The device
>>>>>>> sends packets over loopback to the local MPLS forwarding logic which
>>>>>>> performs all of the work.
>>>>>>>
>>>>>>>
>>>>>> Maybe a silly question, but when you loop the packet back, what does
>>>>>> the
>>>>>> local MPLS forwarding logic
>>>>>> lookup with ? It probably assumes there is a mpls route with that
>>>>>> label
>>>>>> and nexthop.
>>>>>> Will this need any internal labels (thinking same label stack
>>>>>> different
>>>>>> tunnel device etc) ?
>>>>>
>>>>>
>>>>>
>>>>> Yes, it requires that local/internal labels have been allocated and
>>>>> label routes installed in the label table for them.
>>>>
>>>>
>>>> This is our only concern.
>>>>>
>>>>>
>>>>>
>>>>> It is entirely possible to put the outgoing interface into the encap
>>>>> data to avoid having to allocate extra labels, but I did it this way
>>>>> in order to support PIC Core for MPLS-VPN routes.
>>>>
>>>>
>>>>
>>>> hmm..., is a netdevice must in this case.., can you please elaborate on
>>>> this ?.
>>>
>>>
>>>
>>> Yes, the ipmpls device would still be used to perform the encapsulation,
>>> transitioning from the IP forwarding path to the MPLS forwarding path.
>>>
>>
>> Transitioning from IP forwarding to MPLS forwarding as you have here
>> will certainly facilitate PIC core when another path exists to the
>> edge. But it cannot deal with PIC edge, right?
>
>
> Right, it won't allow to PIC edge to work as is, but it could be a step
> towards implementing PIC edge.
>
>> Additionally, this
>> approach would mean that the user's (iproute2) view would be rather
>> strange - while the actual forwarding requires labels L1 and L2
>> (bottom) to be pushed when forwarding to a destination, it would look
>> as if labels L3 and L2 are pushed and then L3 is swapped with L1.
>
>
> Right, but a level of indirection is required somehow. The natural level of
> indirection is the L3 nexthop, but that is more complicated and I don't know
> if that sort of change would be welcome.
>
>> A different way to achieve PIC (core and edge) without transitioning
>> from IP forwarding to MPLS forwarding may be to introduce the concept
>> of an alternate nexthop with something (e.g., link status) determining
>> which nexthop is used.
>
>
> I'm not sure I understand. Could you elaborate on this?

Indirection on the L3 nexthop is what I meant. I agree that it would
be more complicated. The thought was to model it like ECMP, so there
would be a set of nexthops (2 for main and alternate), but it would be
a protection group instead of a shared group.
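
To illustrate the idea (all names here are made up, and keying off carrier
state is just one possible trigger):

#include <linux/netdevice.h>
#include <net/ip_fib.h>

/* Hypothetical protection group: unlike ECMP, which hashes flows across
 * all live nexthops, always prefer the primary and fall back to the
 * backup only when the primary is unusable. */
struct fib_prot_group {
	struct fib_nh *primary;
	struct fib_nh *backup;
};

static struct fib_nh *prot_group_select(const struct fib_prot_group *pg)
{
	if (!(pg->primary->nh_flags & RTNH_F_DEAD) &&
	    pg->primary->nh_dev && netif_carrier_ok(pg->primary->nh_dev))
		return pg->primary;
	return pg->backup;
}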

>
> Thanks,
> Rob

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2015-06-04 21:38 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-01 16:46 [RFC net-next 0/3] IP imposition of per-nh MPLS encap Robert Shearman
2015-06-01 16:46 ` [RFC net-next 1/3] net: infra for per-nexthop encap data Robert Shearman
2015-06-02 18:15   ` Eric W. Biederman
2015-06-01 16:46 ` [RFC net-next 2/3] ipv4: storing and retrieval of per-nexthop encap Robert Shearman
2015-06-02 16:01   ` roopa
2015-06-02 16:35     ` Robert Shearman
2015-06-01 16:46 ` [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls Robert Shearman
2015-06-02 16:15   ` roopa
2015-06-02 16:33     ` Robert Shearman
2015-06-02 18:57       ` roopa
2015-06-02 21:06         ` Robert Shearman
2015-06-03 18:43           ` Vivek Venkatraman
2015-06-04 18:46             ` Robert Shearman
2015-06-04 21:38               ` Vivek Venkatraman
2015-06-02 18:26   ` Eric W. Biederman
2015-06-02 21:37     ` Thomas Graf
2015-06-02 22:48       ` Eric W. Biederman
2015-06-02 23:23       ` Eric W. Biederman
2015-06-03  9:50         ` Thomas Graf
2015-06-02  0:06 ` [RFC net-next 0/3] IP imposition of per-nh MPLS encap Thomas Graf
2015-06-02 13:28   ` Robert Shearman
2015-06-02 21:43     ` Thomas Graf
2015-06-03 13:30       ` Robert Shearman
2015-06-02 15:31 ` roopa
2015-06-02 18:30   ` Eric W. Biederman
2015-06-02 18:39     ` roopa
2015-06-02 18:11 ` Eric W. Biederman
2015-06-02 20:57   ` Robert Shearman
2015-06-02 21:10     ` Eric W. Biederman
2015-06-02 22:15       ` Robert Shearman
2015-06-02 22:58         ` Eric W. Biederman
2015-06-04 15:12           ` Nicolas Dichtel
