[PATCH net-next 0/3] net: ipv6: Improve user experience with multipath routes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH net-next 0/3] net: ipv6: Improve user experience with multipath routes
@ 2017-01-15 20:07 David Ahern
  2017-01-15 20:07 ` [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route David Ahern
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: David Ahern @ 2017-01-15 20:07 UTC (permalink / raw)
  To: netdev; +Cc: ddutt, David Ahern

This series closes a couple of gaps between IPv4 and IPv6 with respect
to multipath routes.

Patch 1 allows IPv6 multipath routes to be deleted using just the prefix
and length - similar to what IPv4 allows.

Patch 2 removes the nowait arg which is always 0.

Patch 3 allows IPv6 multipath routes to be returned encoded in the
RTA_MULTIPATH attribute as opposed to a series of single routes.

Backwards compatibility for both patch 1 and 3 is maintained by
requiring users to opt in to the new behavior via change in the
request sent to the kernel.

David Ahern (3):
  net: ipv6: Allow shorthand delete of all nexthops in multipath route
  net: ipv6: remove nowait arg to rt6_fill_node
  net: ipv6: Add option to dump multipath routes via RTA_MULTIPATH
    attribute

 include/linux/mroute6.h        |   2 +-
 include/net/ip6_fib.h          |   4 +-
 include/net/ip6_route.h        |   1 +
 include/uapi/linux/rtnetlink.h |   1 +
 net/ipv6/ip6_fib.c             |  29 +++++++-
 net/ipv6/ip6mr.c               |   9 +--
 net/ipv6/route.c               | 152 +++++++++++++++++++++++++++++++----------
 7 files changed, 153 insertions(+), 45 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route
  2017-01-15 20:07 [PATCH net-next 0/3] net: ipv6: Improve user experience with multipath routes David Ahern
@ 2017-01-15 20:07 ` David Ahern
  2017-01-16 15:48   ` Roopa Prabhu
  2017-01-17  0:51   ` David Miller
  2017-01-15 20:07 ` [PATCH net-next 2/3] net: ipv6: remove nowait arg to rt6_fill_node David Ahern
  2017-01-15 20:07 ` [PATCH net-next 3/3] net: ipv6: Add option to dump multipath routes via RTA_MULTIPATH attribute David Ahern
  2 siblings, 2 replies; 11+ messages in thread
From: David Ahern @ 2017-01-15 20:07 UTC (permalink / raw)
  To: netdev; +Cc: ddutt, David Ahern

IPv4 allows multipath routes to be deleted using just the prefix and
length. For example:
    $ ip ro ls vrf red
    unreachable default metric 8192
    1.1.1.0/24
        nexthop via 10.100.1.254  dev eth1 weight 1
        nexthop via 10.11.200.2  dev eth11.200 weight 1
    10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
    10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3

    $ ip ro del 1.1.1.0/24 vrf red

    $ ip ro ls vrf red
    unreachable default metric 8192
    10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
    10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3

The same notation does not work with IPv6 because of how multipath routes
are implemented for IPv6. For IPv6 only the first nexthop of a multipath
route is deleted if the request contains only a prefix and length. This
leads to unnecessary complexity in userspace dealing with IPv6 multipath
routes.

This patch allows all nexthops to be deleted without specifying each one
in the delete request by passing a new flag, RTM_F_ALL_NEXTHOPS, in
rtm_flags. Internally, this is done by walking the sibling list of the
route matching the specifications given (prefix, length, metric, protocol,
etc).

With this patch (and an updated iproute2 command):
    $  ip -6 ro ls vrf red
    2001:db8::/120 via 2001:db8:1::62 dev eth1 metric 256  pref medium
    2001:db8::/120 via 2001:db8:1::61 dev eth1 metric 256  pref medium
    2001:db8::/120 via 2001:db8:1::60 dev eth1 metric 256  pref medium
    2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
    ...

    $ ip -6 ro del vrf red 1111::1/120
    $ ip -6 ro ls vrf red
    2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
    ...

The flag is added to fib6_config by converting fc_type to a u16 (as
noted fc_type only uses 8 bits). The new u16 hole is a bitmap with
fc_delete_all_nexthop as the first bit.

Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v2
- switched example to rfc 3849 documentation address per request
- changed delete loop to explicitly look at siblings list for
  first route matching specs given (metric, protocol, etc)

 include/net/ip6_fib.h          |  4 +++-
 include/uapi/linux/rtnetlink.h |  1 +
 net/ipv6/route.c               | 28 +++++++++++++++++++++++++---
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index a74e2aa40ef4..11ab99e87c5f 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -37,7 +37,9 @@ struct fib6_config {
 	int		fc_ifindex;
 	u32		fc_flags;
 	u32		fc_protocol;
-	u32		fc_type;	/* only 8 bits are used */
+	u16		fc_type;	/* only 8 bits are used */
+	u16		fc_delete_all_nexthop : 1,
+			__unused : 15;
 
 	struct in6_addr	fc_dst;
 	struct in6_addr	fc_src;
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 8c93ad1ef9ab..7fb206bc42f9 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -276,6 +276,7 @@ enum rt_scope_t {
 #define RTM_F_EQUALIZE		0x400	/* Multipath equalizer: NI	*/
 #define RTM_F_PREFIX		0x800	/* Prefix addresses		*/
 #define RTM_F_LOOKUP_TABLE	0x1000	/* set rtm_table to FIB lookup result */
+#define RTM_F_ALL_NEXTHOPS	0x2000	/* delete all nexthops (IPv6) */
 
 /* Reserved table identifiers */
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ce5aaf448c54..c95e2f941468 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2143,6 +2143,26 @@ int ip6_del_rt(struct rt6_info *rt)
 	return __ip6_del_rt(rt, &info);
 }
 
+/* called with table lock held */
+static int __ip6_route_del(struct rt6_info *rt, struct fib6_config *cfg)
+{
+	int err;
+
+	if (rt->rt6i_nsiblings && cfg->fc_delete_all_nexthop) {
+		struct rt6_info *sibling, *next_sibling;
+
+		list_for_each_entry_safe(sibling, next_sibling,
+					 &rt->rt6i_siblings,
+					 rt6i_siblings) {
+			err = fib6_del(sibling, &cfg->fc_nlinfo);
+			if (err)
+				return err;
+		}
+	}
+
+	return fib6_del(rt, &cfg->fc_nlinfo);
+}
+
 static int ip6_route_del(struct fib6_config *cfg)
 {
 	struct fib6_table *table;
@@ -2176,10 +2196,9 @@ static int ip6_route_del(struct fib6_config *cfg)
 				continue;
 			if (cfg->fc_protocol && cfg->fc_protocol != rt->rt6i_protocol)
 				continue;
-			dst_hold(&rt->dst);
-			read_unlock_bh(&table->tb6_lock);
 
-			return __ip6_del_rt(rt, &cfg->fc_nlinfo);
+			err = __ip6_route_del(rt, cfg);
+			break;
 		}
 	}
 	read_unlock_bh(&table->tb6_lock);
@@ -2849,6 +2868,9 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (rtm->rtm_flags & RTM_F_CLONED)
 		cfg->fc_flags |= RTF_CACHE;
 
+	if (rtm->rtm_flags & RTM_F_ALL_NEXTHOPS)
+		cfg->fc_delete_all_nexthop = 1;
+
 	cfg->fc_nlinfo.portid = NETLINK_CB(skb).portid;
 	cfg->fc_nlinfo.nlh = nlh;
 	cfg->fc_nlinfo.nl_net = sock_net(skb->sk);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next 2/3] net: ipv6: remove nowait arg to rt6_fill_node
  2017-01-15 20:07 [PATCH net-next 0/3] net: ipv6: Improve user experience with multipath routes David Ahern
  2017-01-15 20:07 ` [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route David Ahern
@ 2017-01-15 20:07 ` David Ahern
  2017-01-15 20:07 ` [PATCH net-next 3/3] net: ipv6: Add option to dump multipath routes via RTA_MULTIPATH attribute David Ahern
  2 siblings, 0 replies; 11+ messages in thread
From: David Ahern @ 2017-01-15 20:07 UTC (permalink / raw)
  To: netdev; +Cc: ddutt, David Ahern

All callers of rt6_fill_node pass 0 for nowait arg. Remove the arg and
simplify rt6_fill_node accordingly.

rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the
nowait arg from it as well.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/linux/mroute6.h |  2 +-
 net/ipv6/ip6mr.c        |  9 ++-------
 net/ipv6/route.c        | 27 ++++++++++-----------------
 3 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/include/linux/mroute6.h b/include/linux/mroute6.h
index 19a1c0c2993b..ce44e3e96d27 100644
--- a/include/linux/mroute6.h
+++ b/include/linux/mroute6.h
@@ -116,7 +116,7 @@ struct mfc6_cache {
 
 struct rtmsg;
 extern int ip6mr_get_route(struct net *net, struct sk_buff *skb,
-			   struct rtmsg *rtm, int nowait, u32 portid);
+			   struct rtmsg *rtm, u32 portid);
 
 #ifdef CONFIG_IPV6_MROUTE
 extern struct sock *mroute6_socket(struct net *net, struct sk_buff *skb);
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index e275077e8af2..babaf3ec2742 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -2288,7 +2288,7 @@ static int __ip6mr_fill_mroute(struct mr6_table *mrt, struct sk_buff *skb,
 }
 
 int ip6mr_get_route(struct net *net, struct sk_buff *skb, struct rtmsg *rtm,
-		    int nowait, u32 portid)
+		    u32 portid)
 {
 	int err;
 	struct mr6_table *mrt;
@@ -2315,11 +2315,6 @@ int ip6mr_get_route(struct net *net, struct sk_buff *skb, struct rtmsg *rtm,
 		struct net_device *dev;
 		int vif;
 
-		if (nowait) {
-			read_unlock(&mrt_lock);
-			return -EAGAIN;
-		}
-
 		dev = skb->dev;
 		if (!dev || (vif = ip6mr_find_vif(mrt, dev)) < 0) {
 			read_unlock(&mrt_lock);
@@ -2357,7 +2352,7 @@ int ip6mr_get_route(struct net *net, struct sk_buff *skb, struct rtmsg *rtm,
 		return err;
 	}
 
-	if (!nowait && (rtm->rtm_flags&RTM_F_NOTIFY))
+	if (rtm->rtm_flags & RTM_F_NOTIFY)
 		cache->mfc_flags |= MFC_NOTIFY;
 
 	err = __ip6mr_fill_mroute(mrt, skb, cache, rtm);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c95e2f941468..9f6f28ecd065 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3191,7 +3191,7 @@ static int rt6_fill_node(struct net *net,
 			 struct sk_buff *skb, struct rt6_info *rt,
 			 struct in6_addr *dst, struct in6_addr *src,
 			 int iif, int type, u32 portid, u32 seq,
-			 int prefix, int nowait, unsigned int flags)
+			 int prefix, unsigned int flags)
 {
 	u32 metrics[RTAX_MAX];
 	struct rtmsg *rtm;
@@ -3283,19 +3283,12 @@ static int rt6_fill_node(struct net *net,
 	if (iif) {
 #ifdef CONFIG_IPV6_MROUTE
 		if (ipv6_addr_is_multicast(&rt->rt6i_dst.addr)) {
-			int err = ip6mr_get_route(net, skb, rtm, nowait,
-						  portid);
-
-			if (err <= 0) {
-				if (!nowait) {
-					if (err == 0)
-						return 0;
-					goto nla_put_failure;
-				} else {
-					if (err == -EMSGSIZE)
-						goto nla_put_failure;
-				}
-			}
+			int err = ip6mr_get_route(net, skb, rtm, portid);
+
+			if (err == 0)
+				return 0;
+			if (err < 0)
+				goto nla_put_failure;
 		} else
 #endif
 			if (nla_put_u32(skb, RTA_IIF, iif))
@@ -3363,7 +3356,7 @@ int rt6_dump_route(struct rt6_info *rt, void *p_arg)
 	return rt6_fill_node(arg->net,
 		     arg->skb, rt, NULL, NULL, 0, RTM_NEWROUTE,
 		     NETLINK_CB(arg->cb->skb).portid, arg->cb->nlh->nlmsg_seq,
-		     prefix, 0, NLM_F_MULTI);
+		     prefix, NLM_F_MULTI);
 }
 
 static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
@@ -3454,7 +3447,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 
 	err = rt6_fill_node(net, skb, rt, &fl6.daddr, &fl6.saddr, iif,
 			    RTM_NEWROUTE, NETLINK_CB(in_skb).portid,
-			    nlh->nlmsg_seq, 0, 0, 0);
+			    nlh->nlmsg_seq, 0, 0);
 	if (err < 0) {
 		kfree_skb(skb);
 		goto errout;
@@ -3481,7 +3474,7 @@ void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info,
 		goto errout;
 
 	err = rt6_fill_node(net, skb, rt, NULL, NULL, 0,
-				event, info->portid, seq, 0, 0, nlm_flags);
+				event, info->portid, seq, 0, nlm_flags);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in rt6_nlmsg_size() */
 		WARN_ON(err == -EMSGSIZE);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next 3/3] net: ipv6: Add option to dump multipath routes via RTA_MULTIPATH attribute
  2017-01-15 20:07 [PATCH net-next 0/3] net: ipv6: Improve user experience with multipath routes David Ahern
  2017-01-15 20:07 ` [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route David Ahern
  2017-01-15 20:07 ` [PATCH net-next 2/3] net: ipv6: remove nowait arg to rt6_fill_node David Ahern
@ 2017-01-15 20:07 ` David Ahern
  2017-01-16 17:40   ` David Ahern
  2 siblings, 1 reply; 11+ messages in thread
From: David Ahern @ 2017-01-15 20:07 UTC (permalink / raw)
  To: netdev; +Cc: ddutt, David Ahern

IPv6 returns multipath routes as a series of individual routes making
their display different than IPv4 and putting the burden on the user to
see that a route is part of a multipath route. This patch addresses this
difference, allowing users to request multipath routes to be returned
using the RTA_MULTIPATH attribute.

To maintain backwards compatibility, a user has to request the change
in behavior. Unfortunately, adding a flag to the header similar to a
previous patch does not work here as the netlink header for route dumps
can be either rtgenmsg or ifinfomsg. sysctl is the other commonly used
option for backwards compatibility but it is overly abused and is not
really appropriate for this situation since each request should be able
to specify the preference for the RTA_MULTIPATH attribute (e.g, one
command may understand RTA_MULTIPATH for IPv6 while another may not).

This patch relies on the existence of the RTA_MULTIPATH attribute
appeneded to the dump request -- similar to how attributes can be appended
to link dumps. Kernel side if the RTA_MULTIPATH attribute exists in the
dump request then the RTM_F_ALL_NEXTHOPS is set in a new rtm_flags
variable added to rt6_rtnl_dump_arg. The rtm_flags is then used by
rt6_dump_route to decided if the RTA_MULTIPATH attribute is used.

If it is enabled then rt6_fill_node walks the sibling list for a route
and encodes each as a next hop within RTA_MULTIPATH, and then
fib6_dump_node advances its cursor to the last sibling route in the
current route.

The end result is that IPv6 multipath routes can be displayed in a format
similar to IPv4:

    $ ip -6 ro ls vrf red
    2001:db8::/120 metric 1024
	    nexthop via 2001:db8:1::62  dev eth1 weight 1
	    nexthop via 2001:db8:1::61  dev eth1 weight 1
	    nexthop via 2001:db8:1::60  dev eth1 weight 1
	    nexthop via 2001:db8:1::59  dev eth1 weight 1
    2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
    ...

Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/net/ip6_route.h |   1 +
 net/ipv6/ip6_fib.c      |  29 ++++++++++++-
 net/ipv6/route.c        | 107 +++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 116 insertions(+), 21 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 9dc2c182a263..7620974826a5 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -154,6 +154,7 @@ struct rt6_rtnl_dump_arg {
 	struct sk_buff *skb;
 	struct netlink_callback *cb;
 	struct net *net;
+	unsigned int rtm_flags;
 };
 
 int rt6_dump_route(struct rt6_info *rt, void *p_arg);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index ef5485204522..429713819c9f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -308,16 +308,27 @@ static void __net_init fib6_tables_init(struct net *net)
 
 static int fib6_dump_node(struct fib6_walker *w)
 {
+	struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *)w->args;
 	int res;
 	struct rt6_info *rt;
 
 	for (rt = w->leaf; rt; rt = rt->dst.rt6_next) {
-		res = rt6_dump_route(rt, w->args);
+		res = rt6_dump_route(rt, arg);
 		if (res < 0) {
 			/* Frame is full, suspend walking */
 			w->leaf = rt;
 			return 1;
 		}
+
+		/* if multipath routes are dumped in one route with
+		 * the RTA_MULTIPATH attribute, then jump rt to point
+		 * to the last sibling of this route (no need to dump
+		 * the sibling routes again)
+		 */
+		if ((arg->rtm_flags & RTM_F_ALL_NEXTHOPS) && rt->rt6i_nsiblings)
+			rt = list_last_entry(&rt->rt6i_siblings,
+					     struct rt6_info,
+					     rt6i_siblings);
 	}
 	w->leaf = NULL;
 	return 0;
@@ -392,13 +403,16 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct net *net = sock_net(skb->sk);
+	struct nlattr *nlattr[RTA_MAX + 1];
 	unsigned int h, s_h;
 	unsigned int e = 0, s_e;
 	struct rt6_rtnl_dump_arg arg;
 	struct fib6_walker *w;
 	struct fib6_table *tb;
 	struct hlist_head *head;
+	unsigned int flags = 0;
 	int res = 0;
+	int hdrlen;
 
 	s_h = cb->args[0];
 	s_e = cb->args[1];
@@ -422,9 +436,22 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 		cb->args[2] = (long)w;
 	}
 
+	hdrlen = nlmsg_len(cb->nlh) < sizeof(struct ifinfomsg) ?
+		 sizeof(struct rtgenmsg) : sizeof(struct ifinfomsg);
+
+	if (nlmsg_parse(cb->nlh, hdrlen, nlattr, RTA_MAX, NULL) >= 0) {
+		/* existence of RTA_MULTIPATH attribute in dump
+		 * request means user wants multipath routes
+		 * returned using RTA_MULTIPATH attribute
+		 */
+		if (nlattr[RTA_MULTIPATH])
+			flags |= RTM_F_ALL_NEXTHOPS;
+	}
+
 	arg.skb = skb;
 	arg.cb = cb;
 	arg.net = net;
+	arg.rtm_flags = flags;
 	w->args = &arg;
 
 	rcu_read_lock();
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 9f6f28ecd065..041ff245ba03 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3187,11 +3187,66 @@ static inline size_t rt6_nlmsg_size(struct rt6_info *rt)
 	       + lwtunnel_get_encap_size(rt->dst.lwtstate);
 }
 
+static int rt6_nexthop_info(struct sk_buff *skb, struct rt6_info *rt,
+			    unsigned int *flags)
+{
+	if (!netif_carrier_ok(rt->dst.dev)) {
+		*flags |= RTNH_F_LINKDOWN;
+		if (rt->rt6i_idev->cnf.ignore_routes_with_linkdown)
+			*flags |= RTNH_F_DEAD;
+	}
+
+	if (rt->rt6i_flags & RTF_GATEWAY) {
+		if (nla_put_in6_addr(skb, RTA_GATEWAY, &rt->rt6i_gateway) < 0)
+			goto nla_put_failure;
+	}
+
+	if (rt->dst.dev &&
+	    nla_put_u32(skb, RTA_OIF, rt->dst.dev->ifindex))
+		goto nla_put_failure;
+
+	if (rt->dst.lwtstate &&
+	    lwtunnel_fill_encap(skb, rt->dst.lwtstate) < 0)
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
+static int rt6_add_nexthop(struct sk_buff *skb, struct rt6_info *rt)
+{
+	struct rtnexthop *rtnh;
+	unsigned int flags = 0;
+
+	rtnh = nla_reserve_nohdr(skb, sizeof(*rtnh));
+	if (!rtnh)
+		goto nla_put_failure;
+
+	rtnh->rtnh_hops = 0;
+	rtnh->rtnh_ifindex = rt->dst.dev ? rt->dst.dev->ifindex : 0;
+
+	if (rt6_nexthop_info(skb, rt, &flags) < 0)
+		goto nla_put_failure;
+
+	rtnh->rtnh_flags = flags;
+
+	/* length of rtnetlink header + attributes */
+	rtnh->rtnh_len = nlmsg_get_pos(skb) - (void *)rtnh;
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
 static int rt6_fill_node(struct net *net,
 			 struct sk_buff *skb, struct rt6_info *rt,
 			 struct in6_addr *dst, struct in6_addr *src,
 			 int iif, int type, u32 portid, u32 seq,
-			 int prefix, unsigned int flags)
+			 int prefix, unsigned int nlm_flags,
+			 unsigned int rtm_flags)
 {
 	u32 metrics[RTAX_MAX];
 	struct rtmsg *rtm;
@@ -3206,7 +3261,7 @@ static int rt6_fill_node(struct net *net,
 		}
 	}
 
-	nlh = nlmsg_put(skb, portid, seq, type, sizeof(*rtm), flags);
+	nlh = nlmsg_put(skb, portid, seq, type, sizeof(*rtm), nlm_flags);
 	if (!nlh)
 		return -EMSGSIZE;
 
@@ -3245,11 +3300,6 @@ static int rt6_fill_node(struct net *net,
 	else
 		rtm->rtm_type = RTN_UNICAST;
 	rtm->rtm_flags = 0;
-	if (!netif_carrier_ok(rt->dst.dev)) {
-		rtm->rtm_flags |= RTNH_F_LINKDOWN;
-		if (rt->rt6i_idev->cnf.ignore_routes_with_linkdown)
-			rtm->rtm_flags |= RTNH_F_DEAD;
-	}
 	rtm->rtm_scope = RT_SCOPE_UNIVERSE;
 	rtm->rtm_protocol = rt->rt6i_protocol;
 	if (rt->rt6i_flags & RTF_DYNAMIC)
@@ -3313,17 +3363,36 @@ static int rt6_fill_node(struct net *net,
 	if (rtnetlink_put_metrics(skb, metrics) < 0)
 		goto nla_put_failure;
 
-	if (rt->rt6i_flags & RTF_GATEWAY) {
-		if (nla_put_in6_addr(skb, RTA_GATEWAY, &rt->rt6i_gateway) < 0)
-			goto nla_put_failure;
-	}
-
-	if (rt->dst.dev &&
-	    nla_put_u32(skb, RTA_OIF, rt->dst.dev->ifindex))
-		goto nla_put_failure;
 	if (nla_put_u32(skb, RTA_PRIORITY, rt->rt6i_metric))
 		goto nla_put_failure;
 
+	/* if user wants nexthops included via the RTA_MULTIPATH
+	 * attribute, then walk the siblings list and add each
+	 * as a nexthop
+	 */
+	if ((rtm_flags & RTM_F_ALL_NEXTHOPS) && rt->rt6i_nsiblings) {
+		struct rt6_info *sibling, *next_sibling;
+		struct nlattr *mp;
+
+		mp = nla_nest_start(skb, RTA_MULTIPATH);
+		if (!mp)
+			goto nla_put_failure;
+
+		if (rt6_add_nexthop(skb, rt) < 0)
+			goto nla_put_failure;
+
+		list_for_each_entry_safe(sibling, next_sibling,
+					 &rt->rt6i_siblings, rt6i_siblings) {
+			if (rt6_add_nexthop(skb, sibling) < 0)
+				goto nla_put_failure;
+		}
+
+		nla_nest_end(skb, mp);
+	} else {
+		if (rt6_nexthop_info(skb, rt, &rtm->rtm_flags) < 0)
+			goto nla_put_failure;
+	}
+
 	expires = (rt->rt6i_flags & RTF_EXPIRES) ? rt->dst.expires - jiffies : 0;
 
 	if (rtnl_put_cacheinfo(skb, &rt->dst, 0, expires, rt->dst.error) < 0)
@@ -3332,8 +3401,6 @@ static int rt6_fill_node(struct net *net,
 	if (nla_put_u8(skb, RTA_PREF, IPV6_EXTRACT_PREF(rt->rt6i_flags)))
 		goto nla_put_failure;
 
-	lwtunnel_fill_encap(skb, rt->dst.lwtstate);
-
 	nlmsg_end(skb, nlh);
 	return 0;
 
@@ -3356,7 +3423,7 @@ int rt6_dump_route(struct rt6_info *rt, void *p_arg)
 	return rt6_fill_node(arg->net,
 		     arg->skb, rt, NULL, NULL, 0, RTM_NEWROUTE,
 		     NETLINK_CB(arg->cb->skb).portid, arg->cb->nlh->nlmsg_seq,
-		     prefix, NLM_F_MULTI);
+		     prefix, NLM_F_MULTI, arg->rtm_flags);
 }
 
 static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
@@ -3447,7 +3514,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 
 	err = rt6_fill_node(net, skb, rt, &fl6.daddr, &fl6.saddr, iif,
 			    RTM_NEWROUTE, NETLINK_CB(in_skb).portid,
-			    nlh->nlmsg_seq, 0, 0);
+			    nlh->nlmsg_seq, 0, 0, 0);
 	if (err < 0) {
 		kfree_skb(skb);
 		goto errout;
@@ -3474,7 +3541,7 @@ void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info,
 		goto errout;
 
 	err = rt6_fill_node(net, skb, rt, NULL, NULL, 0,
-				event, info->portid, seq, 0, nlm_flags);
+			    event, info->portid, seq, 0, nlm_flags, 0);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in rt6_nlmsg_size() */
 		WARN_ON(err == -EMSGSIZE);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route
  2017-01-15 20:07 ` [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route David Ahern
@ 2017-01-16 15:48   ` Roopa Prabhu
  2017-01-16 15:58     ` David Ahern
  2017-01-17  0:51   ` David Miller
  1 sibling, 1 reply; 11+ messages in thread
From: Roopa Prabhu @ 2017-01-16 15:48 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, ddutt

On 1/15/17, 12:07 PM, David Ahern wrote:
> IPv4 allows multipath routes to be deleted using just the prefix and
> length. For example:
>     $ ip ro ls vrf red
>     unreachable default metric 8192
>     1.1.1.0/24
>         nexthop via 10.100.1.254  dev eth1 weight 1
>         nexthop via 10.11.200.2  dev eth11.200 weight 1
>     10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
>     10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3
>
>     $ ip ro del 1.1.1.0/24 vrf red
>
>     $ ip ro ls vrf red
>     unreachable default metric 8192
>     10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
>     10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3
>
> The same notation does not work with IPv6 because of how multipath routes
> are implemented for IPv6. For IPv6 only the first nexthop of a multipath
> route is deleted if the request contains only a prefix and length. This
> leads to unnecessary complexity in userspace dealing with IPv6 multipath
> routes.
>
> This patch allows all nexthops to be deleted without specifying each one
> in the delete request by passing a new flag, RTM_F_ALL_NEXTHOPS, in
> rtm_flags. Internally, this is done by walking the sibling list of the
> route matching the specifications given (prefix, length, metric, protocol,
> etc).
>
> With this patch (and an updated iproute2 command):
>     $  ip -6 ro ls vrf red
>     2001:db8::/120 via 2001:db8:1::62 dev eth1 metric 256  pref medium
>     2001:db8::/120 via 2001:db8:1::61 dev eth1 metric 256  pref medium
>     2001:db8::/120 via 2001:db8:1::60 dev eth1 metric 256  pref medium
>     2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
>     ...
>
>     $ ip -6 ro del vrf red 1111::1/120
>     $ ip -6 ro ls vrf red
>     2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
>     ...
>
> The flag is added to fib6_config by converting fc_type to a u16 (as
> noted fc_type only uses 8 bits). The new u16 hole is a bitmap with
> fc_delete_all_nexthop as the first bit.
>
> Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---
> v2
> - switched example to rfc 3849 documentation address per request
> - changed delete loop to explicitly look at siblings list for
>   first route matching specs given (metric, protocol, etc)
>
>  include/net/ip6_fib.h          |  4 +++-
>  include/uapi/linux/rtnetlink.h |  1 +
>  net/ipv6/route.c               | 28 +++++++++++++++++++++++++---
>  3 files changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index a74e2aa40ef4..11ab99e87c5f 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -37,7 +37,9 @@ struct fib6_config {
>  	int		fc_ifindex;
>  	u32		fc_flags;
>  	u32		fc_protocol;
> -	u32		fc_type;	/* only 8 bits are used */
> +	u16		fc_type;	/* only 8 bits are used */
> +	u16		fc_delete_all_nexthop : 1,
> +			__unused : 15;
>  
>  	struct in6_addr	fc_dst;
>  	struct in6_addr	fc_src;
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 8c93ad1ef9ab..7fb206bc42f9 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -276,6 +276,7 @@ enum rt_scope_t {
>  #define RTM_F_EQUALIZE		0x400	/* Multipath equalizer: NI	*/
>  #define RTM_F_PREFIX		0x800	/* Prefix addresses		*/
>  #define RTM_F_LOOKUP_TABLE	0x1000	/* set rtm_table to FIB lookup result */
> +#define RTM_F_ALL_NEXTHOPS	0x2000	/* delete all nexthops (IPv6) */
>  
Do we really need the flag ?. It seems like delete with just prefix should delete all the routes in a multipath
route by default... (understand that you have it there to preserve existing behavior...for people who maybe relying on it. But this seems more like a bug fix. route replace went through a few such bug fixes "ipv6: fix ECMP route replacement"). ok with either approach.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route
  2017-01-16 15:48   ` Roopa Prabhu
@ 2017-01-16 15:58     ` David Ahern
  0 siblings, 0 replies; 11+ messages in thread
From: David Ahern @ 2017-01-16 15:58 UTC (permalink / raw)
  To: Roopa Prabhu; +Cc: netdev, ddutt

On 1/16/17 8:48 AM, Roopa Prabhu wrote:
> Do we really need the flag ?. It seems like delete with just prefix should delete all the routes in a multipath
> route by default... (understand that you have it there to preserve existing behavior...for people who maybe relying on it. But this seems more like a bug fix. route replace went through a few such bug fixes "ipv6: fix ECMP route replacement"). ok with either approach.

I'm fine dropping the flag for this patch and just deleting the entire route. The flag only exists because this patch changes behavior visible to the user.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next 3/3] net: ipv6: Add option to dump multipath routes via RTA_MULTIPATH attribute
  2017-01-15 20:07 ` [PATCH net-next 3/3] net: ipv6: Add option to dump multipath routes via RTA_MULTIPATH attribute David Ahern
@ 2017-01-16 17:40   ` David Ahern
  0 siblings, 0 replies; 11+ messages in thread
From: David Ahern @ 2017-01-16 17:40 UTC (permalink / raw)
  To: netdev; +Cc: ddutt, roopa

upon further review ...

On 1/15/17 1:07 PM, David Ahern wrote:
> To maintain backwards compatibility, a user has to request the change
> in behavior. Unfortunately, adding a flag to the header similar to a
> previous patch does not work here as the netlink header for route dumps
> can be either rtgenmsg or ifinfomsg. sysctl is the other commonly used
> option for backwards compatibility but it is overly abused and is not
> really appropriate for this situation since each request should be able
> to specify the preference for the RTA_MULTIPATH attribute (e.g, one
> command may understand RTA_MULTIPATH for IPv6 while another may not).

The ancillary header is supposed to be struct rtmsg, so the change in behavior can be done by requiring the rtmsg struct and then passing the RTM_F_ALL_NEXTHOPS flag in rtm_flags. This makes the user api for delete requests and dumps consistent.

...

> @@ -422,9 +436,22 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
>  		cb->args[2] = (long)w;
>  	}
>  
> +	hdrlen = nlmsg_len(cb->nlh) < sizeof(struct ifinfomsg) ?
> +		 sizeof(struct rtgenmsg) : sizeof(struct ifinfomsg);
> +
> +	if (nlmsg_parse(cb->nlh, hdrlen, nlattr, RTA_MAX, NULL) >= 0) {
> +		/* existence of RTA_MULTIPATH attribute in dump
> +		 * request means user wants multipath routes
> +		 * returned using RTA_MULTIPATH attribute
> +		 */
> +		if (nlattr[RTA_MULTIPATH])
> +			flags |= RTM_F_ALL_NEXTHOPS;
> +	}
> +

And the above becomes:

+       /* sadly, the size of the ancillary header can vary: the correct
+        * header is struct rtmsg as passed by libnl. iproute2 sends
+        * ifinfomsg [+ filter]. Technically, passing only rtgenmsg is
+        * sufficient since the header has not been parsed before. Luckily,
+        * the struct sizes sufficiently vary to detect the permutations.
+        * For the dump request to contain flags require rtmsg header.
+        */
+       if (nlmsg_len(cb->nlh) == sizeof(struct rtmsg)) {
+               struct rtmsg *rtm = nlmsg_data(cb->nlh);
+
+               flags = rtm->rtm_flags;
+       }

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route
  2017-01-15 20:07 ` [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route David Ahern
  2017-01-16 15:48   ` Roopa Prabhu
@ 2017-01-17  0:51   ` David Miller
  2017-01-17  1:27     ` David Ahern
  1 sibling, 1 reply; 11+ messages in thread
From: David Miller @ 2017-01-17  0:51 UTC (permalink / raw)
  To: dsa; +Cc: netdev, ddutt

From: David Ahern <dsa@cumulusnetworks.com>
Date: Sun, 15 Jan 2017 12:07:04 -0800

> @@ -2143,6 +2143,26 @@ int ip6_del_rt(struct rt6_info *rt)
>  	return __ip6_del_rt(rt, &info);
>  }
>  
> +/* called with table lock held */
 ...
> @@ -2176,10 +2196,9 @@ static int ip6_route_del(struct fib6_config *cfg)
>  				continue;
>  			if (cfg->fc_protocol && cfg->fc_protocol != rt->rt6i_protocol)
>  				continue;
> -			dst_hold(&rt->dst);
> -			read_unlock_bh(&table->tb6_lock);
>  
> -			return __ip6_del_rt(rt, &cfg->fc_nlinfo);
> +			err = __ip6_route_del(rt, cfg);
> +			break;
>  		}

fib6_del() (invoked by __ip6_route_del()) has to be invoked with the
table lock held a sa writer, but here you are only holding it as a
reader.

I also think some more thought has to be put into whether we can
change behavior like this without using a flag, as suggested by Roopa.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route
  2017-01-17  0:51   ` David Miller
@ 2017-01-17  1:27     ` David Ahern
  2017-01-17  1:37       ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: David Ahern @ 2017-01-17  1:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, ddutt

On 1/16/17 5:51 PM, David Miller wrote:
> From: David Ahern <dsa@cumulusnetworks.com>
> Date: Sun, 15 Jan 2017 12:07:04 -0800
> 
>> @@ -2143,6 +2143,26 @@ int ip6_del_rt(struct rt6_info *rt)
>>  	return __ip6_del_rt(rt, &info);
>>  }
>>  
>> +/* called with table lock held */
>  ...
>> @@ -2176,10 +2196,9 @@ static int ip6_route_del(struct fib6_config *cfg)
>>  				continue;
>>  			if (cfg->fc_protocol && cfg->fc_protocol != rt->rt6i_protocol)
>>  				continue;
>> -			dst_hold(&rt->dst);
>> -			read_unlock_bh(&table->tb6_lock);
>>  
>> -			return __ip6_del_rt(rt, &cfg->fc_nlinfo);
>> +			err = __ip6_route_del(rt, cfg);
>> +			break;
>>  		}
> 
> fib6_del() (invoked by __ip6_route_del()) has to be invoked with the
> table lock held a sa writer, but here you are only holding it as a
> reader.

That table lock is still held. If you look up 2 lines I remove the line that releases the lock.

> 
> I also think some more thought has to be put into whether we can
> change behavior like this without using a flag, as suggested by Roopa.
> 

I'm all for changing the behavior by default, but AIUI that breaks the golden rule.

libnl as of libnl3_2_16-16-g29b71371e764  (a patch by Roopa) already handles both flavors -- if RTA_MULTIPATH is sent it handles it fine and if a series of routes are sent they are consolidated to 1 object with next hops.

iproute2 can handle it just fine as well. I made no changes to get the display format consistent. The problem here is scripts that rely on the old format.

quagga does *not* correctly handle the RTA_MULTIPATH attribute for IPv6 but then it does not properly handle IPv6 multipath routes it receives from the kernel at all. Same result with current kernel code or with this patch set - only the last nexthop is retained. (quagga routes pushed to kernel works, but kernel routes fed to quagga does not)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route
  2017-01-17  1:27     ` David Ahern
@ 2017-01-17  1:37       ` David Miller
  2017-01-17  1:38         ` David Ahern
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2017-01-17  1:37 UTC (permalink / raw)
  To: dsa; +Cc: netdev, ddutt

From: David Ahern <dsa@cumulusnetworks.com>
Date: Mon, 16 Jan 2017 18:27:36 -0700

> On 1/16/17 5:51 PM, David Miller wrote:
>> From: David Ahern <dsa@cumulusnetworks.com>
>> Date: Sun, 15 Jan 2017 12:07:04 -0800
>> 
>>> @@ -2143,6 +2143,26 @@ int ip6_del_rt(struct rt6_info *rt)
>>>  	return __ip6_del_rt(rt, &info);
>>>  }
>>>  
>>> +/* called with table lock held */
>>  ...
>>> @@ -2176,10 +2196,9 @@ static int ip6_route_del(struct fib6_config *cfg)
>>>  				continue;
>>>  			if (cfg->fc_protocol && cfg->fc_protocol != rt->rt6i_protocol)
>>>  				continue;
>>> -			dst_hold(&rt->dst);
>>> -			read_unlock_bh(&table->tb6_lock);
>>>  
>>> -			return __ip6_del_rt(rt, &cfg->fc_nlinfo);
>>> +			err = __ip6_route_del(rt, cfg);
>>> +			break;
>>>  		}
>> 
>> fib6_del() (invoked by __ip6_route_del()) has to be invoked with the
>> table lock held a sa writer, but here you are only holding it as a
>> reader.
> 
> That table lock is still held. If you look up 2 lines I remove the line that releases the lock.

It's held in this function as a reader, it needs to be held as a writer.

That's why the lock is dropped in the current code and the existing
wrapper around fib6_del() takes it as a writer.

Is it clear now?

	read_lock_bh(&table->lock);
	fib6_del();

is invalid.

	write_lock_bh(&table->lock);
	fib6_del();

is required.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route
  2017-01-17  1:37       ` David Miller
@ 2017-01-17  1:38         ` David Ahern
  0 siblings, 0 replies; 11+ messages in thread
From: David Ahern @ 2017-01-17  1:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, ddutt

On 1/16/17 6:37 PM, David Miller wrote:
> Is it clear now?

yes. time for a trip to the eye doctor

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-01-17  1:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-15 20:07 [PATCH net-next 0/3] net: ipv6: Improve user experience with multipath routes David Ahern
2017-01-15 20:07 ` [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route David Ahern
2017-01-16 15:48   ` Roopa Prabhu
2017-01-16 15:58     ` David Ahern
2017-01-17  0:51   ` David Miller
2017-01-17  1:27     ` David Ahern
2017-01-17  1:37       ` David Miller
2017-01-17  1:38         ` David Ahern
2017-01-15 20:07 ` [PATCH net-next 2/3] net: ipv6: remove nowait arg to rt6_fill_node David Ahern
2017-01-15 20:07 ` [PATCH net-next 3/3] net: ipv6: Add option to dump multipath routes via RTA_MULTIPATH attribute David Ahern
2017-01-16 17:40   ` David Ahern

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.