* [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status
@ 2015-06-11  2:37 Andy Gospodarek
  2015-06-11  2:37 ` [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops Andy Gospodarek
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11  2:37 UTC
  To: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen
  Cc: Andy Gospodarek

This series adds the ability to have the Linux kernel track whether or
not a particular route should be used based on the link-status of the
interface associated with the next-hop.

Before this patch any link-failure on an interface that was serving as a
gateway for some systems could result in those systems being isolated
from the rest of the network as the stack would continue to attempt to
send frames out of an interface that is actually linked-down.  When the
kernel is responsible for all forwarding, it should also be responsible
for taking action when the traffic can no longer be forwarded -- there
is no real need to outsource link-monitoring to userspace anymore.

This feature is only enabled with the new per-interface or ipv4 global
sysctls called 'ignore_routes_with_linkdown'.

net.ipv4.conf.all.ignore_routes_with_linkdown = 0
net.ipv4.conf.default.ignore_routes_with_linkdown = 0
net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
...

When the above sysctls are set, the kernel will not only report to
userspace that the link is down, but it will also report to userspace
that a route is dead.  This will signal to userspace that the route will
not be selected.
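
As a rough illustration of how the sysctl names above map onto procfs, here
is a minimal userspace sketch.  The helper names linkdown_sysctl_path() and
set_ignore_routes_with_linkdown() are hypothetical and are not part of this
series; the path simply follows the usual net.ipv4.conf.<ifname>.<name>
layout under /proc/sys.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the procfs path backing
 * net.ipv4.conf.<ifname>.ignore_routes_with_linkdown.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int linkdown_sysctl_path(const char *ifname, char *buf, size_t len)
{
	int n = snprintf(buf, len,
			 "/proc/sys/net/ipv4/conf/%s/ignore_routes_with_linkdown",
			 ifname);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write 0 or 1 to the sysctl.  This needs root and
 * a kernel that actually provides the sysctl; otherwise fopen() fails
 * and -1 is returned.
 */
int set_ignore_routes_with_linkdown(const char *ifname, int enable)
{
	char path[128];
	FILE *f;

	if (linkdown_sysctl_path(ifname, path, sizeof(path)))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	fprintf(f, "%d\n", enable ? 1 : 0);
	return fclose(f) ? -1 : 0;
}
```

Calling set_ignore_routes_with_linkdown("p8p1", 1) would be the per-interface
equivalent of "sysctl -w net.ipv4.conf.p8p1.ignore_routes_with_linkdown=1".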

With the new sysctls set, the following behavior can be observed
(interface p8p1 is link-down):

# ip route show 
default via 10.0.5.2 dev p9p1 
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
# ip route get 90.0.0.1 
90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
    cache 
# ip route get 80.0.0.1 
local 80.0.0.1 dev lo  src 80.0.0.1 
    cache <local> 
# ip route get 80.0.0.2
80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
    cache 

While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down.  Now interface p8p1 is linked-up:

# ip route show 
default via 10.0.5.2 dev p9p1 
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
# ip route get 90.0.0.1 
90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
    cache 
# ip route get 80.0.0.1 
local 80.0.0.1 dev lo  src 80.0.0.1 
    cache <local> 
# ip route get 80.0.0.2
80.0.0.2 dev p8p1  src 80.0.0.1 
    cache 

and the output changes to what one would expect.

If the global or interface sysctl is not set, the following output would be
expected when p8p1 is down:

# ip route show 
default via 10.0.5.2 dev p9p1 
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 

If the dead flag does not appear, there should be no expectation that the
kernel would skip using this route due to the link being down.
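
The three outputs above can be summarized with a small userspace model.
This is only a sketch under simplifying assumptions, not the kernel code:
struct model_route and model_select() are made-up names, the sysctl is
modelled as a per-route field, and "lowest metric wins" stands in for the
real fib walk.

```c
#include <stddef.h>

#define MODEL_RTNH_F_DEAD	1	/* same value as RTNH_F_DEAD */
#define MODEL_RTNH_F_LINKDOWN	16	/* value this series assigns to RTNH_F_LINKDOWN */

/* Hypothetical, simplified view of one route with a single nexthop. */
struct model_route {
	const char *dev;		/* outgoing interface name */
	unsigned int metric;		/* lower is preferred */
	unsigned int nh_flags;		/* MODEL_RTNH_F_* bits */
	int ignore_linkdown;		/* sysctl value for the nexthop device */
};

/*
 * Return the usable route with the lowest metric, or NULL if none is
 * usable.  A nexthop is skipped when it is dead, or when it is linkdown
 * and the sysctl for its device is set.
 */
const struct model_route *model_select(const struct model_route *r, size_t n)
{
	const struct model_route *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (r[i].nh_flags & MODEL_RTNH_F_DEAD)
			continue;
		if (r[i].ignore_linkdown &&
		    (r[i].nh_flags & MODEL_RTNH_F_LINKDOWN))
			continue;
		if (!best || r[i].metric < best->metric)
			best = &r[i];
	}
	return best;
}
```

With the two 90.0.0.0/24 routes from the example and p8p1 link-down, the
model picks p7p1 when the sysctl is set and p8p1 when it is not, matching
the "ip route get 90.0.0.1" results shown earlier.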

v2: Split kernel changes into 2 patches: first to add linkdown flag and
second to add new sysctl settings.  Also took suggestion from Alex to
simplify code by only checking sysctl during fib lookup and suggestion
from Scott to add a per-interface sysctl.  Added iproute2 patch to
recognize and print linkdown flag.

v3: Code cleanups along with reverse-path checks suggested by Alex and
small fixes related to problems found when multipath was disabled.

Although some preferred not to have a configuration option and to make
this behavior the default when it was discussed in Ottawa earlier this
year, since "it was time to do this," I wanted to propose the config
option to preserve the current behavior for those that desire it.  I'll
happily remove it if Dave and Linus approve.

An IPv6 implementation is also needed (DECnet too!), but I wanted to start with
the IPv4 implementation to get people comfortable with the idea before moving
forward.  If this is accepted the IPv6 implementation can be posted shortly.

There was also a request for switchdev support for this, but that will be
posted as a followup as switchdev does not currently handle dead
next-hops in a multi-path case and I felt that infra needed to be added
first.

FWIW, we have been running the original version of this series with a
global sysctl and our customers have been happily using a backported
version for IPv4 and IPv6 for >6 months.

Andy Gospodarek (3):
  net: track link-status of ipv4 nexthops
  net: ipv4 sysctl option to ignore routes when nexthop link is down
  iproute2: add support to print 'linkdown' nexthop flag

 include/linux/inetdevice.h        |   3 +
 include/net/fib_rules.h           |   3 +-
 include/net/ip_fib.h              |  21 ++++---
 include/uapi/linux/ip.h           |   1 +
 include/uapi/linux/rtnetlink.h    |   3 +
 include/uapi/linux/sysctl.h       |   1 +
 kernel/sysctl_binary.c            |   1 +
 net/ipv4/devinet.c                |   2 +
 net/ipv4/fib_frontend.c           |  28 +++++----
 net/ipv4/fib_lookup.h             |   2 +-
 net/ipv4/fib_rules.c              |   5 +-
 net/ipv4/fib_semantics.c          | 123 ++++++++++++++++++++++++++++++++------
 net/ipv4/fib_trie.c               |  11 +++-
 net/ipv4/netfilter/ipt_rpfilter.c |   2 +-
 net/ipv4/route.c                  |  10 ++--
 ip/iproute.c                      |   4 ++
 15 files changed, 170 insertions(+), 50 deletions(-)

-- 
1.9.3


* [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11  2:37 [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status Andy Gospodarek
@ 2015-06-11  2:37 ` Andy Gospodarek
  2015-06-11  2:53   ` Scott Feldman
  2015-06-11  6:07   ` Scott Feldman
  2015-06-11  2:37 ` [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down Andy Gospodarek
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11  2:37 UTC
  To: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen
  Cc: Andy Gospodarek

Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
reachable via an interface where carrier is off.  No action is taken,
but additional flags are passed to userspace to indicate carrier status.

This also includes a cleanup of fib_disable_ip: the function now takes
the netdev event that triggered the call, replacing the more cryptic
force option used previously.
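
For readers who want the flag transitions at a glance, the following
userspace sketch models them.  It is an illustration only: the enum and
function names are hypothetical, and the device-matching, scope and
multipath bookkeeping done by fib_sync_down_dev and fib_sync_up are left
out.

```c
#define MODEL_RTNH_F_DEAD	1	/* same value as RTNH_F_DEAD */
#define MODEL_RTNH_F_LINKDOWN	16	/* value of the new RTNH_F_LINKDOWN */

/* Hypothetical stand-ins for the netdev notifier events involved. */
enum model_event {
	MODEL_NETDEV_DOWN,
	MODEL_NETDEV_UNREGISTER,
	MODEL_NETDEV_CHANGE,
	MODEL_NETDEV_OTHER,
};

/*
 * Flags a matching nexthop ends up with after a "down" style event:
 * admin-down and unregister mark it dead and linkdown, a carrier
 * change marks it linkdown only.
 */
unsigned int model_flags_on_down(unsigned int nh_flags, enum model_event ev)
{
	switch (ev) {
	case MODEL_NETDEV_DOWN:
	case MODEL_NETDEV_UNREGISTER:
		nh_flags |= MODEL_RTNH_F_DEAD;
		/* fall through */
	case MODEL_NETDEV_CHANGE:
		nh_flags |= MODEL_RTNH_F_LINKDOWN;
		break;
	default:
		break;
	}
	return nh_flags;
}

/*
 * Flags after an "up" style event.  The caller names the bits to
 * clear: MODEL_RTNH_F_DEAD when the device comes up, or
 * MODEL_RTNH_F_LINKDOWN when carrier returns.
 */
unsigned int model_flags_on_up(unsigned int nh_flags, unsigned int clear)
{
	return nh_flags & ~clear;
}
```

Note that clearing the dead bit alone leaves the linkdown bit untouched in
this model; it is only cleared by the carrier-up path.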

v2: Split out kernel functionality into 2 patches; this patch simply sets and
clears the new nexthop flag RTNH_F_LINKDOWN.

v3: Cleanups suggested by Alex as well as a bug noticed in
fib_sync_down_dev and fib_sync_up when multipath was not enabled.

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
---
 include/net/ip_fib.h           |  4 +--
 include/uapi/linux/rtnetlink.h |  3 +++
 net/ipv4/fib_frontend.c        | 22 ++++++++++------
 net/ipv4/fib_semantics.c       | 59 ++++++++++++++++++++++++++++++++----------
 4 files changed, 65 insertions(+), 23 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 54271ed..f73d27c 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -305,9 +305,9 @@ void fib_flush_external(struct net *net);
 
 /* Exported by fib_semantics.c */
 int ip_fib_check_default(__be32 gw, struct net_device *dev);
-int fib_sync_down_dev(struct net_device *dev, int force);
+int fib_sync_down_dev(struct net_device *dev, unsigned long event);
 int fib_sync_down_addr(struct net *net, __be32 local);
-int fib_sync_up(struct net_device *dev);
+int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
 void fib_select_multipath(struct fib_result *res);
 
 /* Exported by fib_trie.c */
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 17fb02f..8ab874a 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -338,6 +338,9 @@ struct rtnexthop {
 #define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
 #define RTNH_F_OFFLOAD		8	/* offloaded route */
+#define RTNH_F_LINKDOWN		16	/* carrier-down on nexthop */
+
+#define RTNH_F_COMPARE_MASK	(RTNH_F_DEAD | RTNH_F_LINKDOWN) /* used as mask for route comparisons */
 
 /* Macros to handle hexthops */
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 872494e..872defb 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1063,9 +1063,9 @@ static void nl_fib_lookup_exit(struct net *net)
 	net->ipv4.fibnl = NULL;
 }
 
-static void fib_disable_ip(struct net_device *dev, int force)
+static void fib_disable_ip(struct net_device *dev, unsigned long event)
 {
-	if (fib_sync_down_dev(dev, force))
+	if (fib_sync_down_dev(dev, event))
 		fib_flush(dev_net(dev));
 	rt_cache_flush(dev_net(dev));
 	arp_ifdown(dev);
@@ -1081,7 +1081,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 	case NETDEV_UP:
 		fib_add_ifaddr(ifa);
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-		fib_sync_up(dev);
+		fib_sync_up(dev, RTNH_F_DEAD);
 #endif
 		atomic_inc(&net->ipv4.dev_addr_genid);
 		rt_cache_flush(dev_net(dev));
@@ -1093,7 +1093,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 			/* Last address was deleted from this interface.
 			 * Disable IP.
 			 */
-			fib_disable_ip(dev, 1);
+			fib_disable_ip(dev, event);
 		} else {
 			rt_cache_flush(dev_net(dev));
 		}
@@ -1107,9 +1107,10 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct in_device *in_dev;
 	struct net *net = dev_net(dev);
+	unsigned flags;
 
 	if (event == NETDEV_UNREGISTER) {
-		fib_disable_ip(dev, 2);
+		fib_disable_ip(dev, event);
 		rt_flush_dev(dev);
 		return NOTIFY_DONE;
 	}
@@ -1124,16 +1125,21 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 			fib_add_ifaddr(ifa);
 		} endfor_ifa(in_dev);
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-		fib_sync_up(dev);
+		fib_sync_up(dev, RTNH_F_DEAD);
 #endif
 		atomic_inc(&net->ipv4.dev_addr_genid);
 		rt_cache_flush(net);
 		break;
 	case NETDEV_DOWN:
-		fib_disable_ip(dev, 0);
+		fib_disable_ip(dev, event);
 		break;
-	case NETDEV_CHANGEMTU:
 	case NETDEV_CHANGE:
+		flags = dev_get_flags(dev);
+		if (flags & (IFF_RUNNING|IFF_LOWER_UP))
+			fib_sync_up(dev, RTNH_F_LINKDOWN);
+		else
+			fib_sync_down_dev(dev, event);
+	case NETDEV_CHANGEMTU:
 		rt_cache_flush(net);
 		break;
 	}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 28ec3c1..496507f 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -266,7 +266,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		    nh->nh_tclassid != onh->nh_tclassid ||
 #endif
-		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
+		    ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_COMPARE_MASK)))
 			return -1;
 		onh++;
 	} endfor_nexthops(fi);
@@ -318,7 +318,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
 		    nfi->fib_type == fi->fib_type &&
 		    memcmp(nfi->fib_metrics, fi->fib_metrics,
 			   sizeof(u32) * RTAX_MAX) == 0 &&
-		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
+		    ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_COMPARE_MASK)) == 0 &&
 		    (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0))
 			return fi;
 	}
@@ -604,6 +604,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 				return -ENODEV;
 			if (!(dev->flags & IFF_UP))
 				return -ENETDOWN;
+			if (!netif_carrier_ok(dev))
+				nh->nh_flags |= RTNH_F_LINKDOWN;
 			nh->nh_dev = dev;
 			dev_hold(dev);
 			nh->nh_scope = RT_SCOPE_LINK;
@@ -636,6 +638,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		if (!dev)
 			goto out;
 		dev_hold(dev);
+		if (!netif_carrier_ok(dev))
+			nh->nh_flags |= RTNH_F_LINKDOWN;
 		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
 	} else {
 		struct in_device *in_dev;
@@ -654,6 +658,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		nh->nh_dev = in_dev->dev;
 		dev_hold(nh->nh_dev);
 		nh->nh_scope = RT_SCOPE_HOST;
+		if (!netif_carrier_ok(nh->nh_dev))
+			nh->nh_flags |= RTNH_F_LINKDOWN;
 		err = 0;
 	}
 out:
@@ -920,11 +926,17 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 		if (!nh->nh_dev)
 			goto failure;
 	} else {
+		int linkdown = 0;
 		change_nexthops(fi) {
 			err = fib_check_nh(cfg, fi, nexthop_nh);
 			if (err != 0)
 				goto failure;
+			if (nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
+				linkdown++;
 		} endfor_nexthops(fi)
+		if (linkdown == fi->fib_nhs) {
+			fi->fib_flags |= RTNH_F_LINKDOWN;
+		}
 	}
 
 	if (fi->fib_prefsrc) {
@@ -1103,7 +1115,7 @@ int fib_sync_down_addr(struct net *net, __be32 local)
 	return ret;
 }
 
-int fib_sync_down_dev(struct net_device *dev, int force)
+int fib_sync_down_dev(struct net_device *dev, unsigned long event)
 {
 	int ret = 0;
 	int scope = RT_SCOPE_NOWHERE;
@@ -1112,7 +1124,8 @@ int fib_sync_down_dev(struct net_device *dev, int force)
 	struct hlist_head *head = &fib_info_devhash[hash];
 	struct fib_nh *nh;
 
-	if (force)
+	if (event == NETDEV_UNREGISTER ||
+	    event == NETDEV_DOWN)
 		scope = -1;
 
 	hlist_for_each_entry(nh, head, nh_hash) {
@@ -1129,7 +1142,15 @@ int fib_sync_down_dev(struct net_device *dev, int force)
 				dead++;
 			else if (nexthop_nh->nh_dev == dev &&
 				 nexthop_nh->nh_scope != scope) {
-				nexthop_nh->nh_flags |= RTNH_F_DEAD;
+				switch (event) {
+				case NETDEV_DOWN:
+				case NETDEV_UNREGISTER:
+					nexthop_nh->nh_flags |= RTNH_F_DEAD;
+					/* fall through */
+				case NETDEV_CHANGE:
+					nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
+					break;
+				}
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 				spin_lock_bh(&fib_multipath_lock);
 				fi->fib_power -= nexthop_nh->nh_power;
@@ -1139,14 +1160,22 @@ int fib_sync_down_dev(struct net_device *dev, int force)
 				dead++;
 			}
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-			if (force > 1 && nexthop_nh->nh_dev == dev) {
+			if (event == NETDEV_UNREGISTER && nexthop_nh->nh_dev == dev) {
 				dead = fi->fib_nhs;
 				break;
 			}
 #endif
 		} endfor_nexthops(fi)
 		if (dead == fi->fib_nhs) {
-			fi->fib_flags |= RTNH_F_DEAD;
+			switch (event) {
+			case NETDEV_DOWN:
+			case NETDEV_UNREGISTER:
+				fi->fib_flags |= RTNH_F_DEAD;
+				/* fall through */
+			case NETDEV_CHANGE:
+				fi->fib_flags |= RTNH_F_LINKDOWN;
+				break;
+			}
 			ret++;
 		}
 	}
@@ -1210,13 +1239,11 @@ out:
 	return;
 }
 
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-
 /*
  * Dead device goes up. We wake up dead nexthops.
  * It takes sense only on multipath routes.
  */
-int fib_sync_up(struct net_device *dev)
+int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
 {
 	struct fib_info *prev_fi;
 	unsigned int hash;
@@ -1243,7 +1270,7 @@ int fib_sync_up(struct net_device *dev)
 		prev_fi = fi;
 		alive = 0;
 		change_nexthops(fi) {
-			if (!(nexthop_nh->nh_flags & RTNH_F_DEAD)) {
+			if (!(nexthop_nh->nh_flags & nh_flags)) {
 				alive++;
 				continue;
 			}
@@ -1254,14 +1281,18 @@ int fib_sync_up(struct net_device *dev)
 			    !__in_dev_get_rtnl(dev))
 				continue;
 			alive++;
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
 			spin_lock_bh(&fib_multipath_lock);
 			nexthop_nh->nh_power = 0;
-			nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
+			nexthop_nh->nh_flags &= ~nh_flags;
 			spin_unlock_bh(&fib_multipath_lock);
+#else
+			nexthop_nh->nh_flags &= ~nh_flags;
+#endif
 		} endfor_nexthops(fi)
 
 		if (alive > 0) {
-			fi->fib_flags &= ~RTNH_F_DEAD;
+			fi->fib_flags &= ~nh_flags;
 			ret++;
 		}
 	}
@@ -1269,6 +1300,8 @@ int fib_sync_up(struct net_device *dev)
 	return ret;
 }
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+
 /*
  * The algorithm is suboptimal, but it provides really
  * fair weighted route distribution.
-- 
1.9.3


* [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down
  2015-06-11  2:37 [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status Andy Gospodarek
  2015-06-11  2:37 ` [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops Andy Gospodarek
@ 2015-06-11  2:37 ` Andy Gospodarek
  2015-06-11  2:57   ` YOSHIFUJI Hideaki
  2015-06-11  2:37 ` [PATCH net-next 3/3 v3] iproute2: add support to print 'linkdown' nexthop flag Andy Gospodarek
  2015-06-11  3:07 ` [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status Scott Feldman
  3 siblings, 1 reply; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11  2:37 UTC
  To: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen
  Cc: Andy Gospodarek

This feature is only enabled with the new per-interface or ipv4 global
sysctls called 'ignore_routes_with_linkdown'.

net.ipv4.conf.all.ignore_routes_with_linkdown = 0
net.ipv4.conf.default.ignore_routes_with_linkdown = 0
net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
...

When the above sysctls are set, the kernel will report to userspace that
a route is dead and will no longer resolve to this nexthop when
performing a fib lookup.  This will signal to userspace that the route
will not be selected.  RTNH_F_DEAD is only signalled to userspace if the
sysctl is enabled and the link is down.  This was done because without it
netlink listeners would have no idea whether or not a nexthop would be
selected.  Internally, the kernel only sets RTNH_F_DEAD if the interface
has IFF_UP cleared.
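
The distinction between the flags stored in the kernel and the flags
reported over netlink can be sketched as follows.  This is a userspace
model with hypothetical names, loosely following the multipath branch of
fib_dump_info in this patch; it is not the kernel code itself.

```c
#define MODEL_RTNH_F_DEAD	1	/* same value as RTNH_F_DEAD */
#define MODEL_RTNH_F_LINKDOWN	16	/* value of RTNH_F_LINKDOWN */

/*
 * Flags a netlink listener would see for one nexthop.  The stored
 * flags are passed through (low 8 bits), and the dead bit is added
 * only when the nexthop is linkdown and the sysctl for its device
 * is set.  The stored nh_flags value is not modified.
 */
unsigned int model_reported_flags(unsigned int nh_flags, int ignore_linkdown)
{
	unsigned int out = nh_flags & 0xFF;

	if ((nh_flags & MODEL_RTNH_F_LINKDOWN) && ignore_linkdown)
		out |= MODEL_RTNH_F_DEAD;
	return out;
}
```

A linkdown nexthop is therefore shown as "dead linkdown" with the sysctl
set and as plain "linkdown" without it, which is the difference visible in
the outputs in this message.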

With the new sysctl set, the following behavior can be observed
(interface p8p1 is link-down):

# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
# ip route get 90.0.0.1
90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
    cache
# ip route get 80.0.0.1
local 80.0.0.1 dev lo  src 80.0.0.1
    cache <local>
# ip route get 80.0.0.2
80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
    cache

While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down.  Now interface p8p1 is linked-up:

# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
# ip route get 90.0.0.1
90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
    cache
# ip route get 80.0.0.1
local 80.0.0.1 dev lo  src 80.0.0.1
    cache <local>
# ip route get 80.0.0.2
80.0.0.2 dev p8p1  src 80.0.0.1
    cache

and the output changes to what one would expect.

If the sysctl is not set, the following output would be expected when
p8p1 is down:

# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2

Since the dead flag does not appear, there should be no expectation that
the kernel would skip using this route due to the link being down.

v2: Split kernel changes into 2 patches; this one actually makes a
behavioral change if the sysctl is set.  Also took suggestion from Alex
to simplify code by only checking sysctl during fib lookup and
suggestion from Scott to add a per-interface sysctl.

v3: Code clean-ups to make it more readable and efficient as well as a
reverse path check fix.

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
---
 include/linux/inetdevice.h        |  3 +++
 include/net/fib_rules.h           |  3 ++-
 include/net/ip_fib.h              | 16 +++++++++-------
 include/uapi/linux/ip.h           |  1 +
 include/uapi/linux/sysctl.h       |  1 +
 kernel/sysctl_binary.c            |  1 +
 net/ipv4/devinet.c                |  2 ++
 net/ipv4/fib_frontend.c           |  6 +++---
 net/ipv4/fib_rules.c              |  5 +++--
 net/ipv4/fib_semantics.c          | 29 ++++++++++++++++++++++++-----
 net/ipv4/fib_trie.c               |  7 +++++++
 net/ipv4/netfilter/ipt_rpfilter.c |  2 +-
 net/ipv4/route.c                  | 10 +++++-----
 13 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 0a21fbe..a4328ce 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -120,6 +120,9 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev)
 	 || (!IN_DEV_FORWARD(in_dev) && \
 	  IN_DEV_ORCONF((in_dev), ACCEPT_REDIRECTS)))
 
+#define IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) \
+	IN_DEV_CONF_GET((in_dev), IGNORE_ROUTES_WITH_LINKDOWN)
+
 #define IN_DEV_ARPFILTER(in_dev)	IN_DEV_ORCONF((in_dev), ARPFILTER)
 #define IN_DEV_ARP_ACCEPT(in_dev)	IN_DEV_ORCONF((in_dev), ARP_ACCEPT)
 #define IN_DEV_ARP_ANNOUNCE(in_dev)	IN_DEV_MAXCONF((in_dev), ARP_ANNOUNCE)
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 6d67383..903a55e 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -36,7 +36,8 @@ struct fib_lookup_arg {
 	void			*result;
 	struct fib_rule		*rule;
 	int			flags;
-#define FIB_LOOKUP_NOREF	1
+#define FIB_LOOKUP_NOREF		1
+#define FIB_LOOKUP_IGNORE_LINKSTATE	2
 };
 
 struct fib_rules_ops {
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index f73d27c..49c142b 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -226,7 +226,7 @@ static inline struct fib_table *fib_new_table(struct net *net, u32 id)
 }
 
 static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
-			     struct fib_result *res)
+			     struct fib_result *res, unsigned int flags)
 {
 	struct fib_table *tb;
 	int err = -ENETUNREACH;
@@ -234,7 +234,7 @@ static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
 	rcu_read_lock();
 
 	tb = fib_get_table(net, RT_TABLE_MAIN);
-	if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
+	if (tb && !fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF))
 		err = 0;
 
 	rcu_read_unlock();
@@ -249,16 +249,18 @@ void __net_exit fib4_rules_exit(struct net *net);
 struct fib_table *fib_new_table(struct net *net, u32 id);
 struct fib_table *fib_get_table(struct net *net, u32 id);
 
-int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res);
+int __fib_lookup(struct net *net, struct flowi4 *flp,
+		 struct fib_result *res, unsigned int flags);
 
 static inline int fib_lookup(struct net *net, struct flowi4 *flp,
-			     struct fib_result *res)
+			     struct fib_result *res, unsigned int flags)
 {
 	struct fib_table *tb;
 	int err;
 
+	flags |= FIB_LOOKUP_NOREF;
 	if (net->ipv4.fib_has_custom_rules)
-		return __fib_lookup(net, flp, res);
+		return __fib_lookup(net, flp, res, flags);
 
 	rcu_read_lock();
 
@@ -266,11 +268,11 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
 
 	for (err = 0; !err; err = -ENETUNREACH) {
 		tb = rcu_dereference_rtnl(net->ipv4.fib_main);
-		if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
+		if (tb && !fib_table_lookup(tb, flp, res, flags))
 			break;
 
 		tb = rcu_dereference_rtnl(net->ipv4.fib_default);
-		if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
+		if (tb && !fib_table_lookup(tb, flp, res, flags))
 			break;
 	}
 
diff --git a/include/uapi/linux/ip.h b/include/uapi/linux/ip.h
index 4119594..08f894d 100644
--- a/include/uapi/linux/ip.h
+++ b/include/uapi/linux/ip.h
@@ -164,6 +164,7 @@ enum
 	IPV4_DEVCONF_ROUTE_LOCALNET,
 	IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL,
 	IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL,
+	IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN,
 	__IPV4_DEVCONF_MAX
 };
 
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 0956373..62fda94 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -482,6 +482,7 @@ enum
 	NET_IPV4_CONF_PROMOTE_SECONDARIES=20,
 	NET_IPV4_CONF_ARP_ACCEPT=21,
 	NET_IPV4_CONF_ARP_NOTIFY=22,
+	NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN=23,
 };
 
 /* /proc/sys/net/ipv4/netfilter */
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 7e7746a..c9d0a0e 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
 	{ CTL_INT,	NET_IPV4_CONF_NOPOLICY,			"disable_policy" },
 	{ CTL_INT,	NET_IPV4_CONF_FORCE_IGMP_VERSION,	"force_igmp_version" },
 	{ CTL_INT,	NET_IPV4_CONF_PROMOTE_SECONDARIES,	"promote_secondaries" },
+	{ CTL_INT,	NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN,	"ignore_routes_with_linkdown" },
 	{}
 };
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 419d23c..7498716 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2169,6 +2169,8 @@ static struct devinet_sysctl_table {
 					"igmpv2_unsolicited_report_interval"),
 		DEVINET_SYSCTL_RW_ENTRY(IGMPV3_UNSOLICITED_REPORT_INTERVAL,
 					"igmpv3_unsolicited_report_interval"),
+		DEVINET_SYSCTL_RW_ENTRY(IGNORE_ROUTES_WITH_LINKDOWN,
+					"ignore_routes_with_linkdown"),
 
 		DEVINET_SYSCTL_FLUSHING_ENTRY(NOXFRM, "disable_xfrm"),
 		DEVINET_SYSCTL_FLUSHING_ENTRY(NOPOLICY, "disable_policy"),
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 872defb..b566b7f 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -280,7 +280,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 		fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 		fl4.flowi4_scope = scope;
 		fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
-		if (!fib_lookup(net, &fl4, &res))
+		if (!fib_lookup(net, &fl4, &res, 0))
 			return FIB_RES_PREFSRC(net, res);
 	} else {
 		scope = RT_SCOPE_LINK;
@@ -319,7 +319,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
 
 	net = dev_net(dev);
-	if (fib_lookup(net, &fl4, &res))
+	if (fib_lookup(net, &fl4, &res, 0))
 		goto last_resort;
 	if (res.type != RTN_UNICAST &&
 	    (res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev)))
@@ -354,7 +354,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	fl4.flowi4_oif = dev->ifindex;
 
 	ret = 0;
-	if (fib_lookup(net, &fl4, &res) == 0) {
+	if (fib_lookup(net, &fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE) == 0) {
 		if (res.type == RTN_UNICAST)
 			ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
 	}
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 5615198..18123d5 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -47,11 +47,12 @@ struct fib4_rule {
 #endif
 };
 
-int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
+int __fib_lookup(struct net *net, struct flowi4 *flp,
+		 struct fib_result *res, unsigned int flags)
 {
 	struct fib_lookup_arg arg = {
 		.result = res,
-		.flags = FIB_LOOKUP_NOREF,
+		.flags = flags,
 	};
 	int err;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 496507f..6cb49f6 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -623,7 +623,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			/* It is not necessary, but requires a bit of thinking */
 			if (fl4.flowi4_scope < RT_SCOPE_LINK)
 				fl4.flowi4_scope = RT_SCOPE_LINK;
-			err = fib_lookup(net, &fl4, &res);
+			err = fib_lookup(net, &fl4, &res,
+					 FIB_LOOKUP_IGNORE_LINKSTATE);
 			if (err) {
 				rcu_read_unlock();
 				return err;
@@ -1035,12 +1036,18 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 	    nla_put_in_addr(skb, RTA_PREFSRC, fi->fib_prefsrc))
 		goto nla_put_failure;
 	if (fi->fib_nhs == 1) {
+		struct in_device *in_dev;
 		if (fi->fib_nh->nh_gw &&
 		    nla_put_in_addr(skb, RTA_GATEWAY, fi->fib_nh->nh_gw))
 			goto nla_put_failure;
 		if (fi->fib_nh->nh_oif &&
 		    nla_put_u32(skb, RTA_OIF, fi->fib_nh->nh_oif))
 			goto nla_put_failure;
+		if (fi->fib_nh->nh_flags & RTNH_F_LINKDOWN) {
+		    in_dev = __in_dev_get_rcu(fi->fib_nh->nh_dev);
+		    if (in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev))
+			rtm->rtm_flags |= RTNH_F_DEAD;
+		}
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		if (fi->fib_nh[0].nh_tclassid &&
 		    nla_put_u32(skb, RTA_FLOW, fi->fib_nh[0].nh_tclassid))
@@ -1057,11 +1064,17 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 			goto nla_put_failure;
 
 		for_nexthops(fi) {
+			struct in_device *in_dev;
 			rtnh = nla_reserve_nohdr(skb, sizeof(*rtnh));
 			if (!rtnh)
 				goto nla_put_failure;
 
 			rtnh->rtnh_flags = nh->nh_flags & 0xFF;
+			if (nh->nh_flags & RTNH_F_LINKDOWN) {
+				in_dev = __in_dev_get_rcu(nh->nh_dev);
+				if (in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev))
+					rtnh->rtnh_flags |= RTNH_F_DEAD;
+			}
 			rtnh->rtnh_hops = nh->nh_weight - 1;
 			rtnh->rtnh_ifindex = nh->nh_oif;
 
@@ -1309,16 +1322,22 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
 void fib_select_multipath(struct fib_result *res)
 {
 	struct fib_info *fi = res->fi;
+	struct in_device *in_dev;
 	int w;
 
 	spin_lock_bh(&fib_multipath_lock);
 	if (fi->fib_power <= 0) {
 		int power = 0;
 		change_nexthops(fi) {
-			if (!(nexthop_nh->nh_flags & RTNH_F_DEAD)) {
-				power += nexthop_nh->nh_weight;
-				nexthop_nh->nh_power = nexthop_nh->nh_weight;
-			}
+			in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
+			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
+				continue;
+			if (in_dev &&
+			    IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
+			    nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
+				continue;
+			power += nexthop_nh->nh_weight;
+			nexthop_nh->nh_power = nexthop_nh->nh_weight;
 		} endfor_nexthops(fi);
 		fi->fib_power = power;
 		if (power <= 0) {
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 3c699c4..f75ca20 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1407,11 +1407,18 @@ found:
 		}
 		if (fi->fib_flags & RTNH_F_DEAD)
 			continue;
+
 		for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
 			const struct fib_nh *nh = &fi->fib_nh[nhsel];
+			struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);
 
 			if (nh->nh_flags & RTNH_F_DEAD)
 				continue;
+			if (in_dev &&
+			    IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
+			    nh->nh_flags & RTNH_F_LINKDOWN &&
+			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
+				continue;
 			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
 				continue;
 
diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
index 4bfaedf..8618fd1 100644
--- a/net/ipv4/netfilter/ipt_rpfilter.c
+++ b/net/ipv4/netfilter/ipt_rpfilter.c
@@ -40,7 +40,7 @@ static bool rpfilter_lookup_reverse(struct flowi4 *fl4,
 	struct net *net = dev_net(dev);
 	int ret __maybe_unused;
 
-	if (fib_lookup(net, fl4, &res))
+	if (fib_lookup(net, fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE))
 		return false;
 
 	if (res.type != RTN_UNICAST) {
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f605598..d0362a2 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -747,7 +747,7 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 		if (!(n->nud_state & NUD_VALID)) {
 			neigh_event_send(n, NULL);
 		} else {
-			if (fib_lookup(net, fl4, &res) == 0) {
+			if (fib_lookup(net, fl4, &res, 0) == 0) {
 				struct fib_nh *nh = &FIB_RES_NH(res);
 
 				update_or_create_fnhe(nh, fl4->daddr, new_gw,
@@ -975,7 +975,7 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 		return;
 
 	rcu_read_lock();
-	if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
+	if (fib_lookup(dev_net(dst->dev), fl4, &res, 0) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
 
 		update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
@@ -1186,7 +1186,7 @@ void ip_rt_get_source(u8 *addr, struct sk_buff *skb, struct rtable *rt)
 		fl4.flowi4_mark = skb->mark;
 
 		rcu_read_lock();
-		if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res) == 0)
+		if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res, 0) == 0)
 			src = FIB_RES_PREFSRC(dev_net(rt->dst.dev), res);
 		else
 			src = inet_select_addr(rt->dst.dev,
@@ -1716,7 +1716,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
 	fl4.daddr = daddr;
 	fl4.saddr = saddr;
-	err = fib_lookup(net, &fl4, &res);
+	err = fib_lookup(net, &fl4, &res, 0);
 	if (err != 0) {
 		if (!IN_DEV_FORWARD(in_dev))
 			err = -EHOSTUNREACH;
@@ -2123,7 +2123,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		goto make_route;
 	}
 
-	if (fib_lookup(net, fl4, &res)) {
+	if (fib_lookup(net, fl4, &res, 0)) {
 		res.fi = NULL;
 		res.table = NULL;
 		if (fl4->flowi4_oif) {
-- 
1.9.3
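
The fib_semantics.c and fib_trie.c hunks above apply the same test before a nexthop is used. A minimal sketch of that test as a standalone predicate follows, assuming a simplified nexthop type. struct nh_sketch, nh_usable() and the numeric value given to FIB_LOOKUP_IGNORE_LINKSTATE are illustrative assumptions, not kernel definitions; the RTNH_F_* values match the uapi header and the define quoted in this thread.

```c
#include <stdbool.h>

#define RTNH_F_DEAD      1   /* nexthop is dead (uapi value) */
#define RTNH_F_LINKDOWN  16  /* carrier-down on nexthop (value from this series) */

/* Assumed value, for illustration only. */
#define FIB_LOOKUP_IGNORE_LINKSTATE 1

/* Hypothetical stand-in for struct fib_nh plus the sysctl state of nh_dev. */
struct nh_sketch {
	unsigned int flags;    /* RTNH_F_* bits */
	bool ignore_linkdown;  /* ignore_routes_with_linkdown set on the device */
};

/* Returns true when a lookup may select this nexthop. */
static bool nh_usable(const struct nh_sketch *nh, int fib_flags)
{
	if (nh->flags & RTNH_F_DEAD)
		return false;
	/* Skip carrier-down nexthops only when the sysctl is set and the
	 * caller did not ask to ignore link state. */
	if (nh->ignore_linkdown &&
	    (nh->flags & RTNH_F_LINKDOWN) &&
	    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
		return false;
	return true;
}
```

In the diff, rpfilter_lookup_reverse() is the only caller that passes FIB_LOOKUP_IGNORE_LINKSTATE, so the reverse-path lookup still matches a route whose nexthop has lost carrier; every other converted caller passes 0.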


* [PATCH net-next 3/3 v3] iproute2: add support to print 'linkdown' nexthop flag
  2015-06-11  2:37 [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status Andy Gospodarek
  2015-06-11  2:37 ` [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops Andy Gospodarek
  2015-06-11  2:37 ` [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down Andy Gospodarek
@ 2015-06-11  2:37 ` Andy Gospodarek
  2015-06-11  3:02   ` Scott Feldman
  2015-06-11  3:07 ` [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status Scott Feldman
  3 siblings, 1 reply; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11  2:37 UTC (permalink / raw)
  To: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen
  Cc: Andy Gospodarek

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>

---
 ip/iproute.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/ip/iproute.c b/ip/iproute.c
index 3795baf..3369c49 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -451,6 +451,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 		fprintf(fp, "offload ");
 	if (r->rtm_flags & RTM_F_NOTIFY)
 		fprintf(fp, "notify ");
+	if (r->rtm_flags & RTNH_F_LINKDOWN)
+		fprintf(fp, "linkdown ");
 	if (tb[RTA_MARK]) {
 		unsigned int mark = *(unsigned int*)RTA_DATA(tb[RTA_MARK]);
 		if (mark) {
@@ -670,6 +672,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 				fprintf(fp, " onlink");
 			if (nh->rtnh_flags & RTNH_F_PERVASIVE)
 				fprintf(fp, " pervasive");
+			if (nh->rtnh_flags & RTNH_F_LINKDOWN)
+				fprintf(fp, " linkdown");
 			len -= NLMSG_ALIGN(nh->rtnh_len);
 			nh = RTNH_NEXT(nh);
 		}
-- 
1.9.3
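
A minimal sketch of how the per-nexthop flag bits map to the words printed in the second hunk, written as a standalone helper rather than as iproute2 code. format_nh_flags() and its table are hypothetical; the RTNH_F_* values are the uapi ones, with RTNH_F_LINKDOWN as defined in this series.

```c
#include <stdio.h>

#define RTNH_F_DEAD       1
#define RTNH_F_PERVASIVE  2
#define RTNH_F_ONLINK     4
#define RTNH_F_LINKDOWN   16

/* Hypothetical helper: append " name" to buf for every flag that is set. */
static void format_nh_flags(unsigned int flags, char *buf, size_t len)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} names[] = {
		{ RTNH_F_DEAD,      "dead" },
		{ RTNH_F_ONLINK,    "onlink" },
		{ RTNH_F_PERVASIVE, "pervasive" },
		{ RTNH_F_LINKDOWN,  "linkdown" },
	};
	size_t i, used = 0;

	if (len == 0)
		return;
	buf[0] = '\0';
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		int n;

		if (!(flags & names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, " %s", names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* out of space: stop appending */
		used += (size_t)n;
	}
}
```

With RTNH_F_DEAD | RTNH_F_LINKDOWN set, the helper produces " dead linkdown", the suffix seen on the p8p1 nexthop in the example output elsewhere in this thread.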


* Re: [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11  2:37 ` [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops Andy Gospodarek
@ 2015-06-11  2:53   ` Scott Feldman
  2015-06-11  3:28     ` Andy Gospodarek
  2015-06-11  6:07   ` Scott Feldman
  1 sibling, 1 reply; 25+ messages in thread
From: Scott Feldman @ 2015-06-11  2:53 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:

> @@ -1129,7 +1142,15 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>                                 dead++;
>                         else if (nexthop_nh->nh_dev == dev &&
>                                  nexthop_nh->nh_scope != scope) {
> -                               nexthop_nh->nh_flags |= RTNH_F_DEAD;
> +                               switch (event) {
> +                               case NETDEV_DOWN:
> +                               case NETDEV_UNREGISTER:
> +                                       nexthop_nh->nh_flags |= RTNH_F_DEAD;
> +                                       /* fall through */
> +                               case NETDEV_CHANGE:
> +                                       nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
> +                                       break;
> +                               }
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>                                 spin_lock_bh(&fib_multipath_lock);
>                                 fi->fib_power -= nexthop_nh->nh_power;
> @@ -1139,14 +1160,22 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>                                 dead++;
>                         }
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> -                       if (force > 1 && nexthop_nh->nh_dev == dev) {
> +                       if (event == NETDEV_UNREGISTER && nexthop_nh->nh_dev == dev) {
>                                 dead = fi->fib_nhs;
>                                 break;
>                         }
>  #endif
>                 } endfor_nexthops(fi)
>                 if (dead == fi->fib_nhs) {
> -                       fi->fib_flags |= RTNH_F_DEAD;
> +                       switch (event) {
> +                       case NETDEV_DOWN:
> +                       case NETDEV_UNREGISTER:
> +                               fi->fib_flags |= RTNH_F_DEAD;
> +                               /* fall through */
> +                       case NETDEV_CHANGE:
> +                               fi->fib_flags |= RTNH_F_LINKDOWN;

RTNH_F_LINKDOWN is to mark linkdown nexthop devs....why is the route
fi being marked RTNH_F_LINKDOWN?

The RTNH_F_LINKDOWN comment says:

#define RTNH_F_LINKDOWN                16      /* carrier-down on nexthop */

It's a per-nh flag, not per-route flag, correct?

Can you show an ECMP example with only a subset of the nexthops dev
linkdowned?  Show the ip route output after going thru some link
down/up events on some of the nexthops devs.
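
To make the per-nexthop versus per-route distinction concrete, here is a simplified model of the quoted fib_sync_down_dev() hunks. It is a sketch under stated assumptions, not the kernel function: the EV_* names stand in for NETDEV_DOWN, NETDEV_UNREGISTER and NETDEV_CHANGE, struct nh_state and struct fi_state are hypothetical, and the scope check and the multipath "force all dead on unregister" branch are left out.

```c
#define RTNH_F_DEAD      1
#define RTNH_F_LINKDOWN  16

/* Hypothetical event names standing in for the NETDEV_* notifier events. */
enum ev { EV_DOWN, EV_UNREGISTER, EV_CHANGE };

struct nh_state { int ifindex; unsigned int flags; };
struct fi_state { unsigned int flags; int nhs; struct nh_state nh[4]; };

static void sync_down(struct fi_state *fi, int ifindex, enum ev event)
{
	int i, dead = 0;

	for (i = 0; i < fi->nhs; i++) {
		struct nh_state *nh = &fi->nh[i];

		if (nh->flags & RTNH_F_DEAD) {
			dead++;		/* already dead from an earlier event */
			continue;
		}
		if (nh->ifindex != ifindex)
			continue;
		switch (event) {
		case EV_DOWN:
		case EV_UNREGISTER:
			nh->flags |= RTNH_F_DEAD;
			/* fall through */
		case EV_CHANGE:
			nh->flags |= RTNH_F_LINKDOWN;
			break;
		}
		dead++;
	}
	/* The route-level flags change only once every nexthop is counted. */
	if (dead == fi->nhs) {
		switch (event) {
		case EV_DOWN:
		case EV_UNREGISTER:
			fi->flags |= RTNH_F_DEAD;
			/* fall through */
		case EV_CHANGE:
			fi->flags |= RTNH_F_LINKDOWN;
			break;
		}
	}
}
```

In this model a carrier change on one of two nexthop devices sets RTNH_F_LINKDOWN on that nexthop only and leaves the route-level flags untouched.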


* Re: [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down
  2015-06-11  2:37 ` [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down Andy Gospodarek
@ 2015-06-11  2:57   ` YOSHIFUJI Hideaki
  2015-06-11  3:00     ` Scott Feldman
  0 siblings, 1 reply; 25+ messages in thread
From: YOSHIFUJI Hideaki @ 2015-06-11  2:57 UTC (permalink / raw)
  To: Andy Gospodarek, netdev, davem, ddutt, sfeldma, alexander.duyck,
	hannes, stephen
  Cc: hideaki.yoshifuji

Hi,

Andy Gospodarek wrote:
> This feature is only enabled with the new per-interface or ipv4 global
> sysctls called 'ignore_routes_with_linkdown'.
> 
> net.ipv4.conf.all.ignore_routes_with_linkdown = 0
> net.ipv4.conf.default.ignore_routes_with_linkdown = 0
> net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
:
> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> ---
:
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 7e7746a..c9d0a0e 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
>  	{ CTL_INT,	NET_IPV4_CONF_NOPOLICY,			"disable_policy" },
>  	{ CTL_INT,	NET_IPV4_CONF_FORCE_IGMP_VERSION,	"force_igmp_version" },
>  	{ CTL_INT,	NET_IPV4_CONF_PROMOTE_SECONDARIES,	"promote_secondaries" },
> +	{ CTL_INT,	NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN,	"ignore_routes_with_linkdown" },
>  	{}
>  };
>  

Please do not add new binary sysctl knob. Thank you.

-- 
Hideaki Yoshifuji <hideaki.yoshifuji@miraclelinux.com>
Technical Division, MIRACLE LINUX CORPORATION


* Re: [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down
  2015-06-11  2:57   ` YOSHIFUJI Hideaki
@ 2015-06-11  3:00     ` Scott Feldman
  2015-06-11  3:36       ` Andy Gospodarek
  0 siblings, 1 reply; 25+ messages in thread
From: Scott Feldman @ 2015-06-11  3:00 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki
  Cc: Andy Gospodarek, Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 7:57 PM, YOSHIFUJI Hideaki
<hideaki.yoshifuji@miraclelinux.com> wrote:
> Hi,
>
> Andy Gospodarek wrote:
>> This feature is only enabled with the new per-interface or ipv4 global
>> sysctls called 'ignore_routes_with_linkdown'.
>>
>> net.ipv4.conf.all.ignore_routes_with_linkdown = 0
>> net.ipv4.conf.default.ignore_routes_with_linkdown = 0
>> net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
> :
>> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
>> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
>> ---
> :
>> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
>> index 7e7746a..c9d0a0e 100644
>> --- a/kernel/sysctl_binary.c
>> +++ b/kernel/sysctl_binary.c
>> @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
>>       { CTL_INT,      NET_IPV4_CONF_NOPOLICY,                 "disable_policy" },
>>       { CTL_INT,      NET_IPV4_CONF_FORCE_IGMP_VERSION,       "force_igmp_version" },
>>       { CTL_INT,      NET_IPV4_CONF_PROMOTE_SECONDARIES,      "promote_secondaries" },
>> +     { CTL_INT,      NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN,      "ignore_routes_with_linkdown" },
>>       {}
>>  };
>>
>
> Please do not add new binary sysctl knob. Thank you.

Reason?


* Re: [PATCH net-next 3/3 v3] iproute2: add support to print 'linkdown' nexthop flag
  2015-06-11  2:37 ` [PATCH net-next 3/3 v3] iproute2: add support to print 'linkdown' nexthop flag Andy Gospodarek
@ 2015-06-11  3:02   ` Scott Feldman
  2015-06-11  3:13     ` Andy Gospodarek
  0 siblings, 1 reply; 25+ messages in thread
From: Scott Feldman @ 2015-06-11  3:02 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
>
> ---
>  ip/iproute.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/ip/iproute.c b/ip/iproute.c
> index 3795baf..3369c49 100644
> --- a/ip/iproute.c
> +++ b/ip/iproute.c
> @@ -451,6 +451,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
>                 fprintf(fp, "offload ");
>         if (r->rtm_flags & RTM_F_NOTIFY)
>                 fprintf(fp, "notify ");
> +       if (r->rtm_flags & RTNH_F_LINKDOWN)
> +               fprintf(fp, "linkdown ");

This seems confusing for ECMP case where only some nexthop devs are
RTNH_F_LINKDOWN?   Why mark entire route "linkdown" when it still has
viable nexthop devs for ECMP?


>         if (tb[RTA_MARK]) {
>                 unsigned int mark = *(unsigned int*)RTA_DATA(tb[RTA_MARK]);
>                 if (mark) {
> @@ -670,6 +672,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
>                                 fprintf(fp, " onlink");
>                         if (nh->rtnh_flags & RTNH_F_PERVASIVE)
>                                 fprintf(fp, " pervasive");
> +                       if (nh->rtnh_flags & RTNH_F_LINKDOWN)
> +                               fprintf(fp, " linkdown");
>                         len -= NLMSG_ALIGN(nh->rtnh_len);
>                         nh = RTNH_NEXT(nh);
>                 }
> --
> 1.9.3
>


* Re: [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status
  2015-06-11  2:37 [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status Andy Gospodarek
                   ` (2 preceding siblings ...)
  2015-06-11  2:37 ` [PATCH net-next 3/3 v3] iproute2: add support to print 'linkdown' nexthop flag Andy Gospodarek
@ 2015-06-11  3:07 ` Scott Feldman
  2015-06-11  3:19   ` Andy Gospodarek
  3 siblings, 1 reply; 25+ messages in thread
From: Scott Feldman @ 2015-06-11  3:07 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:

> There was also a request for switchdev support for this, but that will be
> posted as a followup as switchdev does not currently handle dead
> next-hops in a multi-path case and I felt that infra needed to be added
> first.

That's not true.  switchdev_fib_ipv4_add() passes *fi and all of the
nexthops for the route are hanging off of that, including the
nh->flags where you're setting LINKDOWN.  Multipath is not different
than singlepath in that regard.  Same API for both.  switchdev support
should be added for this NEW feature, especially since you're using
the feature to offload to a switch device.


* Re: [PATCH net-next 3/3 v3] iproute2: add support to print 'linkdown' nexthop flag
  2015-06-11  3:02   ` Scott Feldman
@ 2015-06-11  3:13     ` Andy Gospodarek
  0 siblings, 0 replies; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11  3:13 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 08:02:26PM -0700, Scott Feldman wrote:
> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> >
> > ---
> >  ip/iproute.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/ip/iproute.c b/ip/iproute.c
> > index 3795baf..3369c49 100644
> > --- a/ip/iproute.c
> > +++ b/ip/iproute.c
> > @@ -451,6 +451,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
> >                 fprintf(fp, "offload ");
> >         if (r->rtm_flags & RTM_F_NOTIFY)
> >                 fprintf(fp, "notify ");
> > +       if (r->rtm_flags & RTNH_F_LINKDOWN)
> > +               fprintf(fp, "linkdown ");
> 
> This seems confusing for ECMP case where only some nexthop devs are
> RTNH_F_LINKDOWN?   Why mark entire route "linkdown" when it still has
> viable nexthop devs for ECMP?

This is no different than what happens when nexthops are marked dead
today.  This situation happens when a route's nexthop has IFF_UP
cleared.

> 
> 
> >         if (tb[RTA_MARK]) {
> >                 unsigned int mark = *(unsigned int*)RTA_DATA(tb[RTA_MARK]);
> >                 if (mark) {
> > @@ -670,6 +672,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
> >                                 fprintf(fp, " onlink");
> >                         if (nh->rtnh_flags & RTNH_F_PERVASIVE)
> >                                 fprintf(fp, " pervasive");
> > +                       if (nh->rtnh_flags & RTNH_F_LINKDOWN)
> > +                               fprintf(fp, " linkdown");
> >                         len -= NLMSG_ALIGN(nh->rtnh_len);
> >                         nh = RTNH_NEXT(nh);
> >                 }
> > --
> > 1.9.3
> >


* Re: [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status
  2015-06-11  3:07 ` [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status Scott Feldman
@ 2015-06-11  3:19   ` Andy Gospodarek
  2015-06-11  3:33     ` Andy Gospodarek
  2015-06-11 15:44     ` Scott Feldman
  0 siblings, 2 replies; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11  3:19 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 08:07:10PM -0700, Scott Feldman wrote:
> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> 
> > There was also a request for switchdev support for this, but that will be
> > posted as a followup as switchdev does not currently handle dead
> > next-hops in a multi-path case and I felt that infra needed to be added
> > first.
> 
> That's not true.  switchdev_fib_ipv4_add() passes *fi and all of the
> nexthops for the route are hanging off of that, including the
> nh->flags where you're setting LINKDOWN.  Multipath is not different
> than singlepath in that regard.  Same API for both.  

The API is the same, but I did not see a path that would take a
multipath route and update the dead nexthops when an interface is taken
down with switchdev or rocker today.

I could be wrong (and I will test again), but create a multipath route
with nexthops on swp1 and swp2 and then call 'ip link set swp1 down' and
let me know if you see rocker's ECMP routes get updated so only the
nexthop on swp2 will be used.


* Re: [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11  2:53   ` Scott Feldman
@ 2015-06-11  3:28     ` Andy Gospodarek
  2015-06-11  4:14       ` Scott Feldman
  0 siblings, 1 reply; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11  3:28 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 07:53:59PM -0700, Scott Feldman wrote:
> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> 
> > @@ -1129,7 +1142,15 @@ int fib_sync_down_dev(struct net_device *dev, int force)
> >                                 dead++;
> >                         else if (nexthop_nh->nh_dev == dev &&
> >                                  nexthop_nh->nh_scope != scope) {
> > -                               nexthop_nh->nh_flags |= RTNH_F_DEAD;
> > +                               switch (event) {
> > +                               case NETDEV_DOWN:
> > +                               case NETDEV_UNREGISTER:
> > +                                       nexthop_nh->nh_flags |= RTNH_F_DEAD;
> > +                                       /* fall through */
> > +                               case NETDEV_CHANGE:
> > +                                       nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
> > +                                       break;
> > +                               }
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >                                 spin_lock_bh(&fib_multipath_lock);
> >                                 fi->fib_power -= nexthop_nh->nh_power;
> > @@ -1139,14 +1160,22 @@ int fib_sync_down_dev(struct net_device *dev, int force)
> >                                 dead++;
> >                         }
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -                       if (force > 1 && nexthop_nh->nh_dev == dev) {
> > +                       if (event == NETDEV_UNREGISTER && nexthop_nh->nh_dev == dev) {
> >                                 dead = fi->fib_nhs;
> >                                 break;
> >                         }
> >  #endif
> >                 } endfor_nexthops(fi)
> >                 if (dead == fi->fib_nhs) {
> > -                       fi->fib_flags |= RTNH_F_DEAD;
> > +                       switch (event) {
> > +                       case NETDEV_DOWN:
> > +                       case NETDEV_UNREGISTER:
> > +                               fi->fib_flags |= RTNH_F_DEAD;
> > +                               /* fall through */
> > +                       case NETDEV_CHANGE:
> > +                               fi->fib_flags |= RTNH_F_LINKDOWN;
> 
> RTNH_F_LINKDOWN is to mark linkdown nexthop devs....why is the route
> fi being marked RTNH_F_LINKDOWN?
> 
> The RTNH_F_LINKDOWN comment says:
> 
> #define RTNH_F_LINKDOWN                16      /* carrier-down on nexthop */

This is done with the dead flag already.  I'm actually following the
precedent already set there.

> It's a per-nh flag, not per-route flag, correct?
> 
> Can you show an ECMP example with only a subset of the nexthops dev
> linkdowned?  Show the ip route output after going thru some link
> down/up events on some of the nexthops devs.

Sure!  This is exactly what I've been using for testing.

# ip route show
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
90.0.0.0/24 via 70.0.0.2 dev p7p1
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 10
100.0.0.0/24
	nexthop via 70.0.0.2  dev p7p1 weight 1
	nexthop via 80.0.0.2  dev p8p1 weight 1
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
# # take p8p1 link down
# ip route show 
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown 
90.0.0.0/24 via 70.0.0.2 dev p7p1 
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 10 dead linkdown 
100.0.0.0/24 
	nexthop via 70.0.0.2  dev p7p1 weight 1
	nexthop via 80.0.0.2  dev p8p1 weight 1 dead linkdown
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
# ip route get 100.0.0.2 
100.0.0.2 via 70.0.0.2 dev p7p1  src 70.0.0.1 
    cache 
# ip route get 100.0.0.2 
100.0.0.2 via 70.0.0.2 dev p7p1  src 70.0.0.1 
    cache 
# # take p8p1 link up
# ip route show
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
90.0.0.0/24 via 70.0.0.2 dev p7p1 
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 10 
100.0.0.0/24 
	nexthop via 70.0.0.2  dev p7p1 weight 1
	nexthop via 80.0.0.2  dev p8p1 weight 1
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
# ip route get 100.0.0.2 
100.0.0.2 via 70.0.0.2 dev p7p1  src 70.0.0.1 
    cache 
# ip route get 100.0.0.2 
100.0.0.2 via 80.0.0.2 dev p8p1  src 80.0.0.1 
    cache 
# ip route get 100.0.0.2 
100.0.0.2 via 70.0.0.2 dev p7p1  src 70.0.0.1 
    cache 
# ip route get 100.0.0.2 
100.0.0.2 via 80.0.0.2 dev p8p1  src 80.0.0.1 
    cache 
# # you can see the round robin happening
# # take all ports p8p1 and p7p1 down
# ip route show
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 dead linkdown 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 10 dead linkdown
100.0.0.0/24
	nexthop via 70.0.0.2  dev p7p1 weight 1 dead linkdown
	nexthop via 80.0.0.2  dev p8p1 weight 1 dead linkdown
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
# ip route get 100.0.0.2 
RTNETLINK answers: Network is unreachable
# ip route get 80.0.0.2 
RTNETLINK answers: Network is unreachable
# ip route get 80.0.0.1
local 80.0.0.1 dev lo  src 80.0.0.1 
    cache <local> 
# ip route get 70.0.0.1
local 70.0.0.1 dev lo  src 70.0.0.1 
    cache <local> 
# # local addrs are still reachable 
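
The selection behaviour in the transcript above can be approximated with a plain round-robin over usable nexthops. This is only a sketch: pick_next() and struct nh_entry are hypothetical, weights are ignored (both nexthops have weight 1 in the example), and the kernel's fib_select_multipath() uses its own fib_power/nh_power accounting rather than this loop.

```c
#define RTNH_F_DEAD      1
#define RTNH_F_LINKDOWN  16

/* Hypothetical nexthop record for the sketch. */
struct nh_entry { const char *dev; unsigned int flags; };

/*
 * Return the index of the next usable nexthop after 'last' (use -1 for the
 * first call), or -1 when none is usable, which corresponds to the
 * "Network is unreachable" answers in the transcript.
 */
static int pick_next(const struct nh_entry *nh, int n, int last,
		     int ignore_linkdown)
{
	int step;

	for (step = 1; step <= n; step++) {
		int i = (last + step) % n;

		if (nh[i].flags & RTNH_F_DEAD)
			continue;
		if (ignore_linkdown && (nh[i].flags & RTNH_F_LINKDOWN))
			continue;
		return i;
	}
	return -1;
}
```

With p8p1 marked linkdown and the sysctl set, every call returns the p7p1 entry; with both marked, the function returns -1.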


* Re: [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status
  2015-06-11  3:19   ` Andy Gospodarek
@ 2015-06-11  3:33     ` Andy Gospodarek
  2015-06-11 15:44     ` Scott Feldman
  1 sibling, 0 replies; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11  3:33 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 11:19:29PM -0400, Andy Gospodarek wrote:
> On Wed, Jun 10, 2015 at 08:07:10PM -0700, Scott Feldman wrote:
> > On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
> > <gospo@cumulusnetworks.com> wrote:
> > 
> > > There was also a request for switchdev support for this, but that will be
> > > posted as a followup as switchdev does not currently handle dead
> > > next-hops in a multi-path case and I felt that infra needed to be added
> > > first.
> > 
> > That's not true.  switchdev_fib_ipv4_add() passes *fi and all of the
> > nexthops for the route are hanging off of that, including the
> > nh->flags where you're setting LINKDOWN.  Multipath is not different
> > than singlepath in that regard.  Same API for both.  
> 
> The API is the same, but I did not see a path that would take a
> multipath route and update the dead nexthops when an interface is taken
> down with switchdev or rocker today.
> 
> I could be wrong (and I will test again), but create a multipath route
> with nexthops on swp1 and swp2 and then call 'ip link set swp1 down' and
> let me know if you see rocker's ECMP routes get updated so only the
> nexthop on swp2 will be used.
> 

Scott, as I stated before I have every intention of adding switchdev as well
as ipv6 support and at this point I hope people know I'm good for it.
Since I'm already on v3 of the ipv4 support that is looking like a good
idea.  :)  I'm sure you can empathize as you have organically grown the
switchdev support in the kernel.


* Re: [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down
  2015-06-11  3:00     ` Scott Feldman
@ 2015-06-11  3:36       ` Andy Gospodarek
  2015-06-11  4:32         ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11  3:36 UTC (permalink / raw)
  To: Scott Feldman
  Cc: YOSHIFUJI Hideaki, Netdev, David S. Miller, ddutt,
	Alexander Duyck, Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 08:00:14PM -0700, Scott Feldman wrote:
> On Wed, Jun 10, 2015 at 7:57 PM, YOSHIFUJI Hideaki
> <hideaki.yoshifuji@miraclelinux.com> wrote:
> > Hi,
> >
> > Andy Gospodarek wrote:
> >> This feature is only enabled with the new per-interface or ipv4 global
> >> sysctls called 'ignore_routes_with_linkdown'.
> >>
> >> net.ipv4.conf.all.ignore_routes_with_linkdown = 0
> >> net.ipv4.conf.default.ignore_routes_with_linkdown = 0
> >> net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
> > :
> >> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> >> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> >> ---
> > :
> >> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> >> index 7e7746a..c9d0a0e 100644
> >> --- a/kernel/sysctl_binary.c
> >> +++ b/kernel/sysctl_binary.c
> >> @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
> >>       { CTL_INT,      NET_IPV4_CONF_NOPOLICY,                 "disable_policy" },
> >>       { CTL_INT,      NET_IPV4_CONF_FORCE_IGMP_VERSION,       "force_igmp_version" },
> >>       { CTL_INT,      NET_IPV4_CONF_PROMOTE_SECONDARIES,      "promote_secondaries" },
> >> +     { CTL_INT,      NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN,      "ignore_routes_with_linkdown" },
> >>       {}
> >>  };
> >>
> >
> > Please do not add new binary sysctl knob. Thank you.
> 
> Reason?

I'll echo Scott's request here.  I realize that an abundance of them is
bad, but (to me) this one seems useful.  Unless of course we want to
make this proposed behavior the default.  :-)
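
For context on the objection: the table in kernel/sysctl_binary.c serves the legacy numeric sysctl(2) interface, while the knob itself is reachable by name through procfs. A minimal sketch of reading it that way follows, assuming the usual mapping of net.ipv4.conf.<if>.<name> to a path under /proc/sys; linkdown_knob_path() and read_linkdown_knob() are hypothetical helpers, not part of the patch.

```c
#include <stdio.h>

/* Build the procfs path for the knob; returns 0, or -1 if buf is too small. */
static int linkdown_knob_path(const char *ifname, char *buf, size_t len)
{
	int n = snprintf(buf, len,
			 "/proc/sys/net/ipv4/conf/%s/ignore_routes_with_linkdown",
			 ifname);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the knob for one interface ("all", "default", "lo", ...).
 * Returns its integer value, or -1 if it cannot be read. */
static int read_linkdown_knob(const char *ifname)
{
	char path[128];
	FILE *f;
	int val = -1;

	if (linkdown_knob_path(ifname, path, sizeof(path)) < 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;	/* kernel without the patch, or no such interface */
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

On a kernel without this series the file does not exist and read_linkdown_knob("all") returns -1.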


* Re: [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11  3:28     ` Andy Gospodarek
@ 2015-06-11  4:14       ` Scott Feldman
  0 siblings, 0 replies; 25+ messages in thread
From: Scott Feldman @ 2015-06-11  4:14 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 8:28 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> On Wed, Jun 10, 2015 at 07:53:59PM -0700, Scott Feldman wrote:
>> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
>> <gospo@cumulusnetworks.com> wrote:
>>
>> > @@ -1129,7 +1142,15 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>> >                                 dead++;
>> >                         else if (nexthop_nh->nh_dev == dev &&
>> >                                  nexthop_nh->nh_scope != scope) {
>> > -                               nexthop_nh->nh_flags |= RTNH_F_DEAD;
>> > +                               switch (event) {
>> > +                               case NETDEV_DOWN:
>> > +                               case NETDEV_UNREGISTER:
>> > +                                       nexthop_nh->nh_flags |= RTNH_F_DEAD;
>> > +                                       /* fall through */
>> > +                               case NETDEV_CHANGE:
>> > +                                       nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
>> > +                                       break;
>> > +                               }
>> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> >                                 spin_lock_bh(&fib_multipath_lock);
>> >                                 fi->fib_power -= nexthop_nh->nh_power;
>> > @@ -1139,14 +1160,22 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>> >                                 dead++;
>> >                         }
>> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> > -                       if (force > 1 && nexthop_nh->nh_dev == dev) {
>> > +                       if (event == NETDEV_UNREGISTER && nexthop_nh->nh_dev == dev) {
>> >                                 dead = fi->fib_nhs;
>> >                                 break;
>> >                         }
>> >  #endif
>> >                 } endfor_nexthops(fi)
>> >                 if (dead == fi->fib_nhs) {
>> > -                       fi->fib_flags |= RTNH_F_DEAD;
>> > +                       switch (event) {
>> > +                       case NETDEV_DOWN:
>> > +                       case NETDEV_UNREGISTER:
>> > +                               fi->fib_flags |= RTNH_F_DEAD;
>> > +                               /* fall through */
>> > +                       case NETDEV_CHANGE:
>> > +                               fi->fib_flags |= RTNH_F_LINKDOWN;
>>
>> RTNH_F_LINKDOWN is to mark linkdown nexthop devs....why is the route
>> fi being marked RTNH_F_LINKDOWN?
>>
>> The RTNH_F_LINKDOWN comment says:
>>
>> #define RTNH_F_LINKDOWN                16      /* carrier-down on nexthop */
>
> This is done with the dead flag already.  I'm actually following the
> precedent already set there.
>
>> It's a per-nh flag, not per-route flag, correct?
>>
>> Can you show an ECMP example with only a subset of the nexthops dev
>> linkdowned?  Show the ip route output after going thru some link
>> down/up events on some of the nexthops devs.
>
> Sure!  This is exactly what I've been using for testing.
>
> # ip route show
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
> 90.0.0.0/24 via 70.0.0.2 dev p7p1
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 10
> 100.0.0.0/24
>         nexthop via 70.0.0.2  dev p7p1 weight 1
>         nexthop via 80.0.0.2  dev p8p1 weight 1
> 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
> # # take p8p1 link down
> # ip route show
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
> 90.0.0.0/24 via 70.0.0.2 dev p7p1
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 10 dead linkdown
> 100.0.0.0/24
>         nexthop via 70.0.0.2  dev p7p1 weight 1
>         nexthop via 80.0.0.2  dev p8p1 weight 1 dead linkdown
> 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
> # ip route get 100.0.0.2
> 100.0.0.2 via 70.0.0.2 dev p7p1  src 70.0.0.1
>     cache
> # ip route get 100.0.0.2
> 100.0.0.2 via 70.0.0.2 dev p7p1  src 70.0.0.1
>     cache
> # # take p8p1 link up
> # ip route show
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
> 90.0.0.0/24 via 70.0.0.2 dev p7p1
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 10
> 100.0.0.0/24
>         nexthop via 70.0.0.2  dev p7p1 weight 1
>         nexthop via 80.0.0.2  dev p8p1 weight 1
> 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
> # ip route get 100.0.0.2
> 100.0.0.2 via 70.0.0.2 dev p7p1  src 70.0.0.1
>     cache
> # ip route get 100.0.0.2
> 100.0.0.2 via 80.0.0.2 dev p8p1  src 80.0.0.1
>     cache
> # ip route get 100.0.0.2
> 100.0.0.2 via 70.0.0.2 dev p7p1  src 70.0.0.1
>     cache
> # ip route get 100.0.0.2
> 100.0.0.2 via 80.0.0.2 dev p8p1  src 80.0.0.1
>     cache
> # # you can see the round robin happening
> # # take all ports p8p1 and p7p1 down
> # ip route show
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 dead linkdown
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
> 90.0.0.0/24 via 70.0.0.2 dev p7p1 dead linkdown
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 10 dead linkdown
> 100.0.0.0/24
>         nexthop via 70.0.0.2  dev p7p1 weight 1 dead linkdown
>         nexthop via 80.0.0.2  dev p8p1 weight 1 dead linkdown
> 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
> # ip route get 100.0.0.2
> RTNETLINK answers: Network is unreachable
> # ip route get 80.0.0.2
> RTNETLINK answers: Network is unreachable
> # ip route get 80.0.0.1
> local 80.0.0.1 dev lo  src 80.0.0.1
>     cache <local>
> # ip route get 70.0.0.1
> local 70.0.0.1 dev lo  src 70.0.0.1
>     cache <local>
> # # local addrs are still reachable

Perfect, looks good, thanks.
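
For anyone reading the quoted hunk in isolation, the event-to-flag mapping
can be sketched in userspace as below.  This is only an illustration of the
switch statement quoted above, not the kernel code: the enum values are
hypothetical stand-ins for the kernel's NETDEV_* events, and
flags_for_event() is a made-up helper name.  RTNH_F_LINKDOWN uses the value
quoted in this thread; RTNH_F_DEAD is the existing uapi value.

```c
/* Flag values as in include/uapi/linux/rtnetlink.h; LINKDOWN is the
 * value quoted in this thread. */
#define RTNH_F_DEAD      1
#define RTNH_F_LINKDOWN  16

/* Hypothetical stand-ins for the kernel's NETDEV_* notifier events. */
enum sketch_event { EV_DOWN, EV_UNREGISTER, EV_CHANGE };

/* Return the flags with the bits the quoted switch would add for `ev`. */
static unsigned int flags_for_event(enum sketch_event ev, unsigned int flags)
{
	switch (ev) {
	case EV_DOWN:
	case EV_UNREGISTER:
		/* admin-down or device removal: the route is dead ... */
		flags |= RTNH_F_DEAD;
		/* fall through */
	case EV_CHANGE:
		/* ... and in every case here the carrier is considered down */
		flags |= RTNH_F_LINKDOWN;
		break;
	}
	return flags;
}
```

Under this reading, a carrier change alone sets only LINKDOWN, while
admin-down or unregister sets both bits.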

* Re: [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down
  2015-06-11  3:36       ` Andy Gospodarek
@ 2015-06-11  4:32         ` David Miller
  2015-06-11 19:35           ` Andy Gospodarek
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2015-06-11  4:32 UTC (permalink / raw)
  To: gospo
  Cc: sfeldma, hideaki.yoshifuji, netdev, ddutt, alexander.duyck,
	hannes, stephen

From: Andy Gospodarek <gospo@cumulusnetworks.com>
Date: Wed, 10 Jun 2015 23:36:21 -0400

> On Wed, Jun 10, 2015 at 08:00:14PM -0700, Scott Feldman wrote:
>> On Wed, Jun 10, 2015 at 7:57 PM, YOSHIFUJI Hideaki
>> <hideaki.yoshifuji@miraclelinux.com> wrote:
>> > Hi,
>> >
>> > Andy Gospodarek wrote:
>> >> This feature is only enabled with the new per-interface or ipv4 global
>> >> sysctls called 'ignore_routes_with_linkdown'.
>> >>
>> >> net.ipv4.conf.all.ignore_routes_with_linkdown = 0
>> >> net.ipv4.conf.default.ignore_routes_with_linkdown = 0
>> >> net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
>> > :
>> >> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
>> >> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
>> >> ---
>> > :
>> >> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
>> >> index 7e7746a..c9d0a0e 100644
>> >> --- a/kernel/sysctl_binary.c
>> >> +++ b/kernel/sysctl_binary.c
>> >> @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
>> >>       { CTL_INT,      NET_IPV4_CONF_NOPOLICY,                 "disable_policy" },
>> >>       { CTL_INT,      NET_IPV4_CONF_FORCE_IGMP_VERSION,       "force_igmp_version" },
>> >>       { CTL_INT,      NET_IPV4_CONF_PROMOTE_SECONDARIES,      "promote_secondaries" },
>> >> +     { CTL_INT,      NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN,      "ignore_routes_with_linkdown" },
>> >>       {}
>> >>  };
>> >>
>> >
>> > Please do not add new binary sysctl knob. Thank you.
>> 
>> Reason?
> 
> I'll echo Scott's request here.  I realize that an abundance of them is
> bad, but (to me) this one seems useful.  Unless of course we want to
> make this proposed behavior the default.  :-)

Kernel wide, new binary sysctls are verboten.

Everyone should be accessing sysctls via their name.

You have to remove this.
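
As an illustration of access by name, here is a rough userspace sketch that
maps a dotted sysctl name onto its /proc/sys path and reads an integer from
it.  The helper names sysctl_name_to_path() and sysctl_read_int() are made
up for this example, and the naive dot-to-slash conversion assumes no
interface name in the path contains a dot (a VLAN device such as eth0.100
would need special handling).

```c
#include <stdio.h>
#include <string.h>

/* Convert a dotted sysctl name into its /proc/sys path.
 * Returns 0 on success, -1 if buf is too small. */
static int sysctl_name_to_path(const char *name, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/sys/%s", name);

	if (n < 0 || (size_t)n >= len)
		return -1;
	/* Only rewrite the part after the fixed prefix. */
	for (char *p = buf + strlen("/proc/sys/"); *p; p++)
		if (*p == '.')
			*p = '/';
	return 0;
}

/* Read an integer sysctl by name.  Returns 0 on success, -1 on failure
 * (unknown name, unreadable file, or non-integer content). */
static int sysctl_read_int(const char *name, int *value)
{
	char path[256];
	FILE *f;
	int ok;

	if (sysctl_name_to_path(name, path, sizeof(path)) < 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling sysctl_read_int("net.ipv4.conf.all.ignore_routes_with_linkdown", &v)
would then read the knob discussed here on a kernel that provides it, and
fail with -1 on one that does not.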

* Re: [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11  2:37 ` [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops Andy Gospodarek
  2015-06-11  2:53   ` Scott Feldman
@ 2015-06-11  6:07   ` Scott Feldman
  2015-06-11 11:23     ` Andy Gospodarek
  1 sibling, 1 reply; 25+ messages in thread
From: Scott Feldman @ 2015-06-11  6:07 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
> reachable via an interface where carrier is off.  No action is taken,
> but additional flags are passed to userspace to indicate carrier status.

Andy, it seems now RTNH_F_LINKDOWN and RTNH_F_DEAD are very similar
and I'm wondering if this could be done without introducing a new flag
and just use RTNH_F_DEAD.  The link change event would set RTNH_F_DEAD
on nh on dev link down, and clear on link up.  The sysctl knob would
be something like "nexthop_dead_on_linkdown", default off.  So
basically expanding the ways RTNH_F_DEAD can be set.  That would
simplify the patch set quite a bit and require no changes to iproute2.

-scott

* Re: [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11  6:07   ` Scott Feldman
@ 2015-06-11 11:23     ` Andy Gospodarek
  2015-06-11 14:47       ` Scott Feldman
  2015-06-11 14:50       ` Alexander Duyck
  0 siblings, 2 replies; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11 11:23 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 11:07:28PM -0700, Scott Feldman wrote:
> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> > Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
> > reachable via an interface where carrier is off.  No action is taken,
> > but additional flags are passed to userspace to indicate carrier status.
> 
> Andy, it seems now RTNH_F_LINKDOWN and RTNH_F_DEAD are very similar
> and I'm wondering if this could be done without introducing a new flag
> and just use RTNH_F_DEAD.  The link change event would set RTNH_F_DEAD
> on nh on dev link down, and clear on link up.  The sysctl knob would
> be something like "nexthop_dead_on_linkdown", default off.  So
> basically expanding the ways RTNH_F_DEAD can be set.  That would
> simplify the patch set quite a bit and require no changes to iproute2.
> 

You are absolutely correct that what you describe would be less churn to
userspace.  From a functionality standpoint that is close to what was
originally proposed, but Alex specifically did not like the behavioral
change to what having RTNH_F_DEAD set means (at least that was what I
understood).

That was what made me make the move to add this additional flag that was
exported to userspace, so it was possible to differentiate the old dead
routes/nexthop functionality from those that were not going to be dead
due to link being down.  

At this point I think I prefer the additional data provided by the new
flag exported to userspace.

* Re: [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11 11:23     ` Andy Gospodarek
@ 2015-06-11 14:47       ` Scott Feldman
  2015-06-11 14:50       ` Alexander Duyck
  1 sibling, 0 replies; 25+ messages in thread
From: Scott Feldman @ 2015-06-11 14:47 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Thu, Jun 11, 2015 at 4:23 AM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> On Wed, Jun 10, 2015 at 11:07:28PM -0700, Scott Feldman wrote:
>> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
>> <gospo@cumulusnetworks.com> wrote:
>> > Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
>> > reachable via an interface where carrier is off.  No action is taken,
>> > but additional flags are passed to userspace to indicate carrier status.
>>
>> Andy, it seems now RTNH_F_LINKDOWN and RTNH_F_DEAD are very similar
>> and I'm wondering if this could be done without introducing a new flag
>> and just use RTNH_F_DEAD.  The link change event would set RTNH_F_DEAD
>> on nh on dev link down, and clear on link up.  The sysctl knob would
>> be something like "nexthop_dead_on_linkdown", default off.  So
>> basically expanding the ways RTNH_F_DEAD can be set.  That would
>> simplify the patch set quite a bit and require no changes to iproute2.
>>
>
> You are absolutely correct that what you describe would be less churn to
> userspace.  From a functionality standpoint that is close to what was
> originally proposed, but Alex specifically did not like the behavioral
> change to what having RTNH_F_DEAD set means (at least that was what I
> understood).
>
> That was what made me make the move to add this additional flag that was
> exported to userspace, so it was possible to differentiate the old dead
> routes/nexthop functionality from those that were not going to be dead
> due to link being down.

Why does user space need to know _why_ a nh is dead?  User space
already knows the state (admin/link) of the nh dev.

I'm not seeing why user space needs to differentiate why nh is dead.
The kernel only needs to know if nh is dead to exclude nh from ecmp
selection.  Same for an offload device.

Can you explain how this new flag provides user space more information
than what's already available from RTM_NEWLINK notifications?

> At this point I think I prefer the additional data provided by the new
> flag exported to userspace.

* Re: [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11 11:23     ` Andy Gospodarek
  2015-06-11 14:47       ` Scott Feldman
@ 2015-06-11 14:50       ` Alexander Duyck
  2015-06-11 15:25         ` Scott Feldman
  2015-06-11 16:53         ` Dinesh Dutt
  1 sibling, 2 replies; 25+ messages in thread
From: Alexander Duyck @ 2015-06-11 14:50 UTC (permalink / raw)
  To: Andy Gospodarek, Scott Feldman
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen



On 06/11/2015 04:23 AM, Andy Gospodarek wrote:
> On Wed, Jun 10, 2015 at 11:07:28PM -0700, Scott Feldman wrote:
>> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
>> <gospo@cumulusnetworks.com> wrote:
>>> Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
>>> reachable via an interface where carrier is off.  No action is taken,
>>> but additional flags are passed to userspace to indicate carrier status.
>> Andy, it seems now RTNH_F_LINKDOWN and RTNH_F_DEAD are very similar
>> and I'm wondering if this could be done without introducing a new flag
>> and just use RTNH_F_DEAD.  The link change event would set RTNH_F_DEAD
>> on nh on dev link down, and clear on link up.  The sysctl knob would
>> be something like "nexthop_dead_on_linkdown", default off.  So
>> basically expanding the ways RTNH_F_DEAD can be set.  That would
>> simplify the patch set quite a bit and require no changes to iproute2.
>>
> You are absolutely correct that what you describe would be less churn to
> userspace.  From a functionality standpoint that is close to what was
> originally proposed, but Alex specifically did not like the behavioral
> change to what having RTNH_F_DEAD set means (at least that was what I
> understood).
>
> That was what made me make the move to add this additional flag that was
> exported to userspace, so it was possible to differentiate the old dead
> routes/nexthop functionality from those that were not going to be dead
> due to link being down.
> At this point I think I prefer the additional data provided by the new
> flag exported to userspace.

I preferred the 2 flag solution as the original solution still required 
2 flags, it just only exposed 1 to user-space.  As a result it was much 
more error prone since it was fairly easy to get into a confused state 
about why the link was dead.

With the 2 flag solution it becomes much easier to sort out why the 
route is not functional and it is much easier to isolate for things like 
the sysctl which only disables the use of LINKDOWN and not DEAD.

- Alex
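
To make the distinction concrete, the selection rule described above might
be sketched as follows.  This is a simplified userspace illustration rather
than the patch itself: nh_usable() and its ignore_linkdown parameter are
hypothetical names, with ignore_linkdown standing in for the
ignore_routes_with_linkdown sysctl.

```c
#include <stdbool.h>

/* Flag values as in include/uapi/linux/rtnetlink.h; LINKDOWN is the
 * value quoted in this thread. */
#define RTNH_F_DEAD      1
#define RTNH_F_LINKDOWN  16

/* Decide whether a nexthop may be selected.
 * DEAD always excludes the nexthop; LINKDOWN excludes it only when the
 * (hypothetical) ignore_linkdown setting is enabled. */
static bool nh_usable(unsigned int nh_flags, bool ignore_linkdown)
{
	if (nh_flags & RTNH_F_DEAD)
		return false;
	if ((nh_flags & RTNH_F_LINKDOWN) && ignore_linkdown)
		return false;
	return true;
}
```

With a single flag, the second check could not be switched off without
also changing how administratively dead nexthops are treated.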

* Re: [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11 14:50       ` Alexander Duyck
@ 2015-06-11 15:25         ` Scott Feldman
  2015-06-11 16:53         ` Dinesh Dutt
  1 sibling, 0 replies; 25+ messages in thread
From: Scott Feldman @ 2015-06-11 15:25 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andy Gospodarek, Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Thu, Jun 11, 2015 at 7:50 AM, Alexander Duyck
<alexander.h.duyck@redhat.com> wrote:
>
>
> On 06/11/2015 04:23 AM, Andy Gospodarek wrote:
>>
>> On Wed, Jun 10, 2015 at 11:07:28PM -0700, Scott Feldman wrote:
>>>
>>> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
>>> <gospo@cumulusnetworks.com> wrote:
>>>>
>>>> Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
>>>> reachable via an interface where carrier is off.  No action is taken,
>>>> but additional flags are passed to userspace to indicate carrier status.
>>>
>>> Andy, it seems now RTNH_F_LINKDOWN and RTNH_F_DEAD are very similar
>>> and I'm wondering if this could be done without introducing a new flag
>>> and just use RTNH_F_DEAD.  The link change event would set RTNH_F_DEAD
>>> on nh on dev link down, and clear on link up.  The sysctl knob would
>>> be something like "nexthop_dead_on_linkdown", default off.  So
>>> basically expanding the ways RTNH_F_DEAD can be set.  That would
>>> simplify the patch set quite a bit and require no changes to iproute2.
>>>
>> You are absolutely correct that what you describe would be less churn to
>> userspace.  From a functionality standpoint that is close to what was
>> originally proposed, but Alex specifically did not like the behavioral
>> change to what having RTNH_F_DEAD set means (at least that was what I
>> understood).
>>
>> That was what made me make the move to add this additional flag that was
>> exported to userspace, so it was possible to differentiate the old dead
>> routes/nexthop functionality from those that were not going to be dead
>> due to link being down.
>> At this point I think I prefer the additional data provided by the new
>> flag exported to userspace.
>
>
> I preferred the 2 flag solution as the original solution still required 2
> flags, it just only exposed 1 to user-space.  As a result it was much more
> error prone since it was fairly easy to get into a confused state about why
> the link was dead.
>
> With the 2 flag solution it becomes much easier to sort out why the route is
> not functional and it is much easier to isolate for things like the sysctl
> which only disables the use of LINKDOWN and not DEAD.

Ok, for user troubleshooting, that makes sense.

* Re: [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status
  2015-06-11  3:19   ` Andy Gospodarek
  2015-06-11  3:33     ` Andy Gospodarek
@ 2015-06-11 15:44     ` Scott Feldman
  2015-06-11 18:17       ` Andy Gospodarek
  1 sibling, 1 reply; 25+ messages in thread
From: Scott Feldman @ 2015-06-11 15:44 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Wed, Jun 10, 2015 at 8:19 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> On Wed, Jun 10, 2015 at 08:07:10PM -0700, Scott Feldman wrote:
>> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
>> <gospo@cumulusnetworks.com> wrote:
>>
>> > There was also a request for switchdev support for this, but that will be
>> > posted as a followup as switchdev does not currently handle dead
>> > next-hops in a multi-path case and I felt that infra needed to be added
>> > first.
>>
>> That's not true.  switchdev_fib_ipv4_add() passes *fi and all of the
>> nexthops for the route are hanging off of that, including the
>> nh->flags where you're setting LINKDOWN.  Multipath is not different
>> than singlepath in that regard.  Same API for both.
>
> The API is the same, but I did not see a path that would take a
> multipath route and update the dead nexthops when an interface is taken
> down with switchdev or rocker today.
>
> I could be wrong (and I will test again), but create a multipath route
> with nexthops on swp1 and swp2 and then call 'ip link set swp1 down' and
> let me know if you see rocker's ECMP routes get updated so only the
> nexthop on swp2 will be used.

I don't have ecmp support in rocker yet, but switchdev should be ready
for ecmp.  I tried the test you suggest and switchdev is calling into
the driver with updates to the routes with nhs marked DEAD.  So maybe
your patchset is switchdev-ready?  I'd have to apply your patch to
test. I'll wait for your v4 to address sysctl naming.

* Re: [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops
  2015-06-11 14:50       ` Alexander Duyck
  2015-06-11 15:25         ` Scott Feldman
@ 2015-06-11 16:53         ` Dinesh Dutt
  1 sibling, 0 replies; 25+ messages in thread
From: Dinesh Dutt @ 2015-06-11 16:53 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andy Gospodarek, Scott Feldman, Netdev, David S. Miller,
	Alexander Duyck, Hannes Frederic Sowa, stephen

Yes, this is what I liked about the 2 flag solution too compared to
the original.

Dinesh

On Thu, Jun 11, 2015 at 7:50 AM, Alexander Duyck
<alexander.h.duyck@redhat.com> wrote:
>
>
> On 06/11/2015 04:23 AM, Andy Gospodarek wrote:
>>
>> On Wed, Jun 10, 2015 at 11:07:28PM -0700, Scott Feldman wrote:
>>>
>>> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
>>> <gospo@cumulusnetworks.com> wrote:
>>>>
>>>> Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
>>>> reachable via an interface where carrier is off.  No action is taken,
>>>> but additional flags are passed to userspace to indicate carrier status.
>>>
>>> Andy, it seems now RTNH_F_LINKDOWN and RTNH_F_DEAD are very similar
>>> and I'm wondering if this could be done without introducing a new flag
>>> and just use RTNH_F_DEAD.  The link change event would set RTNH_F_DEAD
>>> on nh on dev link down, and clear on link up.  The sysctl knob would
>>> be something like "nexthop_dead_on_linkdown", default off.  So
>>> basically expanding the ways RTNH_F_DEAD can be set.  That would
>>> simplify the patch set quite a bit and require no changes to iproute2.
>>>
>> You are absolutely correct that what you describe would be less churn to
>> userspace.  From a functionality standpoint that is close to what was
>> originally proposed, but Alex specifically did not like the behavioral
>> change to what having RTNH_F_DEAD set means (at least that was what I
>> understood).
>>
>> That was what made me make the move to add this additional flag that was
>> exported to userspace, so it was possible to differentiate the old dead
>> routes/nexthop functionality from those that were not going to be dead
>> due to link being down.
>> At this point I think I prefer the additional data provided by the new
>> flag exported to userspace.
>
>
> I preferred the 2 flag solution as the original solution still required 2
> flags, it just only exposed 1 to user-space.  As a result it was much more
> error prone since it was fairly easy to get into a confused state about why
> the link was dead.
>
> With the 2 flag solution it becomes much easier to sort out why the route is
> not functional and it is much easier to isolate for things like the sysctl
> which only disables the use of LINKDOWN and not DEAD.
>
> - Alex

* Re: [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status
  2015-06-11 15:44     ` Scott Feldman
@ 2015-06-11 18:17       ` Andy Gospodarek
  0 siblings, 0 replies; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11 18:17 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Netdev, David S. Miller, ddutt, Alexander Duyck,
	Hannes Frederic Sowa, stephen

On Thu, Jun 11, 2015 at 08:44:55AM -0700, Scott Feldman wrote:
> On Wed, Jun 10, 2015 at 8:19 PM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> > On Wed, Jun 10, 2015 at 08:07:10PM -0700, Scott Feldman wrote:
> >> On Wed, Jun 10, 2015 at 7:37 PM, Andy Gospodarek
> >> <gospo@cumulusnetworks.com> wrote:
> >>
> >> > There was also a request for switchdev support for this, but that will be
> >> > posted as a followup as switchdev does not currently handle dead
> >> > next-hops in a multi-path case and I felt that infra needed to be added
> >> > first.
> >>
> >> That's not true.  switchdev_fib_ipv4_add() passes *fi and all of the
> >> nexthops for the route are hanging off of that, including the
> >> nh->flags where you're setting LINKDOWN.  Multipath is not different
> >> than singlepath in that regard.  Same API for both.
> >
> > The API is the same, but I did not see a path that would take a
> > multipath route and update the dead nexthops when an interface is taken
> > down with switchdev or rocker today.
> >
> > I could be wrong (and I will test again), but create a multipath route
> > with nexthops on swp1 and swp2 and then call 'ip link set swp1 down' and
> > let me know if you see rocker's ECMP routes get updated so only the
> > nexthop on swp2 will be used.
> 
> I don't have ecmp support in rocker yet, but switchdev should be ready
> for ecmp.  I tried the test you suggest and switchdev is calling into
> the driver with updates to the routes with nhs marked DEAD.  So maybe
> your patchset is switchdev-ready?  I'd have to apply your patch to
> test. I'll wait for your v4 to address sysctl naming.

That isn't exactly what I expected by inspection, so I'm pleasantly
surprised.  I'll hold off on the excitement until I get v4 out (which
should be shortly as the request should not be too bad to resolve).

* Re: [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down
  2015-06-11  4:32         ` David Miller
@ 2015-06-11 19:35           ` Andy Gospodarek
  0 siblings, 0 replies; 25+ messages in thread
From: Andy Gospodarek @ 2015-06-11 19:35 UTC (permalink / raw)
  To: David Miller
  Cc: sfeldma, hideaki.yoshifuji, netdev, ddutt, alexander.duyck,
	hannes, stephen

On Wed, Jun 10, 2015 at 09:32:46PM -0700, David Miller wrote:
> From: Andy Gospodarek <gospo@cumulusnetworks.com>
> Date: Wed, 10 Jun 2015 23:36:21 -0400
> 
> > On Wed, Jun 10, 2015 at 08:00:14PM -0700, Scott Feldman wrote:
> >> On Wed, Jun 10, 2015 at 7:57 PM, YOSHIFUJI Hideaki
> >> <hideaki.yoshifuji@miraclelinux.com> wrote:
> >> > Hi,
> >> >
> >> > Andy Gospodarek wrote:
> >> >> This feature is only enabled with the new per-interface or ipv4 global
> >> >> sysctls called 'ignore_routes_with_linkdown'.
> >> >>
> >> >> net.ipv4.conf.all.ignore_routes_with_linkdown = 0
> >> >> net.ipv4.conf.default.ignore_routes_with_linkdown = 0
> >> >> net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
> >> > :
> >> >> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> >> >> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> >> >> ---
> >> > :
> >> >> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> >> >> index 7e7746a..c9d0a0e 100644
> >> >> --- a/kernel/sysctl_binary.c
> >> >> +++ b/kernel/sysctl_binary.c
> >> >> @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
> >> >>       { CTL_INT,      NET_IPV4_CONF_NOPOLICY,                 "disable_policy" },
> >> >>       { CTL_INT,      NET_IPV4_CONF_FORCE_IGMP_VERSION,       "force_igmp_version" },
> >> >>       { CTL_INT,      NET_IPV4_CONF_PROMOTE_SECONDARIES,      "promote_secondaries" },
> >> >> +     { CTL_INT,      NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN,      "ignore_routes_with_linkdown" },
> >> >>       {}
> >> >>  };
> >> >>
> >> >
> >> > Please do not add new binary sysctl knob. Thank you.
> >> 
> >> Reason?
> > 
> > > I'll echo Scott's request here.  I realize that an abundance of them is
> > bad, but (to me) this one seems useful.  Unless of course we want to
> > make this proposed behavior the default.  :-)
> 
> Kernel wide, new binary sysctls are verboten.
> 
> Everyone should be accessing sysctls via their name.
> 
> You have to remove this.
> 

No problem, the code as-is works just fine without, so I'll submit a v4
with this line removed.

end of thread, other threads:[~2015-06-11 19:35 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-11  2:37 [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status Andy Gospodarek
2015-06-11  2:37 ` [PATCH net-next 1/3 v3] net: track link-status of ipv4 nexthops Andy Gospodarek
2015-06-11  2:53   ` Scott Feldman
2015-06-11  3:28     ` Andy Gospodarek
2015-06-11  4:14       ` Scott Feldman
2015-06-11  6:07   ` Scott Feldman
2015-06-11 11:23     ` Andy Gospodarek
2015-06-11 14:47       ` Scott Feldman
2015-06-11 14:50       ` Alexander Duyck
2015-06-11 15:25         ` Scott Feldman
2015-06-11 16:53         ` Dinesh Dutt
2015-06-11  2:37 ` [PATCH net-next 2/3 v3] net: ipv4 sysctl option to ignore routes when nexthop link is down Andy Gospodarek
2015-06-11  2:57   ` YOSHIFUJI Hideaki
2015-06-11  3:00     ` Scott Feldman
2015-06-11  3:36       ` Andy Gospodarek
2015-06-11  4:32         ` David Miller
2015-06-11 19:35           ` Andy Gospodarek
2015-06-11  2:37 ` [PATCH net-next 3/3 v3] iproute2: add support to print 'linkdown' nexthop flag Andy Gospodarek
2015-06-11  3:02   ` Scott Feldman
2015-06-11  3:13     ` Andy Gospodarek
2015-06-11  3:07 ` [PATCH net-next 0/3 v3] changes to make ipv4 routing table aware of next-hop link status Scott Feldman
2015-06-11  3:19   ` Andy Gospodarek
2015-06-11  3:33     ` Andy Gospodarek
2015-06-11 15:44     ` Scott Feldman
2015-06-11 18:17       ` Andy Gospodarek
