Netdev Archive on lore.kernel.org
* [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions
@ 2019-06-18 13:20 Stefano Brivio
  2019-06-18 13:20 ` [PATCH net v5 1/6] fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED Stefano Brivio
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Stefano Brivio @ 2019-06-18 13:20 UTC
  To: David Miller, David Ahern
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

For IPv6 cached routes, the commands 'ip -6 route list cache' and
'ip -6 route flush cache' don't work at all after route exceptions have
been moved to a separate hash table in commit 2b760fcf5cfb ("ipv6: hook
up exception table to store dst cache").

For IPv4 cached routes, the command 'ip route list cache' also stopped
working in kernel 3.5, after commit 4895c771c7f0 ("ipv4: Add FIB nexthop
exceptions.") introduced storage for route exceptions as a separate
entity.

Fix this by allowing userspace to clearly request cached routes with
the RTM_F_CLONED flag used as a filter (in conjunction with strict
checking) and by retrieving and dumping cached routes if requested.

If strict checking is not requested (iproute2 < 5.0.0), we don't have a
way to consistently filter results on other selectors (e.g. on tables),
so skip filtering entirely and dump both regular routes and exceptions.

I'm submitting this for net as these changes fix rather relevant
breakages. However, the scope might be a bit broad, and said breakages
were introduced 7 and 2 years ago for IPv4 and IPv6, respectively.
Let me know if I should rebase this on net-next instead.

For IPv4, cache flushing uses a completely different mechanism, so it
wasn't affected. Listing of exception routes (modified routes pre-3.5) was
tested against these versions of kernel and iproute2:

                    iproute2
kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
 3.5-rc4         +        +        +        +       +
 4.4
 4.9
 4.14
 4.15
 4.19
 5.0
 5.1
 fixed           +        +        +        +       +


For IPv6, a separate iproute2 patch is required. Versions of iproute2
and kernel tested:

                    iproute2
kernel             4.14.0   4.15.0   4.19.0   5.0.0   5.1.0    5.1.0, patched
 3.18    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.4     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.9     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.14    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.15    list
         flush
 4.19    list
         flush
 5.0     list
         flush
 5.1     list
         flush
 with    list        +        +        +        +       +            +
 fix     flush       +        +        +                             +

v5: Skip filtering altogether if no strict checking is requested: selecting
    routes or exceptions only would be inconsistent with the fact we can't
    filter on tables. Drop 1/8 (non-strict dump filter function no longer
    needed), replace 2/8 (don't use NLM_F_MATCH, decide to skip routes or
    exceptions in filter function), drop 6/8 (2/8 is enough for IPv6 too).
    Introduce dump_routes and dump_exceptions flags in filter, adapt other
    patches to that.

v4: Fix the listing issue also for IPv4, making the behaviour consistent
    with IPv6. Honour NLM_F_MATCH as per RFC 3549 and allow usage of
    RTM_F_CLONED filter. Split patches into smaller logical changes.

v3: Drop check on RTM_F_CLONED and rework logic of return values of
    rt6_dump_route()

v2: Add count of routes handled in partial dumps, and skip them, in patch 1/2.

Stefano Brivio (6):
  fib_frontend, ip6_fib: Select routes or exceptions dump from
    RTM_F_CLONED
  ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering
  ipv4: Dump route exceptions if requested
  Revert "net/ipv6: Bail early if user only wants cloned entries"
  ipv6: Dump route exceptions if requested
  ip6_fib: Don't discard nodes with valid routing information in
    fib6_locate_1()

 include/net/ip6_fib.h   |   1 +
 include/net/ip6_route.h |   2 +-
 include/net/ip_fib.h    |   2 +
 include/net/route.h     |   3 ++
 net/ipv4/fib_frontend.c |  12 +++--
 net/ipv4/fib_trie.c     | 101 +++++++++++++++++++++++++++++++++++-----
 net/ipv4/route.c        |   6 +--
 net/ipv6/ip6_fib.c      |  27 +++++++----
 net/ipv6/route.c        |  85 ++++++++++++++++++++++++++++-----
 9 files changed, 199 insertions(+), 40 deletions(-)

-- 
2.20.1



* [PATCH net v5 1/6] fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED
  2019-06-18 13:20 [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
@ 2019-06-18 13:20 ` Stefano Brivio
  2019-06-18 14:49   ` David Ahern
  2019-06-18 13:20 ` [PATCH net v5 2/6] ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering Stefano Brivio
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2019-06-18 13:20 UTC
  To: David Miller, David Ahern
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

The following patches add back the ability to dump IPv4 and IPv6 exception
routes, and we need to allow selection of regular routes or exceptions.

Use RTM_F_CLONED as filter to decide whether to dump routes or exceptions:
iproute2 passes it in dump requests (except for IPv6 cache flush requests,
this will be fixed in iproute2) and this used to work as long as
exceptions were stored directly in the FIB, for both IPv4 and IPv6.

Caveat: if strict checking is not requested (that is, if the dump request
doesn't go through ip_valid_fib_dump_req()), we can't filter on protocol,
tables or route types.

In this case, filtering on RTM_F_CLONED would be inconsistent: we would
fix 'ip route list cache' by returning exception routes and at the same
time introduce another bug in case another selector is present, e.g. on
'ip route list cache table main' we would return all exception routes,
without filtering on tables.

Keep this consistent by applying no filters at all, and dumping both
routes and exceptions, if strict checking is not requested. iproute2
currently filters results anyway, and no unwanted results will be
presented to the user. The kernel will just dump more data than needed.
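
The selection rule described above can be sketched in plain C as follows.
This is only an illustration, not the kernel code: struct demo_filter and
demo_select() are hypothetical names, and DEMO_RTM_F_CLONED is a local copy
of the RTM_F_CLONED value so that the sketch builds without kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as RTM_F_CLONED in <linux/rtnetlink.h>, redefined locally so
 * that this sketch does not depend on kernel headers.
 */
#define DEMO_RTM_F_CLONED 0x200u

/* Hypothetical stand-in for the two new fields of struct fib_dump_filter. */
struct demo_filter {
	bool dump_routes;
	bool dump_exceptions;
};

/* Both flags start out true; only a strict-checking request narrows them. */
static struct demo_filter demo_select(unsigned int rtm_flags, bool strict_check)
{
	struct demo_filter f = { .dump_routes = true, .dump_exceptions = true };

	if (!strict_check)
		return f;	/* no consistent filtering possible: dump both */

	if (rtm_flags & DEMO_RTM_F_CLONED)
		f.dump_routes = false;		/* exceptions only */
	else
		f.dump_exceptions = false;	/* regular routes only */

	return f;
}

/* Usage example covering the three cases named in the commit message. */
static void demo_select_check(void)
{
	struct demo_filter f;

	f = demo_select(DEMO_RTM_F_CLONED, true);
	assert(!f.dump_routes && f.dump_exceptions);

	f = demo_select(0, true);
	assert(f.dump_routes && !f.dump_exceptions);

	f = demo_select(DEMO_RTM_F_CLONED, false);
	assert(f.dump_routes && f.dump_exceptions);
}
```

Starting with both flags set and clearing the unwanted one keeps the
non-strict path free of special cases: it simply never clears anything.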

v5: New patch: add dump_routes and dump_exceptions flags in filter and
    simply clear the unwanted one if strict checking is enabled, don't
    ignore NLM_F_MATCH and don't set filter_set if NLM_F_MATCH is set.
    Skip filtering altogether if no strict checking is requested:
    selecting routes or exceptions only would be inconsistent with the
    fact we can't filter on tables.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/net/ip_fib.h    | 2 ++
 net/ipv4/fib_frontend.c | 8 +++++++-
 net/ipv6/ip6_fib.c      | 3 ++-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index bbeff32fb6cb..32a37f1afb8e 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -243,6 +243,8 @@ struct fib_dump_filter {
 	/* filter_set is an optimization that an entry is set */
 	bool			filter_set;
 	bool			dump_all_families;
+	bool			dump_routes;
+	bool			dump_exceptions;
 	unsigned char		protocol;
 	unsigned char		rt_type;
 	unsigned int		flags;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index e54c2bcbb465..c28d60d6c9d0 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -881,10 +881,15 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 		NL_SET_ERR_MSG(extack, "Invalid values in header for FIB dump request");
 		return -EINVAL;
 	}
+
 	if (rtm->rtm_flags & ~(RTM_F_CLONED | RTM_F_PREFIX)) {
 		NL_SET_ERR_MSG(extack, "Invalid flags for FIB dump request");
 		return -EINVAL;
 	}
+	if (rtm->rtm_flags & RTM_F_CLONED)
+		filter->dump_routes = false;
+	else
+		filter->dump_exceptions = false;
 
 	filter->dump_all_families = (rtm->rtm_family == AF_UNSPEC);
 	filter->flags    = rtm->rtm_flags;
@@ -931,9 +936,10 @@ EXPORT_SYMBOL_GPL(ip_valid_fib_dump_req);
 
 static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	struct fib_dump_filter filter = { .dump_routes = true,
+					  .dump_exceptions = true };
 	const struct nlmsghdr *nlh = cb->nlh;
 	struct net *net = sock_net(skb->sk);
-	struct fib_dump_filter filter = {};
 	unsigned int h, s_h;
 	unsigned int e = 0, s_e;
 	struct fib_table *tb;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 9180c8b6f764..0f58596fd0b1 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -558,9 +558,10 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 
 static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	struct rt6_rtnl_dump_arg arg = { .filter.dump_exceptions = true,
+					 .filter.dump_routes = true };
 	const struct nlmsghdr *nlh = cb->nlh;
 	struct net *net = sock_net(skb->sk);
-	struct rt6_rtnl_dump_arg arg = {};
 	unsigned int h, s_h;
 	unsigned int e = 0, s_e;
 	struct fib6_walker *w;
-- 
2.20.1



* [PATCH net v5 2/6] ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering
  2019-06-18 13:20 [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
  2019-06-18 13:20 ` [PATCH net v5 1/6] fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED Stefano Brivio
@ 2019-06-18 13:20 ` Stefano Brivio
  2019-06-18 14:49   ` David Ahern
  2019-06-18 13:20 ` [PATCH net v5 3/6] ipv4: Dump route exceptions if requested Stefano Brivio
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2019-06-18 13:20 UTC
  To: David Miller, David Ahern
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

This functionally reverts the check introduced by commit
e8ba330ac0c5 ("rtnetlink: Update fib dumps for strict data checking")
as modified by commit e4e92fb160d7 ("net/ipv4: Bail early if user only
wants prefix entries").

As we are preparing to fix listing of IPv4 cached routes, we need to
give userspace a way to request them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v5: No changes

v4: New patch

 net/ipv4/fib_frontend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c28d60d6c9d0..fced49e473c7 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -956,8 +956,8 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 		filter.flags = rtm->rtm_flags & (RTM_F_PREFIX | RTM_F_CLONED);
 	}
 
-	/* fib entries are never clones and ipv4 does not use prefix flag */
-	if (filter.flags & (RTM_F_PREFIX | RTM_F_CLONED))
+	/* ipv4 does not use prefix flag */
+	if (filter.flags & RTM_F_PREFIX)
 		return skb->len;
 
 	if (filter.table_id) {
-- 
2.20.1



* [PATCH net v5 3/6] ipv4: Dump route exceptions if requested
  2019-06-18 13:20 [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
  2019-06-18 13:20 ` [PATCH net v5 1/6] fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED Stefano Brivio
  2019-06-18 13:20 ` [PATCH net v5 2/6] ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering Stefano Brivio
@ 2019-06-18 13:20 ` Stefano Brivio
  2019-06-18 14:48   ` David Ahern
  2019-06-18 13:20 ` [PATCH net v5 4/6] Revert "net/ipv6: Bail early if user only wants cloned entries" Stefano Brivio
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2019-06-18 13:20 UTC
  To: David Miller, David Ahern
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

Since commit 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions."), cached
exception routes are stored as a separate entity, so they are not dumped
on a FIB dump, even if the RTM_F_CLONED flag is passed.

This implies that the command 'ip route list cache' doesn't return any
result anymore.

If the RTM_F_CLONED flag is passed, and strict checking requested,
retrieve nexthop exception routes and dump them. If no strict checking
is requested, filtering can't be performed consistently: dump everything
in that case.

With this, we need to add an argument to the netlink callback in order to
track how many entries were already dumped for the last leaf included in
a partial netlink dump.

Note that this is only as accurate as the existing tracking mechanism for
leaves: if a partial dump is restarted after exceptions are removed or
expired, we might skip some non-dumped entries. To improve this, we could
attach a 'sernum' attribute (similar to the one used for IPv6) to nexthop
entities, and bump this counter whenever exceptions change.
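
The restart logic described above can be sketched as a small userspace
simulation. It is a toy model under stated assumptions, not the kernel
implementation: struct demo_msg stands in for a netlink message with room
for only four entries, and all demo_* names are hypothetical. The saved
index plays the role of the extra callback argument.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy message buffer, deliberately small to force partial dumps. */
struct demo_msg {
	int entries[4];
	size_t used;
};

static int demo_put(struct demo_msg *msg, int entry)
{
	if (msg->used == sizeof(msg->entries) / sizeof(msg->entries[0]))
		return -EMSGSIZE;
	msg->entries[msg->used++] = entry;
	return 0;
}

/* Walk all entries of one leaf. Entries with an index below 'start' went out
 * in an earlier message and are skipped. On return, *index is the position
 * to resume from if the message filled up.
 */
static int demo_dump_leaf(const int *leaf, size_t n, struct demo_msg *msg,
			  size_t *index, size_t start)
{
	size_t k;

	*index = 0;
	for (k = 0; k < n; k++) {
		if (*index >= start) {
			int err = demo_put(msg, leaf[k]);

			if (err)
				return err;
		}
		(*index)++;
	}
	return 0;
}

/* Dump a whole leaf across as many messages as needed, collecting the
 * entries in 'out'. Returns the number of messages used.
 */
static size_t demo_dump_all(const int *leaf, size_t n, int *out)
{
	size_t start = 0, index, total = 0, msgs = 0, k;
	int err;

	do {
		struct demo_msg msg = { .used = 0 };

		err = demo_dump_leaf(leaf, n, &msg, &index, start);
		for (k = 0; k < msg.used; k++)
			out[total++] = msg.entries[k];
		start = index;	/* saved between passes, like cb->args[] */
		msgs++;
	} while (err);

	return msgs;
}

/* Usage example: ten entries need three messages of four entries each. */
static void demo_dump_check(void)
{
	const int leaf[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
	int out[10];
	size_t k;

	assert(demo_dump_all(leaf, 10, out) == 3);
	for (k = 0; k < 10; k++)
		assert(out[k] == leaf[k]);
}
```

Because only a position is saved, the model has the same limitation noted
above: if entries disappear between two passes, the saved index points past
entries that were never sent.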

Listing of exception routes (pre-3.5: modified routes) was tested against
these versions of kernel and iproute2:

                    iproute2
kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
 3.5-rc4         +        +        +        +       +
 4.4
 4.9
 4.14
 4.15
 4.19
 5.0
 5.1
 fixed           +        +        +        +       +

Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v5: Trivial adaptation for 1/6

v4: New patch

 include/net/route.h |   3 ++
 net/ipv4/fib_trie.c | 101 ++++++++++++++++++++++++++++++++++++++------
 net/ipv4/route.c    |   6 +--
 3 files changed, 95 insertions(+), 15 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 065b47754f05..f0d0086e76ce 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -221,6 +221,9 @@ void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
 struct rtable *rt_dst_alloc(struct net_device *dev,
 			     unsigned int flags, u16 type,
 			     bool nopolicy, bool noxfrm, bool will_cache);
+int rt_fill_info(struct net *net, __be32 dst, __be32 src, struct rtable *rt,
+		 u32 table_id, struct flowi4 *fl4, struct sk_buff *skb,
+		 u32 portid, u32 seq);
 
 struct in_ifaddr;
 void fib_add_ifaddr(struct in_ifaddr *);
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 868c74771fa9..a00408827ae8 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2000,28 +2000,92 @@ void fib_free_table(struct fib_table *tb)
 	call_rcu(&tb->rcu, __trie_free_rcu);
 }
 
+static int fib_dump_fnhe_from_leaf(struct fib_alias *fa, struct sk_buff *skb,
+				   struct netlink_callback *cb,
+				   int *fa_index, int fa_start)
+{
+	struct net *net = sock_net(cb->skb->sk);
+	struct fib_info *fi = fa->fa_info;
+	struct fnhe_hash_bucket *bucket;
+	struct fib_nh_common *nhc;
+	int i, genid;
+
+	if (!fi || fi->fib_flags & RTNH_F_DEAD)
+		return 0;
+
+	nhc = fib_info_nhc(fi, 0);
+	if (nhc->nhc_flags & RTNH_F_DEAD)
+		return 0;
+
+	bucket = rcu_dereference(nhc->nhc_exceptions);
+	if (!bucket)
+		return 0;
+
+	genid = fnhe_genid(net);
+
+	for (i = 0; i < FNHE_HASH_SIZE; i++) {
+		struct fib_nh_exception *fnhe;
+
+		for (fnhe = rcu_dereference(bucket[i].chain); fnhe;
+		     fnhe = rcu_dereference(fnhe->fnhe_next)) {
+			struct flowi4 fl4 = {};
+			struct rtable *rt;
+			int err;
+
+			if (*fa_index < fa_start)
+				goto next;
+
+			if (fnhe->fnhe_genid != genid)
+				goto next;
+
+			if (fnhe->fnhe_expires &&
+			    time_after(jiffies, fnhe->fnhe_expires))
+				goto next;
+
+			rt = rcu_dereference(fnhe->fnhe_rth_input);
+			if (!rt)
+				rt = rcu_dereference(fnhe->fnhe_rth_output);
+			if (!rt)
+				goto next;
+
+			err = rt_fill_info(net, fnhe->fnhe_daddr, 0, rt,
+					   fa->tb_id, &fl4, skb,
+					   NETLINK_CB(cb->skb).portid,
+					   cb->nlh->nlmsg_seq);
+			if (err)
+				return err;
+next:
+			(*fa_index)++;
+		}
+	}
+
+	return 0;
+}
+
 static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 			     struct sk_buff *skb, struct netlink_callback *cb,
 			     struct fib_dump_filter *filter)
 {
 	unsigned int flags = NLM_F_MULTI;
 	__be32 xkey = htonl(l->key);
+	int i, s_i, i_fa, s_fa, err;
 	struct fib_alias *fa;
-	int i, s_i;
 
-	if (filter->filter_set)
+	if (filter->filter_set ||
+	    !filter->dump_exceptions || !filter->dump_routes)
 		flags |= NLM_F_DUMP_FILTERED;
 
 	s_i = cb->args[4];
+	s_fa = cb->args[5];
 	i = 0;
 
 	/* rcu_read_lock is hold by caller */
 	hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
-		int err;
-
 		if (i < s_i)
 			goto next;
 
+		i_fa = 0;
+
 		if (tb->tb_id != fa->tb_id)
 			goto next;
 
@@ -2038,21 +2102,34 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 				goto next;
 		}
 
-		err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
-				    cb->nlh->nlmsg_seq, RTM_NEWROUTE,
-				    tb->tb_id, fa->fa_type,
-				    xkey, KEYLENGTH - fa->fa_slen,
-				    fa->fa_tos, fa->fa_info, flags);
-		if (err < 0) {
-			cb->args[4] = i;
-			return err;
+		if (filter->dump_routes && !s_fa) {
+			err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
+					    cb->nlh->nlmsg_seq, RTM_NEWROUTE,
+					    tb->tb_id, fa->fa_type,
+					    xkey, KEYLENGTH - fa->fa_slen,
+					    fa->fa_tos, fa->fa_info, flags);
+			if (err < 0)
+				goto stop;
+			i_fa++;
+		}
+
+		if (filter->dump_exceptions) {
+			err = fib_dump_fnhe_from_leaf(fa, skb, cb, &i_fa, s_fa);
+			if (err < 0)
+				goto stop;
 		}
+
 next:
 		i++;
 	}
 
 	cb->args[4] = i;
 	return skb->len;
+
+stop:
+	cb->args[4] = i;
+	cb->args[5] = i_fa;
+	return err;
 }
 
 /* rcu_read_lock needs to be hold by caller from readside */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6cb7cff22db9..cc970fd861e8 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2663,9 +2663,9 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
 /* called with rcu_read_lock held */
-static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
-			struct rtable *rt, u32 table_id, struct flowi4 *fl4,
-			struct sk_buff *skb, u32 portid, u32 seq)
+int rt_fill_info(struct net *net, __be32 dst, __be32 src, struct rtable *rt,
+		 u32 table_id, struct flowi4 *fl4, struct sk_buff *skb,
+		 u32 portid, u32 seq)
 {
 	struct rtmsg *r;
 	struct nlmsghdr *nlh;
-- 
2.20.1



* [PATCH net v5 4/6] Revert "net/ipv6: Bail early if user only wants cloned entries"
  2019-06-18 13:20 [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
                   ` (2 preceding siblings ...)
  2019-06-18 13:20 ` [PATCH net v5 3/6] ipv4: Dump route exceptions if requested Stefano Brivio
@ 2019-06-18 13:20 ` Stefano Brivio
  2019-06-18 14:51   ` David Ahern
  2019-06-18 13:20 ` [PATCH net v5 5/6] ipv6: Dump route exceptions if requested Stefano Brivio
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2019-06-18 13:20 UTC
  To: David Miller, David Ahern
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

This reverts commit 08e814c9e8eb5a982cbd1e8f6bd255d97c51026f: as we
are preparing to fix listing and dumping of IPv6 cached routes, we
need to allow RTM_F_CLONED as a flag to match routes against while
dumping them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v5: No changes

v4: New patch

 net/ipv6/ip6_fib.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 0f58596fd0b1..e846192573b0 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -578,13 +578,10 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
 		struct rtmsg *rtm = nlmsg_data(nlh);
 
-		arg.filter.flags = rtm->rtm_flags & (RTM_F_PREFIX|RTM_F_CLONED);
+		if (rtm->rtm_flags & RTM_F_PREFIX)
+			arg.filter.flags = RTM_F_PREFIX;
 	}
 
-	/* fib entries are never clones */
-	if (arg.filter.flags & RTM_F_CLONED)
-		goto out;
-
 	w = (void *)cb->args[2];
 	if (!w) {
 		/* New dump:
-- 
2.20.1



* [PATCH net v5 5/6] ipv6: Dump route exceptions if requested
  2019-06-18 13:20 [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
                   ` (3 preceding siblings ...)
  2019-06-18 13:20 ` [PATCH net v5 4/6] Revert "net/ipv6: Bail early if user only wants cloned entries" Stefano Brivio
@ 2019-06-18 13:20 ` Stefano Brivio
  2019-06-18 15:19   ` David Ahern
  2019-06-18 13:20 ` [PATCH net v5 6/6] ip6_fib: Don't discard nodes with valid routing information in fib6_locate_1() Stefano Brivio
  2019-06-18 14:51 ` [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions David Ahern
  6 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2019-06-18 13:20 UTC
  To: David Miller, David Ahern
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

Since commit 2b760fcf5cfb ("ipv6: hook up exception table to store dst
cache"), route exceptions reside in a separate hash table, and won't be
found by walking the FIB, so they won't be dumped to userspace on a
RTM_GETROUTE message.

This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
no longer have any effect:

 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
 # ip -6 route list cache
 # ip -6 route flush cache
 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium

because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
by listing all the routes, and deleting them with RTM_DELROUTE one by one.
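
For illustration, a dump request for IPv6 cached routes could be laid out
roughly as below. This is a minimal sketch assuming Linux UAPI headers are
available; struct demo_dump_req and demo_build_cache_dump() are hypothetical
names and not iproute2 code. Strict checking is requested separately, through
the NETLINK_GET_STRICT_CHK socket option, which is not shown here.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request layout: netlink header followed by a route header. */
struct demo_dump_req {
	struct nlmsghdr nlh;
	struct rtmsg rtm;
};

/* Fill in an RTM_GETROUTE dump request asking for cloned (cached) routes. */
static void demo_build_cache_dump(struct demo_dump_req *req, unsigned int seq)
{
	memset(req, 0, sizeof(*req));

	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
	req->nlh.nlmsg_type = RTM_GETROUTE;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req->nlh.nlmsg_seq = seq;

	req->rtm.rtm_family = AF_INET6;
	req->rtm.rtm_flags = RTM_F_CLONED;	/* the filter this series honours */
}

/* Usage example: build one request and check the fields that matter. */
static void demo_build_check(void)
{
	struct demo_dump_req req;

	demo_build_cache_dump(&req, 1);
	assert(req.nlh.nlmsg_type == RTM_GETROUTE);
	assert(req.nlh.nlmsg_flags & NLM_F_DUMP);
	assert(req.rtm.rtm_family == AF_INET6);
	assert(req.rtm.rtm_flags & RTM_F_CLONED);
	assert(req.nlh.nlmsg_len == NLMSG_HDRLEN + sizeof(struct rtmsg));
}
```

A flush would walk the replies to such a dump and send one RTM_DELROUTE per
returned route, which is why a dump that returns nothing also makes the
flush a no-op.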

If cached routes are requested using the RTM_F_CLONED flag together with
strict checking, or if no strict checking is requested (and hence we can't
consistently apply filters), look up exceptions in the hash table
associated with the current fib6_info in rt6_dump_route(), and, if present
and not expired, add them to the dump.

We might be unable to dump all the entries for a given node in a single
message, so keep track of how many entries were handled for the current
node in fib6_walker, and skip that amount in case we start from the same
partially dumped node.

Note that, with the current version of iproute2, this only fixes the
'ip -6 route list cache': on a flush command, iproute2 doesn't pass
RTM_F_CLONED and, due to this inconsistency, 'ip -6 route flush cache' is
still unable to fetch the routes to be flushed. This is now addressed in a
patch for iproute2.

To flush cached routes, a procfs entry could be introduced instead: that's
how it works for IPv4. We already have a rt6_flush_exception() function
ready to be wired to it. However, this would not solve the issue for
listing.

Versions of iproute2 and kernel tested:

                    iproute2
kernel             4.14.0   4.15.0   4.19.0   5.0.0   5.1.0    5.1.0, patched
 3.18    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.4     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.9     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.14    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.15    list
         flush
 4.19    list
         flush
 5.0     list
         flush
 5.1     list
         flush
 with    list        +        +        +        +       +            +
 fix     flush       +        +        +                             +

v5:
  - use dump_routes and dump_exceptions from filter, ignore NLM_F_MATCH,
    update test results (flushing works with iproute2 < 5.0.0 now)

v4:
  - split NLM_F_MATCH and strict check handling in separate patches
  - filter routes using RTM_F_CLONED: if it's not set, only return
    non-cached routes, and if it's set, only return cached routes:
    change requested by David Ahern and Martin Lau. This implies that
    iproute2 needs a separate patch to be able to flush IPv6 cached
    routes. This is not ideal because we can't fix the breakage caused
    by 2b760fcf5cfb entirely in kernel. However, two years have passed
    since then, and this makes it more tolerable

v3:
  - more descriptive comment about expired exceptions in rt6_dump_route()
  - swap return values of rt6_dump_route() (suggested by Martin Lau)
  - don't zero skip_in_node in case we don't dump anything in a given pass
    (also suggested by Martin Lau)
  - remove check on RTM_F_CLONED altogether: in the current UAPI semantic,
    it's just a flag to indicate the route was cloned, not to filter on
    routes

v2: Add tracking of number of entries to be skipped in current node after
    a partial dump. As we restart from the same node, if not all the
    exceptions for a given node fit in a single message, the dump would
    otherwise not terminate (suggested by Martin Lau). This is a concrete
    possibility: setting up a large number of exceptions for the same
    route actually causes the issue (suggested by David Ahern).

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
This will cause a non-trivial conflict with commit cc5c073a693f
("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
an equivalent patch against net-next, if it helps.

 include/net/ip6_fib.h   |  1 +
 include/net/ip6_route.h |  2 +-
 net/ipv6/ip6_fib.c      | 14 ++++++-
 net/ipv6/route.c        | 85 +++++++++++++++++++++++++++++++++++------
 4 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 855b352b660f..5909a9d8ff67 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -312,6 +312,7 @@ struct fib6_walker {
 	enum fib6_walk_state state;
 	unsigned int skip;
 	unsigned int count;
+	unsigned int skip_in_node;
 	int (*func)(struct fib6_walker *);
 	void *args;
 };
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 4790beaa86e0..b66c4aac56ab 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -178,7 +178,7 @@ struct rt6_rtnl_dump_arg {
 	struct fib_dump_filter filter;
 };
 
-int rt6_dump_route(struct fib6_info *f6i, void *p_arg);
+int rt6_dump_route(struct fib6_info *f6i, void *p_arg, unsigned int skip);
 void rt6_mtu_change(struct net_device *dev, unsigned int mtu);
 void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
 void rt6_clean_tohost(struct net *net, struct in6_addr *gateway);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index e846192573b0..fc93e1b439a3 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -469,12 +469,19 @@ static int fib6_dump_node(struct fib6_walker *w)
 	struct fib6_info *rt;
 
 	for_each_fib6_walker_rt(w) {
-		res = rt6_dump_route(rt, w->args);
-		if (res < 0) {
+		res = rt6_dump_route(rt, w->args, w->skip_in_node);
+		if (res >= 0) {
 			/* Frame is full, suspend walking */
 			w->leaf = rt;
+
+			/* We'll restart from this node, so if some routes were
+			 * already dumped, skip them next time.
+			 */
+			w->skip_in_node += res;
+
 			return 1;
 		}
+		w->skip_in_node = 0;
 
 		/* Multipath routes are dumped in one route with the
 		 * RTA_MULTIPATH attribute. Jump 'rt' to point to the
@@ -526,6 +533,7 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 	if (cb->args[4] == 0) {
 		w->count = 0;
 		w->skip = 0;
+		w->skip_in_node = 0;
 
 		spin_lock_bh(&table->tb6_lock);
 		res = fib6_walk(net, w);
@@ -541,6 +549,7 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 			w->state = FWS_INIT;
 			w->node = w->root;
 			w->skip = w->count;
+			w->skip_in_node = 0;
 		} else
 			w->skip = 0;
 
@@ -2039,6 +2048,7 @@ static void fib6_clean_tree(struct net *net, struct fib6_node *root,
 	c.w.func = fib6_clean_node;
 	c.w.count = 0;
 	c.w.skip = 0;
+	c.w.skip_in_node = 0;
 	c.func = func;
 	c.sernum = sernum;
 	c.arg = arg;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0f60eb3a2873..7375f3b7d310 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4854,33 +4854,94 @@ static bool fib6_info_uses_dev(const struct fib6_info *f6i,
 	return false;
 }
 
-int rt6_dump_route(struct fib6_info *rt, void *p_arg)
+/* Return -1 if done with node, number of handled routes on partial dump */
+int rt6_dump_route(struct fib6_info *rt, void *p_arg, unsigned int skip)
 {
 	struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
 	struct fib_dump_filter *filter = &arg->filter;
+	struct rt6_exception_bucket *bucket;
 	unsigned int flags = NLM_F_MULTI;
+	struct rt6_exception *rt6_ex;
 	struct net *net = arg->net;
+	int i, count = 0;
 
 	if (rt == net->ipv6.fib6_null_entry)
-		return 0;
+		return -1;
 
 	if ((filter->flags & RTM_F_PREFIX) &&
 	    !(rt->fib6_flags & RTF_PREFIX_RT)) {
 		/* success since this is not a prefix route */
-		return 1;
+		return -1;
 	}
-	if (filter->filter_set) {
-		if ((filter->rt_type && rt->fib6_type != filter->rt_type) ||
-		    (filter->dev && !fib6_info_uses_dev(rt, filter->dev)) ||
-		    (filter->protocol && rt->fib6_protocol != filter->protocol)) {
-			return 1;
-		}
+	if (filter->filter_set &&
+	    ((filter->rt_type  && rt->fib6_type != filter->rt_type) ||
+	     (filter->dev      && !fib6_info_uses_dev(rt, filter->dev)) ||
+	     (filter->protocol && rt->fib6_protocol != filter->protocol))) {
+		return -1;
+	}
+
+	if (filter->filter_set ||
+	    !filter->dump_routes || !filter->dump_exceptions) {
 		flags |= NLM_F_DUMP_FILTERED;
 	}
 
-	return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
-			     RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
-			     arg->cb->nlh->nlmsg_seq, flags);
+	if (filter->dump_routes) {
+		if (skip) {
+			skip--;
+		} else {
+			if (rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL,
+					  0, RTM_NEWROUTE,
+					  NETLINK_CB(arg->cb->skb).portid,
+					  arg->cb->nlh->nlmsg_seq, flags)) {
+				return 0;
+			}
+			count++;
+		}
+	}
+
+	if (!filter->dump_exceptions)
+		return -1;
+
+	bucket = rcu_dereference(rt->rt6i_exception_bucket);
+	if (!bucket)
+		return -1;
+
+	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
+		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
+			if (skip) {
+				skip--;
+				continue;
+			}
+
+			/* Expiration of entries doesn't bump sernum, insertion
+			 * does. Removal is triggered by insertion, so we can
+			 * rely on the fact that if entries change between two
+			 * partial dumps, this node is scanned again completely,
+			 * see rt6_insert_exception() and fib6_dump_table().
+			 *
+			 * Count expired entries we go through as handled
+			 * entries that we'll skip next time, in case of partial
+			 * node dump. Otherwise, if entries expire meanwhile,
+			 * we'll skip the wrong amount.
+			 */
+			if (rt6_check_expired(rt6_ex->rt6i)) {
+				count++;
+				continue;
+			}
+
+			if (rt6_fill_node(net, arg->skb, rt, &rt6_ex->rt6i->dst,
+					  NULL, NULL, 0, RTM_NEWROUTE,
+					  NETLINK_CB(arg->cb->skb).portid,
+					  arg->cb->nlh->nlmsg_seq, flags)) {
+				return count;
+			}
+
+			count++;
+		}
+		bucket++;
+	}
+
+	return -1;
 }
 
 static int inet6_rtm_valid_getroute_req(struct sk_buff *skb,
-- 
2.20.1
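
The return convention used by rt6_dump_route() above (-1 once the node is fully handled, otherwise the number of entries handled so far) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: dump_node(), struct entry and the 'room' parameter (standing in for the space left in the skb) are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a route exception; not a kernel type. */
struct entry {
	int id;
	bool expired;
};

/*
 * Userspace model of the skip/count convention in rt6_dump_route().
 * Skips the first 'skip' entries, then emits non-expired ones into 'out'
 * until 'room' slots are used up ('room' models remaining skb space).
 *
 * Returns -1 if the whole node was handled, otherwise the number of
 * entries handled during this call, so the caller can add it to 'skip'.
 */
static int dump_node(const struct entry *e, size_t n, unsigned int skip,
		     int *out, size_t room, size_t *emitted)
{
	int count = 0;
	size_t i;

	*emitted = 0;
	for (i = 0; i < n; i++) {
		if (skip) {
			skip--;
			continue;
		}
		/* Expired entries count as handled so the next skip is right. */
		if (e[i].expired) {
			count++;
			continue;
		}
		if (*emitted == room)
			return count;	/* buffer full: partial dump */
		out[(*emitted)++] = e[i].id;
		count++;
	}
	return -1;
}
```

A caller would keep a running skip value per node, add each non-negative return value to it, and call again with the updated value until -1 comes back.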



* [PATCH net v5 6/6] ip6_fib: Don't discard nodes with valid routing information in fib6_locate_1()
  2019-06-18 13:20 [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
                   ` (4 preceding siblings ...)
  2019-06-18 13:20 ` [PATCH net v5 5/6] ipv6: Dump route exceptions if requested Stefano Brivio
@ 2019-06-18 13:20 ` Stefano Brivio
  2019-06-18 14:51 ` [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions David Ahern
  6 siblings, 0 replies; 16+ messages in thread
From: Stefano Brivio @ 2019-06-18 13:20 UTC (permalink / raw)
  To: David Miller, David Ahern
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

When we perform an inexact match on FIB nodes via fib6_locate_1(), longer
prefixes will be preferred to shorter ones. However, it might happen that
a node, with higher fn_bit value than some other, has no valid routing
information.

In this case, we'll pick that node, but it will be discarded by the check
on RTN_RTINFO in fib6_locate(), and we might miss nodes with valid routing
information but with lower fn_bit value.

This is apparent when a routing exception is created for a default route:
 # ip -6 route list
 fc00:1::/64 dev veth_A-R1 proto kernel metric 256 pref medium
 fc00:2::/64 dev veth_A-R2 proto kernel metric 256 pref medium
 fc00:4::1 via fc00:2::2 dev veth_A-R2 metric 1024 pref medium
 fe80::/64 dev veth_A-R1 proto kernel metric 256 pref medium
 fe80::/64 dev veth_A-R2 proto kernel metric 256 pref medium
 default via fc00:1::2 dev veth_A-R1 metric 1024 pref medium
 # ip -6 route list cache
 fc00:4::1 via fc00:2::2 dev veth_A-R2 metric 1024 expires 593sec mtu 1500 pref medium
 fc00:3::1 via fc00:1::2 dev veth_A-R1 metric 1024 expires 593sec mtu 1500 pref medium
 # ip -6 route flush cache    # node for default route is discarded
 Failed to send flush request: No such process
 # ip -6 route list cache
 fc00:3::1 via fc00:1::2 dev veth_A-R1 metric 1024 expires 586sec mtu 1500 pref medium

Check right away if the node has the RTN_RTINFO flag, before replacing the
'prev' pointer, which indicates the longest matching prefix found so far.

Fixes: 38fbeeeeccdb ("ipv6: prepare fib6_locate() for exception table")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v5: No changes

v4: No changes

v3: No changes

v2: No changes

 net/ipv6/ip6_fib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index fc93e1b439a3..17c75ff2fa63 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1543,7 +1543,8 @@ static struct fib6_node *fib6_locate_1(struct fib6_node *root,
 		if (plen == fn->fn_bit)
 			return fn;
 
-		prev = fn;
+		if (fn->fn_flags & RTN_RTINFO)
+			prev = fn;
 
 next:
 		/*
-- 
2.20.1
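
The effect of the one-line change above can be modelled outside the kernel. The sketch below is an assumption-laden simplification: it flattens the trie walk into an array ordered by prefix length, and struct node, locate() and NODE_HAS_RTINFO are hypothetical names modelled on fn_bit and RTN_RTINFO. It only shows why 'prev' should be updated solely for nodes that carry routing information.

```c
#include <stddef.h>

#define NODE_HAS_RTINFO 0x1	/* hypothetical flag, modelled on RTN_RTINFO */

/* Hypothetical simplified node: one entry on the path walked for a lookup. */
struct node {
	int bit;		/* prefix length, like fn_bit */
	unsigned int flags;
};

/*
 * Walk 'path' (ordered by increasing prefix length) for an inexact match
 * of length 'plen'. An exact-length node is returned as is; otherwise
 * the longest shorter prefix that carries routing information is returned.
 */
static const struct node *locate(const struct node *path, size_t n, int plen)
{
	const struct node *prev = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (path[i].bit > plen)
			break;
		if (path[i].bit == plen)
			return &path[i];
		/* Only remember nodes that actually hold routes. */
		if (path[i].flags & NODE_HAS_RTINFO)
			prev = &path[i];
	}
	return prev;
}
```

Without the flag check, an intermediate node with no routes would overwrite 'prev', and the shorter prefix that does hold routes (for example the default route) would be missed.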



* Re: [PATCH net v5 3/6] ipv4: Dump route exceptions if requested
  2019-06-18 13:20 ` [PATCH net v5 3/6] ipv4: Dump route exceptions if requested Stefano Brivio
@ 2019-06-18 14:48   ` David Ahern
  2019-06-19 23:57     ` Stefano Brivio
  0 siblings, 1 reply; 16+ messages in thread
From: David Ahern @ 2019-06-18 14:48 UTC (permalink / raw)
  To: Stefano Brivio, David Miller
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

On 6/18/19 7:20 AM, Stefano Brivio wrote:
> diff --git a/include/net/route.h b/include/net/route.h
> index 065b47754f05..f0d0086e76ce 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -221,6 +221,9 @@ void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
>  struct rtable *rt_dst_alloc(struct net_device *dev,
>  			     unsigned int flags, u16 type,
>  			     bool nopolicy, bool noxfrm, bool will_cache);
> +int rt_fill_info(struct net *net, __be32 dst, __be32 src, struct rtable *rt,
> +		 u32 table_id, struct flowi4 *fl4, struct sk_buff *skb,
> +		 u32 portid, u32 seq);
>  
>  struct in_ifaddr;
>  void fib_add_ifaddr(struct in_ifaddr *);
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 868c74771fa9..a00408827ae8 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -2000,28 +2000,92 @@ void fib_free_table(struct fib_table *tb)
>  	call_rcu(&tb->rcu, __trie_free_rcu);
>  }
>  
> +static int fib_dump_fnhe_from_leaf(struct fib_alias *fa, struct sk_buff *skb,
> +				   struct netlink_callback *cb,
> +				   int *fa_index, int fa_start)
> +{
> +	struct net *net = sock_net(cb->skb->sk);
> +	struct fib_info *fi = fa->fa_info;
> +	struct fnhe_hash_bucket *bucket;
> +	struct fib_nh_common *nhc;
> +	int i, genid;
> +
> +	if (!fi || fi->fib_flags & RTNH_F_DEAD)
> +		return 0;
> +
> +	nhc = fib_info_nhc(fi, 0);

This should be a loop over fi->fib_nhs for net:
	for (i = 0; i < fi->fib_nhs; i++) {
		nhc = fib_info_nhc(fi, i);
		...

and a loop over fib_info_num_path(fi) for net-next:
	for (i = 0; i < fib_info_num_path(fi); i++) {
		nhc = fib_info_nhc(fi, i);
		...


> +	if (nhc->nhc_flags & RTNH_F_DEAD)
> +		return 0;

And then the loop over the exception bucket could be a helper in route.c,
in which case you don't need to export rt_fill_info and the nhc_exceptions
code does not spread to fib_trie.c.


> +
> +	bucket = rcu_dereference(nhc->nhc_exceptions);
> +	if (!bucket)
> +		return 0;
> +
> +	genid = fnhe_genid(net);
> +
> +	for (i = 0; i < FNHE_HASH_SIZE; i++) {
> +		struct fib_nh_exception *fnhe;
> +
> +		for (fnhe = rcu_dereference(bucket[i].chain); fnhe;
> +		     fnhe = rcu_dereference(fnhe->fnhe_next)) {
> +			struct flowi4 fl4 = {};

Rather than pass an empty flow struct, update rt_fill_info to handle a
NULL fl4; it's only a few checks.
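
As a rough illustration of the "few checks" idea, here is a userspace sketch of a fill function that tolerates a missing flow. It is not rt_fill_info(): struct flow, struct msg and fill_info() are hypothetical stand-ins for struct flowi4, the netlink attributes and the real function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct flowi4. */
struct flow {
	uint32_t mark;
	int oif;
};

/* Hypothetical output record standing in for netlink attributes. */
struct msg {
	uint32_t dst;
	uint32_t mark;
	int oif;
	int has_flow_attrs;
};

/*
 * Fill 'm' for destination 'dst'. 'fl' may be NULL when the caller has
 * no flow (for instance when dumping exceptions); flow-derived fields
 * are then left unset instead of being read from an empty struct.
 */
static void fill_info(struct msg *m, uint32_t dst, const struct flow *fl)
{
	m->dst = dst;
	m->mark = 0;
	m->oif = 0;
	m->has_flow_attrs = 0;

	if (fl) {
		m->mark = fl->mark;
		m->oif = fl->oif;
		m->has_flow_attrs = 1;
	}
}
```

The dump path would then pass NULL, while a lookup path that has a real flow keeps passing a pointer to it.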

> +			struct rtable *rt;
> +			int err;
> +
> +			if (*fa_index < fa_start)
> +				goto next;
> +
> +			if (fnhe->fnhe_genid != genid)
> +				goto next;
> +
> +			if (fnhe->fnhe_expires &&
> +			    time_after(jiffies, fnhe->fnhe_expires))
> +				goto next;
> +
> +			rt = rcu_dereference(fnhe->fnhe_rth_input);
> +			if (!rt)
> +				rt = rcu_dereference(fnhe->fnhe_rth_output);
> +			if (!rt)
> +				goto next;
> +
> +			err = rt_fill_info(net, fnhe->fnhe_daddr, 0, rt,
> +					   fa->tb_id, &fl4, skb,
> +					   NETLINK_CB(cb->skb).portid,
> +					   cb->nlh->nlmsg_seq);
> +			if (err)
> +				return err;
> +next:
> +			(*fa_index)++;
> +		}
> +	}
> +
> +	return 0;
> +}
> +




* Re: [PATCH net v5 1/6] fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED
  2019-06-18 13:20 ` [PATCH net v5 1/6] fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED Stefano Brivio
@ 2019-06-18 14:49   ` David Ahern
  0 siblings, 0 replies; 16+ messages in thread
From: David Ahern @ 2019-06-18 14:49 UTC (permalink / raw)
  To: Stefano Brivio, David Miller
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

On 6/18/19 7:20 AM, Stefano Brivio wrote:
> The following patches add back the ability to dump IPv4 and IPv6 exception
> routes, and we need to allow selection of regular routes or exceptions.
> 
> Use RTM_F_CLONED as filter to decide whether to dump routes or exceptions:
> iproute2 passes it in dump requests (except for IPv6 cache flush requests,
> this will be fixed in iproute2) and this used to work as long as
> exceptions were stored directly in the FIB, for both IPv4 and IPv6.
> 
> Caveat: if strict checking is not requested (that is, if the dump request
> doesn't go through ip_valid_fib_dump_req()), we can't filter on protocol,
> tables or route types.
> 
> In this case, filtering on RTM_F_CLONED would be inconsistent: we would
> fix 'ip route list cache' by returning exception routes and at the same
> time introduce another bug in case another selector is present, e.g. on
> 'ip route list cache table main' we would return all exception routes,
> without filtering on tables.
> 
> Keep this consistent by applying no filters at all, and dumping both
> routes and exceptions, if strict checking is not requested. iproute2
> currently filters results anyway, and no unwanted results will be
> presented to the user. The kernel will just dump more data than needed.
> 
> v5: New patch: add dump_routes and dump_exceptions flags in filter and
>     simply clear the unwanted one if strict checking is enabled, don't
>     ignore NLM_F_MATCH and don't set filter_set if NLM_F_MATCH is set.
>     Skip filtering altogether if no strict checking is requested:
>     selecting routes or exceptions only would be inconsistent with the
>     fact we can't filter on tables.
> 
> Suggested-by: David Ahern <dsahern@gmail.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  include/net/ip_fib.h    | 2 ++
>  net/ipv4/fib_frontend.c | 8 +++++++-
>  net/ipv6/ip6_fib.c      | 3 ++-
>  3 files changed, 11 insertions(+), 2 deletions(-)
> 

Reviewed-by: David Ahern <dsahern@gmail.com>
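
The selection rule described in the quoted commit message can be sketched as follows. This is a simplified userspace model, not the patch itself: struct dump_filter here is a hypothetical subset of struct fib_dump_filter, select_dump() is a made-up helper, and SKETCH_RTM_F_CLONED is a local copy of the RTM_F_CLONED value.

```c
#include <stdbool.h>

#define SKETCH_RTM_F_CLONED 0x200	/* mirrors RTM_F_CLONED in linux/rtnetlink.h */

/* Hypothetical subset of struct fib_dump_filter. */
struct dump_filter {
	bool dump_routes;
	bool dump_exceptions;
};

/*
 * Decide what to dump. Without strict checking no other selector can be
 * honoured, so both routes and exceptions are dumped and userspace
 * filters the result. With strict checking, the cloned flag selects
 * exceptions only, and its absence selects regular routes only.
 */
static struct dump_filter select_dump(unsigned int rtm_flags, bool strict_check)
{
	struct dump_filter f = { .dump_routes = true, .dump_exceptions = true };

	if (!strict_check)
		return f;

	if (rtm_flags & SKETCH_RTM_F_CLONED)
		f.dump_routes = false;
	else
		f.dump_exceptions = false;

	return f;
}
```

Starting with both flags set and clearing the unwanted one keeps the non-strict case as the default, which matches the "apply no filters at all" behaviour described above.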




* Re: [PATCH net v5 2/6] ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering
  2019-06-18 13:20 ` [PATCH net v5 2/6] ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering Stefano Brivio
@ 2019-06-18 14:49   ` David Ahern
  0 siblings, 0 replies; 16+ messages in thread
From: David Ahern @ 2019-06-18 14:49 UTC (permalink / raw)
  To: Stefano Brivio, David Miller
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

On 6/18/19 7:20 AM, Stefano Brivio wrote:
> This functionally reverts the check introduced by commit
> e8ba330ac0c5 ("rtnetlink: Update fib dumps for strict data checking")
> as modified by commit e4e92fb160d7 ("net/ipv4: Bail early if user only
> wants prefix entries").
> 
> As we are preparing to fix listing of IPv4 cached routes, we need to
> give userspace a way to request them.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---

Reviewed-by: David Ahern <dsahern@gmail.com>




* Re: [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions
  2019-06-18 13:20 [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
                   ` (5 preceding siblings ...)
  2019-06-18 13:20 ` [PATCH net v5 6/6] ip6_fib: Don't discard nodes with valid routing information in fib6_locate_1() Stefano Brivio
@ 2019-06-18 14:51 ` David Ahern
  2019-06-18 16:25   ` David Miller
  6 siblings, 1 reply; 16+ messages in thread
From: David Ahern @ 2019-06-18 14:51 UTC (permalink / raw)
  To: Stefano Brivio, David Miller
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

On 6/18/19 7:20 AM, Stefano Brivio wrote:
> For IPv6 cached routes, the commands 'ip -6 route list cache' and
> 'ip -6 route flush cache' don't work at all after route exceptions have
> been moved to a separate hash table in commit 2b760fcf5cfb ("ipv6: hook
> up exception table to store dst cache").
> 
> For IPv4 cached routes, the command 'ip route list cache' has also
> stopped working in kernel 3.5 after commit 4895c771c7f0 ("ipv4: Add FIB
> nexthop exceptions.") introduced storage for route exceptions as a
> separate entity.
> 
> Fix this by allowing userspace to clearly request cached routes with
> the RTM_F_CLONED flag used as a filter (in conjunction with strict
> checking) and by retrieving and dumping cached routes if requested.
> 
> If strict checking is not requested (iproute2 < 5.0.0), we don't have a
> way to consistently filter results on other selectors (e.g. on tables),
> so skip filtering entirely and dump both regular routes and exceptions.
> 
> I'm submitting this for net as these changes fix rather relevant
> breakages. However, the scope might be a bit broad, and said breakages
> have been introduced 7 and 2 years ago, respectively, for IPv4 and IPv6.
> Let me know if I should rebase this on net-next instead.
> 
> For IPv4, cache flushing uses a completely different mechanism, so it
> wasn't affected. Listing of exception routes (modified routes pre-3.5) was
> tested against these versions of kernel and iproute2:
> 

Changing the dump code has been notoriously tricky to get right in one
go, no matter how much testing you have done. Given that, I think this
should go to net-next first, and once it proves OK there we can look at a
backport to stable trees.



* Re: [PATCH net v5 4/6] Revert "net/ipv6: Bail early if user only wants cloned entries"
  2019-06-18 13:20 ` [PATCH net v5 4/6] Revert "net/ipv6: Bail early if user only wants cloned entries" Stefano Brivio
@ 2019-06-18 14:51   ` David Ahern
  0 siblings, 0 replies; 16+ messages in thread
From: David Ahern @ 2019-06-18 14:51 UTC (permalink / raw)
  To: Stefano Brivio, David Miller
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

On 6/18/19 7:20 AM, Stefano Brivio wrote:
> This reverts commit 08e814c9e8eb5a982cbd1e8f6bd255d97c51026f: as we
> are preparing to fix listing and dumping of IPv6 cached routes, we
> need to allow RTM_F_CLONED as a flag to match routes against while
> dumping them.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> v5: No changes
> 
> v4: New patch
> 
>  net/ipv6/ip6_fib.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 

Reviewed-by: David Ahern <dsahern@gmail.com>



* Re: [PATCH net v5 5/6] ipv6: Dump route exceptions if requested
  2019-06-18 13:20 ` [PATCH net v5 5/6] ipv6: Dump route exceptions if requested Stefano Brivio
@ 2019-06-18 15:19   ` David Ahern
  2019-06-19 23:57     ` Stefano Brivio
  0 siblings, 1 reply; 16+ messages in thread
From: David Ahern @ 2019-06-18 15:19 UTC (permalink / raw)
  To: Stefano Brivio, David Miller
  Cc: Jianlin Shi, Wei Wang, Martin KaFai Lau, Eric Dumazet,
	Matti Vaittinen, netdev

On 6/18/19 7:20 AM, Stefano Brivio wrote:
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 0f60eb3a2873..7375f3b7d310 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -4854,33 +4854,94 @@ static bool fib6_info_uses_dev(const struct fib6_info *f6i,
>  	return false;
>  }
>  
> -int rt6_dump_route(struct fib6_info *rt, void *p_arg)
> +/* Return -1 if done with node, number of handled routes on partial dump */
> +int rt6_dump_route(struct fib6_info *rt, void *p_arg, unsigned int skip)

Changing the return code of rt6_dump_route should be a separate patch.


>  {
>  	struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
>  	struct fib_dump_filter *filter = &arg->filter;
> +	struct rt6_exception_bucket *bucket;
>  	unsigned int flags = NLM_F_MULTI;
> +	struct rt6_exception *rt6_ex;
>  	struct net *net = arg->net;
> +	int i, count = 0;
>  
>  	if (rt == net->ipv6.fib6_null_entry)
> -		return 0;
> +		return -1;
>  
>  	if ((filter->flags & RTM_F_PREFIX) &&
>  	    !(rt->fib6_flags & RTF_PREFIX_RT)) {
>  		/* success since this is not a prefix route */
> -		return 1;
> +		return -1;
>  	}
> -	if (filter->filter_set) {
> -		if ((filter->rt_type && rt->fib6_type != filter->rt_type) ||
> -		    (filter->dev && !fib6_info_uses_dev(rt, filter->dev)) ||
> -		    (filter->protocol && rt->fib6_protocol != filter->protocol)) {
> -			return 1;
> -		}
> +	if (filter->filter_set &&
> +	    ((filter->rt_type  && rt->fib6_type != filter->rt_type) ||
> +	     (filter->dev      && !fib6_info_uses_dev(rt, filter->dev)) ||
> +	     (filter->protocol && rt->fib6_protocol != filter->protocol))) {
> +		return -1;
> +	}
> +
> +	if (filter->filter_set ||
> +	    !filter->dump_routes || !filter->dump_exceptions) {
>  		flags |= NLM_F_DUMP_FILTERED;
>  	}
>  
> -	return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
> -			     RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
> -			     arg->cb->nlh->nlmsg_seq, flags);
> +	if (filter->dump_routes) {
> +		if (skip) {
> +			skip--;
> +		} else {
> +			if (rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL,
> +					  0, RTM_NEWROUTE,
> +					  NETLINK_CB(arg->cb->skb).portid,
> +					  arg->cb->nlh->nlmsg_seq, flags)) {
> +				return 0;
> +			}
> +			count++;
> +		}
> +	}
> +
> +	if (!filter->dump_exceptions)
> +		return -1;
> +

And the dump of the exception bucket should be a standalone function.
You will see why with net-next (it is per fib6_nh).

> +	bucket = rcu_dereference(rt->rt6i_exception_bucket);
> +	if (!bucket)
> +		return -1;
> +
> +	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
> +		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
> +			if (skip) {
> +				skip--;
> +				continue;
> +			}
> +
> +			/* Expiration of entries doesn't bump sernum, insertion
> +			 * does. Removal is triggered by insertion, so we can
> +			 * rely on the fact that if entries change between two
> +			 * partial dumps, this node is scanned again completely,
> +			 * see rt6_insert_exception() and fib6_dump_table().
> +			 *
> +			 * Count expired entries we go through as handled
> +			 * entries that we'll skip next time, in case of partial
> +			 * node dump. Otherwise, if entries expire meanwhile,
> +			 * we'll skip the wrong amount.
> +			 */
> +			if (rt6_check_expired(rt6_ex->rt6i)) {
> +				count++;
> +				continue;
> +			}
> +
> +			if (rt6_fill_node(net, arg->skb, rt, &rt6_ex->rt6i->dst,
> +					  NULL, NULL, 0, RTM_NEWROUTE,
> +					  NETLINK_CB(arg->cb->skb).portid,
> +					  arg->cb->nlh->nlmsg_seq, flags)) {
> +				return count;
> +			}
> +
> +			count++;
> +		}
> +		bucket++;
> +	}
> +
> +	return -1;
>  }
>  
>  static int inet6_rtm_valid_getroute_req(struct sk_buff *skb,
> 



* Re: [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions
  2019-06-18 14:51 ` [PATCH net v5 0/6] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions David Ahern
@ 2019-06-18 16:25   ` David Miller
  0 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2019-06-18 16:25 UTC (permalink / raw)
  To: dsahern; +Cc: sbrivio, jishi, weiwan, kafai, edumazet, matti.vaittinen, netdev

From: David Ahern <dsahern@gmail.com>
Date: Tue, 18 Jun 2019 08:51:01 -0600

> Changing the dump code has been notoriously tricky to get right in one
> go, no matter how much testing you have done. Given that, I think this
> should go to net-next first, and once it proves OK there we can look at a
> backport to stable trees.

I agree, this is probably the wisest way forward with these changes.


* Re: [PATCH net v5 3/6] ipv4: Dump route exceptions if requested
  2019-06-18 14:48   ` David Ahern
@ 2019-06-19 23:57     ` Stefano Brivio
  0 siblings, 0 replies; 16+ messages in thread
From: Stefano Brivio @ 2019-06-19 23:57 UTC (permalink / raw)
  To: David Ahern
  Cc: David Miller, Jianlin Shi, Wei Wang, Martin KaFai Lau,
	Eric Dumazet, Matti Vaittinen, netdev

On Tue, 18 Jun 2019 08:48:23 -0600
David Ahern <dsahern@gmail.com> wrote:

> > +++ b/net/ipv4/fib_trie.c
> > @@ -2000,28 +2000,92 @@ void fib_free_table(struct fib_table *tb)
> >  	call_rcu(&tb->rcu, __trie_free_rcu);
> >  }
> >  
> > +static int fib_dump_fnhe_from_leaf(struct fib_alias *fa, struct sk_buff *skb,
> > +				   struct netlink_callback *cb,
> > +				   int *fa_index, int fa_start)
> > +{
> > +	struct net *net = sock_net(cb->skb->sk);
> > +	struct fib_info *fi = fa->fa_info;
> > +	struct fnhe_hash_bucket *bucket;
> > +	struct fib_nh_common *nhc;
> > +	int i, genid;
> > +
> > +	if (!fi || fi->fib_flags & RTNH_F_DEAD)
> > +		return 0;
> > +
> > +	nhc = fib_info_nhc(fi, 0);  
> 
> This should be a loop over fi->fib_nhs for net:
> 	for (i = 0; i < fi->fib_nhs; i++) {
> 		nhc = fib_info_nhc(fi, i);
> 		...
> 
> and a loop over fib_info_num_path(fi) for net-next:
> 	for (i = 0; i < fib_info_num_path(fi); i++) {
> 		nhc = fib_info_nhc(fi, i);
> 		...

Right, I started this from net-next and only later "adapted" it to net,
clearly in the wrong way. Thanks for providing both expressions. Fixed
in v6.

> 
> > +	if (nhc->nhc_flags & RTNH_F_DEAD)
> > +		return 0;  
> 
> And then the loop over the exception bucket could be a helper in route.c,
> in which case you don't need to export rt_fill_info and the nhc_exceptions
> code does not spread to fib_trie.c.

Cleaner I guess, changed in v6.
 
> > +
> > +	bucket = rcu_dereference(nhc->nhc_exceptions);
> > +	if (!bucket)
> > +		return 0;
> > +
> > +	genid = fnhe_genid(net);
> > +
> > +	for (i = 0; i < FNHE_HASH_SIZE; i++) {
> > +		struct fib_nh_exception *fnhe;
> > +
> > +		for (fnhe = rcu_dereference(bucket[i].chain); fnhe;
> > +		     fnhe = rcu_dereference(fnhe->fnhe_next)) {
> > +			struct flowi4 fl4 = {};  
> 
> rather than pass an empty flow struct, update rt_fill_info to handle a
> NULL fl4; it's only a few checks.

Added patch and changed in v6.

-- 
Stefano


* Re: [PATCH net v5 5/6] ipv6: Dump route exceptions if requested
  2019-06-18 15:19   ` David Ahern
@ 2019-06-19 23:57     ` Stefano Brivio
  0 siblings, 0 replies; 16+ messages in thread
From: Stefano Brivio @ 2019-06-19 23:57 UTC (permalink / raw)
  To: David Ahern
  Cc: David Miller, Jianlin Shi, Wei Wang, Martin KaFai Lau,
	Eric Dumazet, Matti Vaittinen, netdev

On Tue, 18 Jun 2019 09:19:53 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/18/19 7:20 AM, Stefano Brivio wrote:
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index 0f60eb3a2873..7375f3b7d310 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -4854,33 +4854,94 @@ static bool fib6_info_uses_dev(const struct fib6_info *f6i,
> >  	return false;
> >  }
> >  
> > -int rt6_dump_route(struct fib6_info *rt, void *p_arg)
> > +/* Return -1 if done with node, number of handled routes on partial dump */
> > +int rt6_dump_route(struct fib6_info *rt, void *p_arg, unsigned int skip)  
> 
> Changing the return code of rt6_dump_route should be a separate patch.

I guess the purpose would be to highlight how existing cases are
changed, but that looks rather trivial to me. Anyway, changed in v6.

> > +	if (filter->dump_routes) {
> > +		if (skip) {
> > +			skip--;
> > +		} else {
> > +			if (rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL,
> > +					  0, RTM_NEWROUTE,
> > +					  NETLINK_CB(arg->cb->skb).portid,
> > +					  arg->cb->nlh->nlmsg_seq, flags)) {
> > +				return 0;
> > +			}
> > +			count++;
> > +		}
> > +	}
> > +
> > +	if (!filter->dump_exceptions)
> > +		return -1;
> > +  
> 
> And the dump of the exception bucket should be a standalone function.
> You will see why with net-next (it is per fib6_nh).

Sure, no way around it now, changed in v6.

-- 
Stefano


