netdev.vger.kernel.org archive mirror
* [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions
@ 2019-06-15  1:32 Stefano Brivio
  2019-06-15  1:32 ` [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version Stefano Brivio
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  1:32 UTC (permalink / raw)
  To: David Miller, David Ahern, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

For IPv6 cached routes, the commands 'ip -6 route list cache' and
'ip -6 route flush cache' don't work at all after route exceptions have
been moved to a separate hash table in commit 2b760fcf5cfb ("ipv6: hook
up exception table to store dst cache").

For IPv4 cached routes, the command 'ip route list cache' has also
stopped working in kernel 3.5 after commit 4895c771c7f0 ("ipv4: Add FIB
nexthop exceptions.") introduced storage for route exceptions as a
separate entity.

Fix this by allowing userspace to clearly request cached routes with
the RTM_F_CLONED flag used as a filter (in conjunction with strict
checking or NLM_F_MATCH) and by retrieving and dumping cached routes
if requested.
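
As a rough illustration, the sketch below fills in an RTM_GETROUTE dump request that asks for cached routes only. It assumes the Linux UAPI headers. The struct route_dump_req wrapper and the build_cached_route_dump() helper are hypothetical names used only here. Note that NLM_F_DUMP is defined as NLM_F_ROOT | NLM_F_MATCH, so a request built this way carries NLM_F_MATCH even when strict checking is off.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>      /* AF_INET, AF_INET6 */
#include <linux/netlink.h>   /* struct nlmsghdr, NLM_F_*, NLMSG_LENGTH */
#include <linux/rtnetlink.h> /* struct rtmsg, RTM_GETROUTE, RTM_F_CLONED */

/* Hypothetical layout of a minimal dump request: header + rtmsg. */
struct route_dump_req {
	struct nlmsghdr nlh;
	struct rtmsg rtm;
};

/* Fill 'req' with a dump request for cached (cloned) routes only. */
static void build_cached_route_dump(struct route_dump_req *req,
				    unsigned char family, unsigned int seq)
{
	assert(req != NULL);
	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
	req->nlh.nlmsg_type = RTM_GETROUTE;
	/* NLM_F_DUMP == NLM_F_ROOT | NLM_F_MATCH */
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req->nlh.nlmsg_seq = seq;
	req->rtm.rtm_family = family;
	/* Used as a filter on the dump, not as a property of a route */
	req->rtm.rtm_flags = RTM_F_CLONED;
}
```

The buffer could then be sent with sendto() on an AF_NETLINK/NETLINK_ROUTE socket. Enabling the NETLINK_GET_STRICT_CHK socket option beforehand would make the kernel take the strict-checking path instead of the NLM_F_MATCH one.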

I'm submitting this for net as these changes fix fairly significant
breakages. However, the scope might be a bit broad, and these breakages
were introduced 7 and 2 years ago for IPv4 and IPv6, respectively.
Let me know if I should rebase this on net-next instead.

For IPv4, cache flushing uses a completely different mechanism, so it
wasn't affected. Listing of exception routes (modified routes pre-3.5) was
tested against these versions of kernel and iproute2:

                    iproute2
kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
 3.5-rc4         +        +        +        +       +
 4.4
 4.9
 4.14
 4.15
 4.19
 5.0
 5.1
 fixed           +        +        +        +       +


For IPv6, a separate iproute2 patch is required. Versions of iproute2
and kernel tested:

                    iproute2
kernel             4.14.0   4.15.0   4.19.0   5.0.0   5.1.0    5.1.0, patched
 3.18    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.4     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.9     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.14    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.15    list
         flush
 4.19    list
         flush
 5.0     list
         flush
 5.1     list
         flush
 with    list        +        +        +        +       +            +
 fix     flush                                                       +


v4: Fix the listing issue also for IPv4, making the behaviour consistent
    with IPv6. Honour NLM_F_MATCH as per RFC 3549 and allow usage of
    RTM_F_CLONED filter. Split patches into smaller logical changes.

v3: Drop check on RTM_F_CLONED and rework logic of return values of
    rt6_dump_route()

v2: Add count of routes handled in partial dumps, and skip them, in patch 1/2.

Stefano Brivio (8):
  ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict
    version
  ipv4: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK
    consistent
  ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering
  ipv4: Dump routed caches if requested
  Revert "net/ipv6: Bail early if user only wants cloned entries"
  ipv6: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK
    consistent
  ipv6: Dump route exceptions too in rt6_dump_route()
  ip6_fib: Don't discard nodes with valid routing information in
    fib6_locate_1()

 include/net/ip6_fib.h   |   1 +
 include/net/ip6_route.h |   2 +-
 include/net/ip_fib.h    |   6 +--
 include/net/route.h     |   3 ++
 net/ipv4/fib_frontend.c |  50 ++++++++++++-------
 net/ipv4/fib_trie.c     | 103 +++++++++++++++++++++++++++++++++++-----
 net/ipv4/ipmr.c         |   4 +-
 net/ipv4/route.c        |   6 +--
 net/ipv6/ip6_fib.c      |  37 ++++++++++-----
 net/ipv6/ip6mr.c        |   4 +-
 net/ipv6/route.c        |  74 ++++++++++++++++++++++++++---
 net/mpls/af_mpls.c      |   2 +-
 12 files changed, 230 insertions(+), 62 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-15  1:32 [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
@ 2019-06-15  1:32 ` Stefano Brivio
  2019-06-15  2:54   ` David Ahern
  2019-06-15  1:32 ` [PATCH net v4 2/8] ipv4: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent Stefano Brivio
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  1:32 UTC (permalink / raw)
  To: David Miller, David Ahern, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

ip_valid_fib_dump_req() does two things: performs strict checking on
netlink attributes for dump requests, and sets a dump filter if netlink
attributes require it.

We might want to just set a filter, without performing strict validation.

Rename it to ip_filter_fib_dump_req(), and add a 'strict' boolean
argument that must be set if strict validation is requested.

This patch doesn't introduce any functional changes.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: New patch

 include/net/ip_fib.h    |  6 +++---
 net/ipv4/fib_frontend.c | 34 ++++++++++++++++++++++------------
 net/ipv4/ipmr.c         |  4 ++--
 net/ipv6/ip6_fib.c      |  2 +-
 net/ipv6/ip6mr.c        |  4 ++--
 net/mpls/af_mpls.c      |  2 +-
 6 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index bbeff32fb6cb..76094a0b97cf 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -493,9 +493,9 @@ static inline void fib_proc_exit(struct net *net)
 
 u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr);
 
-int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
-			  struct fib_dump_filter *filter,
-			  struct netlink_callback *cb);
+int ip_filter_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
+			   struct fib_dump_filter *filter,
+			   struct netlink_callback *cb, bool strict);
 
 int fib_nexthop_info(struct sk_buff *skb, const struct fib_nh_common *nh,
 		     unsigned char *flags, bool skip_oif);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index e54c2bcbb465..873fc5c4721c 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -859,9 +859,9 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh,
 	return err;
 }
 
-int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
-			  struct fib_dump_filter *filter,
-			  struct netlink_callback *cb)
+int ip_filter_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
+			   struct fib_dump_filter *filter,
+			   struct netlink_callback *cb, bool strict)
 {
 	struct netlink_ext_ack *extack = cb->extack;
 	struct nlattr *tb[RTA_MAX + 1];
@@ -876,12 +876,12 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 	}
 
 	rtm = nlmsg_data(nlh);
-	if (rtm->rtm_dst_len || rtm->rtm_src_len  || rtm->rtm_tos   ||
-	    rtm->rtm_scope) {
+	if (strict && (rtm->rtm_dst_len || rtm->rtm_src_len || rtm->rtm_tos ||
+		       rtm->rtm_scope)) {
 		NL_SET_ERR_MSG(extack, "Invalid values in header for FIB dump request");
 		return -EINVAL;
 	}
-	if (rtm->rtm_flags & ~(RTM_F_CLONED | RTM_F_PREFIX)) {
+	if (strict && rtm->rtm_flags & ~(RTM_F_CLONED | RTM_F_PREFIX)) {
 		NL_SET_ERR_MSG(extack, "Invalid flags for FIB dump request");
 		return -EINVAL;
 	}
@@ -892,10 +892,18 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 	filter->rt_type  = rtm->rtm_type;
 	filter->table_id = rtm->rtm_table;
 
-	err = nlmsg_parse_deprecated_strict(nlh, sizeof(*rtm), tb, RTA_MAX,
-					    rtm_ipv4_policy, extack);
-	if (err < 0)
-		return err;
+	if (strict) {
+		err = nlmsg_parse_deprecated_strict(nlh, sizeof(*rtm), tb,
+						    RTA_MAX, rtm_ipv4_policy,
+						    extack);
+		if (err < 0)
+			return err;
+	} else {
+		err = nlmsg_parse_deprecated(nlh, sizeof(*rtm), tb, RTA_MAX,
+					     rtm_ipv4_policy, extack);
+		if (err < 0)
+			return err;
+	}
 
 	for (i = 0; i <= RTA_MAX; ++i) {
 		int ifindex;
@@ -914,6 +922,8 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 				return -ENODEV;
 			break;
 		default:
+			if (!strict)
+				break;
 			NL_SET_ERR_MSG(extack, "Unsupported attribute in dump request");
 			return -EINVAL;
 		}
@@ -927,7 +937,7 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(ip_valid_fib_dump_req);
+EXPORT_SYMBOL_GPL(ip_filter_fib_dump_req);
 
 static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 {
@@ -941,7 +951,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	int dumped = 0, err;
 
 	if (cb->strict_check) {
-		err = ip_valid_fib_dump_req(net, nlh, &filter, cb);
+		err = ip_filter_fib_dump_req(net, nlh, &filter, cb, true);
 		if (err < 0)
 			return err;
 	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index c07bc82cbbe9..1e089acc9479 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2597,8 +2597,8 @@ static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 	int err;
 
 	if (cb->strict_check) {
-		err = ip_valid_fib_dump_req(sock_net(skb->sk), cb->nlh,
-					    &filter, cb);
+		err = ip_filter_fib_dump_req(sock_net(skb->sk), cb->nlh,
+					     &filter, cb, true);
 		if (err < 0)
 			return err;
 	}
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 9180c8b6f764..b21a9ec02891 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -571,7 +571,7 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	if (cb->strict_check) {
 		int err;
 
-		err = ip_valid_fib_dump_req(net, nlh, &arg.filter, cb);
+		err = ip_filter_fib_dump_req(net, nlh, &arg.filter, cb, true);
 		if (err < 0)
 			return err;
 	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index e80d36c5073d..4960c3fe8e83 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -2487,8 +2487,8 @@ static int ip6mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 	int err;
 
 	if (cb->strict_check) {
-		err = ip_valid_fib_dump_req(sock_net(skb->sk), nlh,
-					    &filter, cb);
+		err = ip_filter_fib_dump_req(sock_net(skb->sk), nlh, &filter,
+					     cb, true);
 		if (err < 0)
 			return err;
 	}
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 198ec4fe4148..f54d2f5834f8 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -2078,7 +2078,7 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 				   struct fib_dump_filter *filter,
 				   struct netlink_callback *cb)
 {
-	return ip_valid_fib_dump_req(net, nlh, filter, cb);
+	return ip_filter_fib_dump_req(net, nlh, filter, cb, true);
 }
 #else
 static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
-- 
2.20.1



* [PATCH net v4 2/8] ipv4: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent
  2019-06-15  1:32 [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
  2019-06-15  1:32 ` [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version Stefano Brivio
@ 2019-06-15  1:32 ` Stefano Brivio
  2019-06-15  3:13   ` David Ahern
  2019-06-15  1:32 ` [PATCH net v4 3/8] ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering Stefano Brivio
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  1:32 UTC (permalink / raw)
  To: David Miller, David Ahern, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

Socket option NETLINK_GET_STRICT_CHK, quoting from commit 89d35528d17d
("netlink: Add new socket option to enable strict checking on dumps"),
is used to "request strict checking of headers and attributes on dump
requests".

If some attributes are set (including flags), setting this option causes
dump functions to filter results according to these attributes, via the
filter_set flag. However, if strict checking is requested, this should
imply that we also filter results based on flags that are *not* set.

This is currently not the case, at least for IPv4 FIB dumps: if the
RTM_F_CLONED flag is not set, and strict checking is required, we should
not return routes with the RTM_F_CLONED flag set.

Set the filter_set flag whenever strict checking is requested, limiting
the scope to IPv4 FIB dumps for the time being, as other users of the
flag might not present this inconsistency.

Note that this partially duplicates the semantics of NLM_F_MATCH as
described by RFC 3549, par. 3.1.1. Instead of setting a filter based on
the size of the netlink message, properly support NLM_F_MATCH, by
setting a filter via ip_filter_fib_dump_req() and setting the filter_set
flag.
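
The resulting decision in inet_dump_fib() can be modeled as below. This is a hypothetical sketch (demo_dump_is_filtered() is not a kernel function), and it ignores error handling such as the -ENODEV case, which ends the dump early.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/netlink.h> /* NLM_F_REQUEST, NLM_F_ROOT, NLM_F_MATCH, NLM_F_DUMP */

/* Returns true when dump results would be filtered (filter_set == 1). */
static bool demo_dump_is_filtered(bool strict_check,
				  unsigned short nlmsg_flags)
{
	if (strict_check)
		return true;	/* header and attributes always define a filter */
	if (nlmsg_flags & NLM_F_MATCH)
		return true;	/* RFC 3549: return entries matching criteria */
	return false;		/* no filtering requested */
}
```

Because NLM_F_DUMP includes NLM_F_MATCH, a typical dump request built with NLM_F_DUMP takes the second branch even when strict checking is off.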

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: New patch

 net/ipv4/fib_frontend.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 873fc5c4721c..32a04318d725 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -954,10 +954,14 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 		err = ip_filter_fib_dump_req(net, nlh, &filter, cb, true);
 		if (err < 0)
 			return err;
-	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
-		struct rtmsg *rtm = nlmsg_data(nlh);
-
-		filter.flags = rtm->rtm_flags & (RTM_F_PREFIX | RTM_F_CLONED);
+		filter.filter_set = 1;
+	} else if (nlh->nlmsg_flags & NLM_F_MATCH) {
+		err = ip_filter_fib_dump_req(net, nlh, &filter, cb, false);
+		if (err == -ENODEV)
+			return skb->len;
+		if (err)
+			return err;
+		filter.filter_set = 1;
 	}
 
 	/* fib entries are never clones and ipv4 does not use prefix flag */
-- 
2.20.1



* [PATCH net v4 3/8] ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering
  2019-06-15  1:32 [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
  2019-06-15  1:32 ` [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version Stefano Brivio
  2019-06-15  1:32 ` [PATCH net v4 2/8] ipv4: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent Stefano Brivio
@ 2019-06-15  1:32 ` Stefano Brivio
  2019-06-15  1:32 ` [PATCH 4/8] ipv4: Dump routed caches if requested Stefano Brivio
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  1:32 UTC (permalink / raw)
  To: David Miller, David Ahern, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

This functionally reverts the check introduced by commit
e8ba330ac0c5 ("rtnetlink: Update fib dumps for strict data checking")
as modified by commit e4e92fb160d7 ("net/ipv4: Bail early if user only
wants prefix entries").

As we are preparing to fix listing of IPv4 cached routes, we need to
give userspace a way to request cached routes only.
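
The effect on the early bail-out in inet_dump_fib() can be summarized with two predicates. This is a hypothetical sketch of the condition before and after this patch, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/rtnetlink.h> /* RTM_F_CLONED, RTM_F_PREFIX */

/* Before: filtering on either flag always produced an empty dump. */
static bool demo_bail_early_old(unsigned int filter_flags)
{
	return filter_flags & (RTM_F_PREFIX | RTM_F_CLONED);
}

/* After: only RTM_F_PREFIX, which IPv4 doesn't use, short-circuits. */
static bool demo_bail_early_new(unsigned int filter_flags)
{
	return filter_flags & RTM_F_PREFIX;
}
```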

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: New patch

 net/ipv4/fib_frontend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 32a04318d725..815997487247 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -964,8 +964,8 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 		filter.filter_set = 1;
 	}
 
-	/* fib entries are never clones and ipv4 does not use prefix flag */
-	if (filter.flags & (RTM_F_PREFIX | RTM_F_CLONED))
+	/* ipv4 does not use prefix flag */
+	if (filter.flags & RTM_F_PREFIX)
 		return skb->len;
 
 	if (filter.table_id) {
-- 
2.20.1



* [PATCH 4/8] ipv4: Dump routed caches if requested
  2019-06-15  1:32 [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
                   ` (2 preceding siblings ...)
  2019-06-15  1:32 ` [PATCH net v4 3/8] ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering Stefano Brivio
@ 2019-06-15  1:32 ` Stefano Brivio
  2019-06-15  1:32 ` [PATCH 5/8] Revert "net/ipv6: Bail early if user only wants cloned entries" Stefano Brivio
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  1:32 UTC (permalink / raw)
  To: David Miller, David Ahern, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

Since commit 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions."), cached
exception routes are stored as a separate entity, so they are not dumped
on a FIB dump, even if the RTM_F_CLONED flag is passed.

This implies that the command 'ip route list cache' no longer returns any
results.

If the RTM_F_CLONED flag is passed as a filter, together with strict checking or
the NLM_F_MATCH flag described by RFC 3549, retrieve nexthop exception
routes and dump them.

This requires an additional netlink callback argument (cb->args[5]) to
track how many entries were already dumped for the last leaf included in
a partial netlink dump.
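
The resume logic can be modeled in userspace as a two-level cursor, similar in spirit to cb->args[4] (alias index) and cb->args[5] (entries already sent for that alias). This is a hypothetical sketch, not the fib_trie code: each alias is reduced to a count of dumpable entries (its route plus its exceptions), and a fixed 'budget' stands in for the space left in the skb. In this model the skip count applies only to the alias where the previous pass stopped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor kept between partial dumps. */
struct demo_cursor {
	size_t alias;	/* first alias not fully dumped */
	size_t skip;	/* entries of that alias already sent */
};

/* Emits entry IDs (alias * 100 + index) into out[], at most 'budget'
 * per call. Returns the number emitted; 0 means the dump is complete. */
static size_t demo_dump_pass(const size_t *n_entries, size_t n_aliases,
			     struct demo_cursor *cur, int *out, size_t budget)
{
	size_t emitted = 0;

	assert(budget > 0);
	for (size_t a = cur->alias; a < n_aliases; a++) {
		/* Only the alias we stopped in has entries to skip */
		size_t start = (a == cur->alias) ? cur->skip : 0;

		for (size_t e = start; e < n_entries[a]; e++) {
			if (emitted == budget) {
				cur->alias = a;	/* resume in this alias... */
				cur->skip = e;	/* ...after e entries */
				return emitted;
			}
			out[emitted++] = (int)(a * 100 + e);
		}
	}
	cur->alias = n_aliases;
	cur->skip = 0;
	return emitted;
}
```

Calling demo_dump_pass() repeatedly with the same cursor until it returns 0 yields every entry exactly once, provided the entries don't change between passes.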

Note that this is only as accurate as the existing tracking mechanism for
leaves: if a partial dump is restarted after exceptions are removed or
expired, we might skip some non-dumped entries. To improve this, we could
attach a 'sernum' attribute (similar to the one used for IPv6) to nexthop
entities, and bump this counter whenever exceptions change.
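
The limitation and the proposed improvement can be illustrated with a hypothetical model. struct demo_node, its sernum field and the restart-from-zero policy below are assumptions for illustration, not existing kernel code. If an already-dumped entry is removed between two passes, skipping by count jumps over an entry that was never sent; comparing a generation number detects this, at the cost of possibly repeating entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node: exception IDs plus a generation counter that is
 * bumped whenever the set of exceptions changes. */
struct demo_node {
	int ids[8];
	size_t n;
	unsigned int sernum;
};

static void demo_node_remove(struct demo_node *node, size_t idx)
{
	assert(idx < node->n);
	for (size_t i = idx + 1; i < node->n; i++)
		node->ids[i - 1] = node->ids[i];
	node->n--;
	node->sernum++;
}

/* Where to resume after 'skip' entries were dumped while the node had
 * generation 'seen_sernum'. If the node changed, the count is no longer
 * reliable: restart from 0, preferring duplicates over missed entries. */
static size_t demo_resume_index(const struct demo_node *node, size_t skip,
				unsigned int seen_sernum)
{
	return node->sernum == seen_sernum ? skip : 0;
}
```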

Listing of exception routes (modified routes pre-3.5) was tested against
these versions of kernel and iproute2:

                    iproute2
kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
 3.5-rc4         +        +        +        +       +
 4.4
 4.9
 4.14
 4.15
 4.19
 5.0
 5.1
 fixed           +        +        +        +       +

Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: New patch

 include/net/route.h |   3 ++
 net/ipv4/fib_trie.c | 103 ++++++++++++++++++++++++++++++++++++++------
 net/ipv4/route.c    |   6 +--
 3 files changed, 97 insertions(+), 15 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 065b47754f05..f0d0086e76ce 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -221,6 +221,9 @@ void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
 struct rtable *rt_dst_alloc(struct net_device *dev,
 			     unsigned int flags, u16 type,
 			     bool nopolicy, bool noxfrm, bool will_cache);
+int rt_fill_info(struct net *net, __be32 dst, __be32 src, struct rtable *rt,
+		 u32 table_id, struct flowi4 *fl4, struct sk_buff *skb,
+		 u32 portid, u32 seq);
 
 struct in_ifaddr;
 void fib_add_ifaddr(struct in_ifaddr *);
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 868c74771fa9..4beeca778eab 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2000,28 +2000,94 @@ void fib_free_table(struct fib_table *tb)
 	call_rcu(&tb->rcu, __trie_free_rcu);
 }
 
+static int fib_dump_fnhe_from_leaf(struct fib_alias *fa, struct sk_buff *skb,
+				   struct netlink_callback *cb,
+				   int *fa_index, int fa_start)
+{
+	struct net *net = sock_net(cb->skb->sk);
+	struct fib_info *fi = fa->fa_info;
+	struct fnhe_hash_bucket *bucket;
+	struct fib_nh_common *nhc;
+	int i, genid;
+
+	if (!fi || fi->fib_flags & RTNH_F_DEAD)
+		return 0;
+
+	nhc = fib_info_nhc(fi, 0);
+	if (nhc->nhc_flags & RTNH_F_DEAD)
+		return 0;
+
+	bucket = rcu_dereference(nhc->nhc_exceptions);
+	if (!bucket)
+		return 0;
+
+	genid = fnhe_genid(net);
+
+	for (i = 0; i < FNHE_HASH_SIZE; i++) {
+		struct fib_nh_exception *fnhe;
+
+		for (fnhe = rcu_dereference(bucket[i].chain); fnhe;
+		     fnhe = rcu_dereference(fnhe->fnhe_next)) {
+			struct flowi4 fl4 = {};
+			struct rtable *rt;
+			int err;
+
+			if (*fa_index < fa_start)
+				goto next;
+
+			if (fnhe->fnhe_genid != genid)
+				goto next;
+
+			if (fnhe->fnhe_expires &&
+			    time_after(jiffies, fnhe->fnhe_expires))
+				goto next;
+
+			rt = rcu_dereference(fnhe->fnhe_rth_input);
+			if (!rt)
+				rt = rcu_dereference(fnhe->fnhe_rth_output);
+			if (!rt)
+				goto next;
+
+			err = rt_fill_info(net, fnhe->fnhe_daddr, 0, rt,
+					   fa->tb_id, &fl4, skb,
+					   NETLINK_CB(cb->skb).portid,
+					   cb->nlh->nlmsg_seq);
+			if (err)
+				return err;
+next:
+			(*fa_index)++;
+		}
+	}
+
+	return 0;
+}
+
 static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 			     struct sk_buff *skb, struct netlink_callback *cb,
 			     struct fib_dump_filter *filter)
 {
+	bool dump_exceptions = true, dump_routes = true;
 	unsigned int flags = NLM_F_MULTI;
 	__be32 xkey = htonl(l->key);
+	int i, s_i, i_fa, s_fa, err;
 	struct fib_alias *fa;
-	int i, s_i;
 
-	if (filter->filter_set)
+	if (filter->filter_set) {
 		flags |= NLM_F_DUMP_FILTERED;
+		dump_routes = !(dump_exceptions = filter->flags & RTM_F_CLONED);
+	}
 
 	s_i = cb->args[4];
+	s_fa = cb->args[5];
 	i = 0;
 
 	/* rcu_read_lock is hold by caller */
 	hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
-		int err;
-
 		if (i < s_i)
 			goto next;
 
+		i_fa = 0;
+
 		if (tb->tb_id != fa->tb_id)
 			goto next;
 
@@ -2038,21 +2104,34 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 				goto next;
 		}
 
-		err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
-				    cb->nlh->nlmsg_seq, RTM_NEWROUTE,
-				    tb->tb_id, fa->fa_type,
-				    xkey, KEYLENGTH - fa->fa_slen,
-				    fa->fa_tos, fa->fa_info, flags);
-		if (err < 0) {
-			cb->args[4] = i;
-			return err;
+		if (dump_routes && !s_fa) {
+			err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
+					    cb->nlh->nlmsg_seq, RTM_NEWROUTE,
+					    tb->tb_id, fa->fa_type,
+					    xkey, KEYLENGTH - fa->fa_slen,
+					    fa->fa_tos, fa->fa_info, flags);
+			if (err < 0)
+				goto stop;
+			i_fa++;
+		}
+
+		if (dump_exceptions) {
+			err = fib_dump_fnhe_from_leaf(fa, skb, cb, &i_fa, s_fa);
+			if (err < 0)
+				goto stop;
 		}
+
 next:
 		i++;
 	}
 
 	cb->args[4] = i;
 	return skb->len;
+
+stop:
+	cb->args[4] = i;
+	cb->args[5] = i_fa;
+	return err;
 }
 
 /* rcu_read_lock needs to be hold by caller from readside */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6cb7cff22db9..cc970fd861e8 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2663,9 +2663,9 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
 /* called with rcu_read_lock held */
-static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
-			struct rtable *rt, u32 table_id, struct flowi4 *fl4,
-			struct sk_buff *skb, u32 portid, u32 seq)
+int rt_fill_info(struct net *net, __be32 dst, __be32 src, struct rtable *rt,
+		 u32 table_id, struct flowi4 *fl4, struct sk_buff *skb,
+		 u32 portid, u32 seq)
 {
 	struct rtmsg *r;
 	struct nlmsghdr *nlh;
-- 
2.20.1



* [PATCH 5/8] Revert "net/ipv6: Bail early if user only wants cloned entries"
  2019-06-15  1:32 [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
                   ` (3 preceding siblings ...)
  2019-06-15  1:32 ` [PATCH 4/8] ipv4: Dump routed caches if requested Stefano Brivio
@ 2019-06-15  1:32 ` Stefano Brivio
  2019-06-15  1:32 ` [PATCH 6/8] ipv6: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent Stefano Brivio
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  1:32 UTC (permalink / raw)
  To: David Miller, David Ahern, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

This reverts commit 08e814c9e8eb5a982cbd1e8f6bd255d97c51026f: as we
are preparing to fix listing and dumping of IPv6 cached routes, we
need to allow RTM_F_CLONED as a flag to match routes against while
dumping them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: New patch, split from 6/8

 net/ipv6/ip6_fib.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index b21a9ec02891..bc5cb359c8a6 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -577,13 +577,10 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
 		struct rtmsg *rtm = nlmsg_data(nlh);
 
-		arg.filter.flags = rtm->rtm_flags & (RTM_F_PREFIX|RTM_F_CLONED);
+		if (rtm->rtm_flags & RTM_F_PREFIX)
+			arg.filter.flags = RTM_F_PREFIX;
 	}
 
-	/* fib entries are never clones */
-	if (arg.filter.flags & RTM_F_CLONED)
-		goto out;
-
 	w = (void *)cb->args[2];
 	if (!w) {
 		/* New dump:
-- 
2.20.1



* [PATCH 6/8] ipv6: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent
  2019-06-15  1:32 [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
                   ` (4 preceding siblings ...)
  2019-06-15  1:32 ` [PATCH 5/8] Revert "net/ipv6: Bail early if user only wants cloned entries" Stefano Brivio
@ 2019-06-15  1:32 ` Stefano Brivio
  2019-06-15  1:32 ` [PATCH 7/8] ipv6: Dump route exceptions too in rt6_dump_route() Stefano Brivio
  2019-06-15  1:32 ` [PATCH 8/8] ip6_fib: Don't discard nodes with valid routing information in fib6_locate_1() Stefano Brivio
  7 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  1:32 UTC (permalink / raw)
  To: David Miller, David Ahern, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

Socket option NETLINK_GET_STRICT_CHK, quoting from commit 89d35528d17d
("netlink: Add new socket option to enable strict checking on dumps"),
is used to "request strict checking of headers and attributes on dump
requests".

If some attributes are set (including flags), setting this option causes
dump functions to filter results according to these attributes, via the
filter_set flag. However, if strict checking is requested, this should
imply that we also filter results based on flags that are *not* set.

This is currently not the case, at least for IPv6 FIB dumps: if the
RTM_F_CLONED flag is not set, and strict checking is required, we should
not return routes with the RTM_F_CLONED flag set.

Set the filter_set flag whenever strict checking is requested, limiting
the scope to IPv6 FIB dumps for the time being, as other users of the
flag might not present this inconsistency.

Note that this partially duplicates the semantics of NLM_F_MATCH as
described by RFC 3549, par. 3.1.1. Instead of setting a filter based on
the size of the netlink message, properly support NLM_F_MATCH, by
setting a filter via ip_filter_fib_dump_req() and setting the filter_set
flag.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: New patch, split from 6/8

 net/ipv6/ip6_fib.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index bc5cb359c8a6..54bbc97beb6f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -570,15 +570,18 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 
 	if (cb->strict_check) {
 		int err;
-
 		err = ip_filter_fib_dump_req(net, nlh, &arg.filter, cb, true);
 		if (err < 0)
 			return err;
-	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
-		struct rtmsg *rtm = nlmsg_data(nlh);
-
-		if (rtm->rtm_flags & RTM_F_PREFIX)
-			arg.filter.flags = RTM_F_PREFIX;
+		arg.filter.filter_set = 1;
+	} else if (nlh->nlmsg_flags & NLM_F_MATCH) {
+		res = ip_filter_fib_dump_req(net, nlh, &arg.filter, cb, false);
+		if (res) {
+			if (res == -ENODEV)
+				res = 0;
+			goto out;
+		}
+		arg.filter.filter_set = 1;
 	}
 
 	w = (void *)cb->args[2];
-- 
2.20.1



* [PATCH 7/8] ipv6: Dump route exceptions too in rt6_dump_route()
  2019-06-15  1:32 [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
                   ` (5 preceding siblings ...)
  2019-06-15  1:32 ` [PATCH 6/8] ipv6: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent Stefano Brivio
@ 2019-06-15  1:32 ` Stefano Brivio
  2019-06-15  1:32 ` [PATCH 8/8] ip6_fib: Don't discard nodes with valid routing information in fib6_locate_1() Stefano Brivio
  7 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  1:32 UTC (permalink / raw)
  To: David Miller, David Ahern, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

Since commit 2b760fcf5cfb ("ipv6: hook up exception table to store dst
cache"), route exceptions reside in a separate hash table, and won't be
found by walking the FIB, so they won't be dumped to userspace in
response to an RTM_GETROUTE request.

This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
no longer have any effect:

 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
 # ip -6 route list cache
 # ip -6 route flush cache
 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium

because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
by listing all the routes, and deleting them with RTM_DELROUTE one by one.
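
Per entry, a list-then-delete flush roughly amounts to turning each dumped route message into a deletion request for the same route. The sketch below is an assumption about the general approach, not iproute2's actual code: demo_make_delroute() is hypothetical, and requesting an ACK is a choice made here. If the dump returns no cached routes, there is nothing to convert, which is why the flush silently does nothing.

```c
#include <assert.h>
#include <linux/netlink.h>   /* struct nlmsghdr, NLM_F_REQUEST, NLM_F_ACK */
#include <linux/rtnetlink.h> /* RTM_NEWROUTE, RTM_DELROUTE */

/* Rewrite a route message received from a dump into a deletion request:
 * the payload (rtmsg plus attributes) is kept as-is. */
static void demo_make_delroute(struct nlmsghdr *nlh, unsigned int seq)
{
	assert(nlh->nlmsg_type == RTM_NEWROUTE);
	nlh->nlmsg_type = RTM_DELROUTE;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	nlh->nlmsg_seq = seq;
	nlh->nlmsg_pid = 0;	/* addressed to the kernel */
}
```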

If cached routes are requested using the RTM_F_CLONED flag (together with
strict checking or the NLM_F_MATCH flag), look up exceptions in the hash
table associated with the current fib6_info in rt6_dump_route(), and, if
present and not expired, add them to the dump.
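
The "not expired" test is typically a wraparound-safe comparison against jiffies. The sketch below follows the convention used for IPv4 nexthop exceptions in patch 4/8, where an expiry of 0 means the entry doesn't expire and time_after(jiffies, expires) means it has expired. The IPv6 exception code tracks expiry with its own fields, so treat this as an illustration only. demo_time_after() reproduces the arithmetic of the kernel's time_after() macro without its type checks.

```c
#include <assert.h>
#include <stdbool.h>

/* True if a is after b; stays correct across counter wraparound as
 * long as the two values are less than half the range apart. */
static bool demo_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* expires == 0: never expires; otherwise dumpable until 'now' passes it. */
static bool demo_exception_dumpable(unsigned long expires, unsigned long now)
{
	return !expires || !demo_time_after(now, expires);
}
```

The last test case below checks the wraparound behaviour: an expiry that has already wrapped past zero is still in the future for a 'now' close to ULONG_MAX.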

We might be unable to dump all the entries for a given node in a single
message, so keep track of how many entries were handled for the current
node in fib6_walker, and skip that amount in case we start from the same
partially dumped node.

Note that, with the current version of iproute2, this only fixes
'ip -6 route list cache': unlike the list command, the flush command
doesn't pass RTM_F_CLONED, so 'ip -6 route flush cache' is still unable
to fetch the routes to be flushed. This will be addressed in a patch for
iproute2.
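
The selection rule described in the v4 changelog below can be written as a small predicate. This is a hypothetical sketch (demo_entry_selected() is not a kernel function), and it assumes, as in the IPv4 patch 4/8, that both routes and exceptions are dumped when no filter is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/rtnetlink.h> /* RTM_F_CLONED */

/* With a filter set, RTM_F_CLONED selects exactly one class of entries:
 * set -> only cached routes, clear -> only FIB routes. */
static bool demo_entry_selected(bool filter_set, unsigned int req_flags,
				bool entry_is_cached)
{
	if (!filter_set)
		return true;
	return entry_is_cached == !!(req_flags & RTM_F_CLONED);
}
```

With a flush-style request that doesn't pass RTM_F_CLONED, cached entries are never selected, matching the behaviour described above.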

To flush cached routes, a procfs entry could be introduced instead: that's
how it works for IPv4. We already have a rt6_flush_exception() function
ready to be wired to it. However, this would not solve the issue for
listing.

Versions of iproute2 and kernel tested:

                    iproute2
kernel             4.14.0   4.15.0   4.19.0   5.0.0   5.1.0    5.1.0, patched
 3.18    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.4     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.9     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.14    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.15    list
         flush
 4.19    list
         flush
 5.0     list
         flush
 5.1     list
         flush
 with    list        +        +        +        +       +            +
 fix     flush                                                       +

v4:
  - split NLM_F_MATCH and strict check handling into separate patches
  - filter routes using RTM_F_CLONED: if it's not set, only return
    non-cached routes, and if it's set, only return cached routes:
    change requested by David Ahern and Martin Lau. This implies that
    iproute2 needs a separate patch to be able to flush IPv6 cached
    routes. This is not ideal because we can't fix the breakage caused
    by 2b760fcf5cfb entirely in kernel. However, two years have passed
    since then, and this makes it more tolerable
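The v4 filtering semantics can be summarised as a small decision function. This is a hypothetical sketch mirroring how rt6_dump_route() in the diff below derives dump_routes and dump_exceptions; 'filter_set' stands for filter->filter_set.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/rtnetlink.h>	/* RTM_F_CLONED */

/* Which of the two sets (FIB entries, exceptions) a dump returns */
struct dump_sel {
	bool routes;
	bool exceptions;
};

/* Without a filter both sets are dumped; with a filter, RTM_F_CLONED
 * selects exactly one of them.
 */
static struct dump_sel select_dump(bool filter_set, unsigned int rtm_flags)
{
	struct dump_sel s = { .routes = true, .exceptions = true };

	if (filter_set) {
		s.exceptions = rtm_flags & RTM_F_CLONED;
		s.routes = !s.exceptions;
	}
	assert(s.routes || s.exceptions);	/* never an empty selection */
	return s;
}
```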

v3:
  - more descriptive comment about expired exceptions in rt6_dump_route()
  - swap return values of rt6_dump_route() (suggested by Martin Lau)
  - don't zero skip_in_node in case we don't dump anything in a given pass
    (also suggested by Martin Lau)
  - remove check on RTM_F_CLONED altogether: in the current UAPI semantics,
    it's just a flag indicating that the route was cloned, not a filter on
    routes

v2: Add tracking of the number of entries to be skipped in the current node
    after a partial dump. As we restart from the same node, if not all the
    exceptions for a given node fit in a single message, the dump would
    otherwise never terminate, as pointed out by Martin Lau. This is a
    concrete possibility: setting up a large number of exceptions for the
    same route actually triggers the issue, as suggested by David Ahern.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
This will cause a non-trivial conflict with commit cc5c073a693f
("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
an equivalent patch against net-next, if it helps.

 include/net/ip6_fib.h   |  1 +
 include/net/ip6_route.h |  2 +-
 net/ipv6/ip6_fib.c      | 14 ++++++--
 net/ipv6/route.c        | 74 +++++++++++++++++++++++++++++++++++++----
 4 files changed, 81 insertions(+), 10 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 855b352b660f..5909a9d8ff67 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -312,6 +312,7 @@ struct fib6_walker {
 	enum fib6_walk_state state;
 	unsigned int skip;
 	unsigned int count;
+	unsigned int skip_in_node;
 	int (*func)(struct fib6_walker *);
 	void *args;
 };
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 4790beaa86e0..b66c4aac56ab 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -178,7 +178,7 @@ struct rt6_rtnl_dump_arg {
 	struct fib_dump_filter filter;
 };
 
-int rt6_dump_route(struct fib6_info *f6i, void *p_arg);
+int rt6_dump_route(struct fib6_info *f6i, void *p_arg, unsigned int skip);
 void rt6_mtu_change(struct net_device *dev, unsigned int mtu);
 void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
 void rt6_clean_tohost(struct net *net, struct in6_addr *gateway);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 54bbc97beb6f..65cd47a84bcc 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -469,12 +469,19 @@ static int fib6_dump_node(struct fib6_walker *w)
 	struct fib6_info *rt;
 
 	for_each_fib6_walker_rt(w) {
-		res = rt6_dump_route(rt, w->args);
-		if (res < 0) {
+		res = rt6_dump_route(rt, w->args, w->skip_in_node);
+		if (res >= 0) {
 			/* Frame is full, suspend walking */
 			w->leaf = rt;
+
+			/* We'll restart from this node, so if some routes were
+			 * already dumped, skip them next time.
+			 */
+			w->skip_in_node += res;
+
 			return 1;
 		}
+		w->skip_in_node = 0;
 
 		/* Multipath routes are dumped in one route with the
 		 * RTA_MULTIPATH attribute. Jump 'rt' to point to the
@@ -526,6 +533,7 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 	if (cb->args[4] == 0) {
 		w->count = 0;
 		w->skip = 0;
+		w->skip_in_node = 0;
 
 		spin_lock_bh(&table->tb6_lock);
 		res = fib6_walk(net, w);
@@ -541,6 +549,7 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 			w->state = FWS_INIT;
 			w->node = w->root;
 			w->skip = w->count;
+			w->skip_in_node = 0;
 		} else
 			w->skip = 0;
 
@@ -2041,6 +2050,7 @@ static void fib6_clean_tree(struct net *net, struct fib6_node *root,
 	c.w.func = fib6_clean_node;
 	c.w.count = 0;
 	c.w.skip = 0;
+	c.w.skip_in_node = 0;
 	c.func = func;
 	c.sernum = sernum;
 	c.arg = arg;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0f60eb3a2873..46bbd8f37da7 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4854,33 +4854,93 @@ static bool fib6_info_uses_dev(const struct fib6_info *f6i,
 	return false;
 }
 
-int rt6_dump_route(struct fib6_info *rt, void *p_arg)
+/* Return -1 if done with node, number of handled routes on partial dump */
+int rt6_dump_route(struct fib6_info *rt, void *p_arg, unsigned int skip)
 {
 	struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
+	bool dump_exceptions = true, dump_routes = true;
 	struct fib_dump_filter *filter = &arg->filter;
+	struct rt6_exception_bucket *bucket;
 	unsigned int flags = NLM_F_MULTI;
+	struct rt6_exception *rt6_ex;
 	struct net *net = arg->net;
+	int i, count = 0;
 
 	if (rt == net->ipv6.fib6_null_entry)
-		return 0;
+		return -1;
 
 	if ((filter->flags & RTM_F_PREFIX) &&
 	    !(rt->fib6_flags & RTF_PREFIX_RT)) {
 		/* success since this is not a prefix route */
-		return 1;
+		return -1;
 	}
 	if (filter->filter_set) {
 		if ((filter->rt_type && rt->fib6_type != filter->rt_type) ||
 		    (filter->dev && !fib6_info_uses_dev(rt, filter->dev)) ||
 		    (filter->protocol && rt->fib6_protocol != filter->protocol)) {
-			return 1;
+			return -1;
 		}
 		flags |= NLM_F_DUMP_FILTERED;
+		dump_routes = !(dump_exceptions = filter->flags & RTM_F_CLONED);
+	}
+
+	if (dump_routes) {
+		if (skip) {
+			skip--;
+		} else {
+			if (rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL,
+					  0, RTM_NEWROUTE,
+					  NETLINK_CB(arg->cb->skb).portid,
+					  arg->cb->nlh->nlmsg_seq, flags)) {
+				return 0;
+			}
+			count++;
+		}
+	}
+
+	if (!dump_exceptions)
+		return -1;
+
+	bucket = rcu_dereference(rt->rt6i_exception_bucket);
+	if (!bucket)
+		return -1;
+
+	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
+		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
+			if (skip) {
+				skip--;
+				continue;
+			}
+
+			/* Expiration of entries doesn't bump sernum, insertion
+			 * does. Removal is triggered by insertion, so we can
+			 * rely on the fact that if entries change between two
+			 * partial dumps, this node is scanned again completely,
+			 * see rt6_insert_exception() and fib6_dump_table().
+			 *
+			 * Count expired entries we go through as handled
+			 * entries that we'll skip next time, in case of partial
+			 * node dump. Otherwise, if entries expire meanwhile,
+			 * we'll skip the wrong amount.
+			 */
+			if (rt6_check_expired(rt6_ex->rt6i)) {
+				count++;
+				continue;
+			}
+
+			if (rt6_fill_node(net, arg->skb, rt, &rt6_ex->rt6i->dst,
+					  NULL, NULL, 0, RTM_NEWROUTE,
+					  NETLINK_CB(arg->cb->skb).portid,
+					  arg->cb->nlh->nlmsg_seq, flags)) {
+				return count;
+			}
+
+			count++;
+		}
+		bucket++;
 	}
 
-	return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
-			     RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
-			     arg->cb->nlh->nlmsg_seq, flags);
+	return -1;
 }
 
 static int inet6_rtm_valid_getroute_req(struct sk_buff *skb,
-- 
2.20.1


* [PATCH 8/8] ip6_fib: Don't discard nodes with valid routing information in fib6_locate_1()
  2019-06-15  1:32 [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
                   ` (6 preceding siblings ...)
  2019-06-15  1:32 ` [PATCH 7/8] ipv6: Dump route exceptions too in rt6_dump_route() Stefano Brivio
@ 2019-06-15  1:32 ` Stefano Brivio
  7 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  1:32 UTC (permalink / raw)
  To: David Miller, David Ahern, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

When we perform an inexact match on FIB nodes via fib6_locate_1(), longer
prefixes will be preferred to shorter ones. However, a node with a higher
fn_bit value than some other node might have no valid routing
information.

In this case, we'll pick that node, but it will be discarded by the check
on RTN_RTINFO in fib6_locate(), and we might miss nodes with valid routing
information but with lower fn_bit value.

This is apparent when a routing exception is created for a default route:
 # ip -6 route list
 fc00:1::/64 dev veth_A-R1 proto kernel metric 256 pref medium
 fc00:2::/64 dev veth_A-R2 proto kernel metric 256 pref medium
 fc00:4::1 via fc00:2::2 dev veth_A-R2 metric 1024 pref medium
 fe80::/64 dev veth_A-R1 proto kernel metric 256 pref medium
 fe80::/64 dev veth_A-R2 proto kernel metric 256 pref medium
 default via fc00:1::2 dev veth_A-R1 metric 1024 pref medium
 # ip -6 route list cache
 fc00:4::1 via fc00:2::2 dev veth_A-R2 metric 1024 expires 593sec mtu 1500 pref medium
 fc00:3::1 via fc00:1::2 dev veth_A-R1 metric 1024 expires 593sec mtu 1500 pref medium
 # ip -6 route flush cache    # node for default route is discarded
 Failed to send flush request: No such process
 # ip -6 route list cache
 fc00:3::1 via fc00:1::2 dev veth_A-R1 metric 1024 expires 586sec mtu 1500 pref medium

Check right away whether the node has the RTN_RTINFO flag set before
updating the 'prev' pointer, which tracks the longest matching prefix
found so far.
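The effect of the one-line fix can be shown on a flattened model of the nodes visited while descending the tree. This is a hypothetical simplification: the real fib6_locate_1() also compares the address prefix at each node and returns early on an exact prefix-length match, both of which are omitted here. 'check_rtinfo' switches between the old and the fixed behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node on the lookup path: prefix length and whether it
 * carries routing information (RTN_RTINFO).
 */
struct node {
	int fn_bit;
	bool rtinfo;
};

/* Return the longest-prefix candidate on 'path' (ordered by increasing
 * fn_bit). Without 'check_rtinfo', the last node visited wins even if it
 * has no routes, and the caller then discards it.
 */
static const struct node *locate_inexact(const struct node *path, int n,
					 bool check_rtinfo)
{
	const struct node *prev = NULL;

	for (int i = 0; i < n; i++) {
		assert(i == 0 || path[i].fn_bit > path[i - 1].fn_bit);
		if (!check_rtinfo || path[i].rtinfo)
			prev = &path[i];
	}
	return prev;
}
```

With a default route (fn_bit 0) and a longer, route-less intermediate node, the old behaviour picks the intermediate node, which fib6_locate() then discards; the fixed one returns the default route's node.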

Fixes: 38fbeeeeccdb ("ipv6: prepare fib6_locate() for exception table")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: No changes

v3: No changes

v2: No changes

 net/ipv6/ip6_fib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 65cd47a84bcc..7644cd5cdf15 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1545,7 +1545,8 @@ static struct fib6_node *fib6_locate_1(struct fib6_node *root,
 		if (plen == fn->fn_bit)
 			return fn;
 
-		prev = fn;
+		if (fn->fn_flags & RTN_RTINFO)
+			prev = fn;
 
 next:
 		/*
-- 
2.20.1


* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-15  1:32 ` [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version Stefano Brivio
@ 2019-06-15  2:54   ` David Ahern
  2019-06-15  3:13     ` Stefano Brivio
  0 siblings, 1 reply; 22+ messages in thread
From: David Ahern @ 2019-06-15  2:54 UTC (permalink / raw)
  To: Stefano Brivio, David Miller, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

On 6/14/19 7:32 PM, Stefano Brivio wrote:
> ip_valid_fib_dump_req() does two things: performs strict checking on
> netlink attributes for dump requests, and sets a dump filter if netlink
> attributes require it.
> 
> We might want to just set a filter, without performing strict validation.
> 
> Rename it to ip_filter_fib_dump_req(), and add a 'strict' boolean
> argument that must be set if strict validation is requested.
> 
> This patch doesn't introduce any functional changes.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> v4: New patch
> 

Can you explain why this patch is needed? The existing function requires
strict mode and is needed to enable any of the kernel side filtering
beyond the RTM_F_CLONED setting in rtm_flags.

* Re: [PATCH net v4 2/8] ipv4: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent
  2019-06-15  1:32 ` [PATCH net v4 2/8] ipv4: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent Stefano Brivio
@ 2019-06-15  3:13   ` David Ahern
  2019-06-15  3:23     ` Stefano Brivio
  0 siblings, 1 reply; 22+ messages in thread
From: David Ahern @ 2019-06-15  3:13 UTC (permalink / raw)
  To: Stefano Brivio, David Miller, Martin KaFai Lau
  Cc: Jianlin Shi, Wei Wang, Eric Dumazet, Matti Vaittinen, netdev

On 6/14/19 7:32 PM, Stefano Brivio wrote:
> Socket option NETLINK_GET_STRICT_CHK, quoting from commit 89d35528d17d
> ("netlink: Add new socket option to enable strict checking on dumps"),
> is used to "request strict checking of headers and attributes on dump
> requests".
> 
> If some attributes are set (including flags), setting this option causes
> dump functions to filter results according to these attributes, via the
> filter_set flag. However, if strict checking is requested, this should
> imply that we also filter results based on flags that are *not* set.

I don't agree with that comment. If a request does not specify a bit or
specify an attribute on the request, it is a wildcard in the sense of
nothing to be considered when matching records to be returned.
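The wildcard semantics described here correspond to the filter checks in rt6_dump_route() from patch 7/8 above, where a zero filter field is simply not compared. A minimal, hypothetical sketch of that matching; the real filter holds a device pointer, replaced here by an interface index for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified route and dump filter */
struct route {
	unsigned char type;
	unsigned char protocol;
	int ifindex;
};

struct filter {
	unsigned char type;	/* 0: any type */
	unsigned char protocol;	/* 0: any protocol */
	int ifindex;		/* 0: any device */
};

/* A route matches unless a *specified* filter field differs from it */
static bool route_matches(const struct route *rt, const struct filter *f)
{
	assert(rt && f);
	if (f->type && rt->type != f->type)
		return false;
	if (f->protocol && rt->protocol != f->protocol)
		return false;
	if (f->ifindex && rt->ifindex != f->ifindex)
		return false;
	return true;
}
```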


> 
> This is currently not the case, at least for IPv4 FIB dumps: if the
> RTM_F_CLONED flag is not set, and strict checking is required, we should
> not return routes with the RTM_F_CLONED flag set.

IPv4 currently ignores the CLONED flag and just returns - regardless of
whether strict checking is enabled. This is the original short cut added
many years ago.

> 
> Set the filter_set flag whenever strict checking is requested, limiting
> the scope to IPv4 FIB dumps for the moment being, as other users of the
> flag might not present this inconsistency.
> 
> Note that this partially duplicates the semantics of NLM_F_MATCH as
> described by RFC 3549, par. 3.1.1. Instead of setting a filter based on
> the size of the netlink message, properly support NLM_F_MATCH, by
> setting a filter via ip_filter_fib_dump_req() and setting the filter_set
> flag.
> 

your commit description is very confusing given the end goal. can you
explain again?

* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-15  2:54   ` David Ahern
@ 2019-06-15  3:13     ` Stefano Brivio
  2019-06-15  3:16       ` David Ahern
  0 siblings, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  3:13 UTC (permalink / raw)
  To: David Ahern
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On Fri, 14 Jun 2019 20:54:49 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/14/19 7:32 PM, Stefano Brivio wrote:
> > ip_valid_fib_dump_req() does two things: performs strict checking on
> > netlink attributes for dump requests, and sets a dump filter if netlink
> > attributes require it.
> > 
> > We might want to just set a filter, without performing strict validation.
> > 
> > Rename it to ip_filter_fib_dump_req(), and add a 'strict' boolean
> > argument that must be set if strict validation is requested.
> > 
> > This patch doesn't introduce any functional changes.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > v4: New patch
> >   
> 
> Can you explain why this patch is needed? The existing function requires
> strict mode and is needed to enable any of the kernel side filtering
> beyond the RTM_F_CLONED setting in rtm_flags.

It's mostly to have proper NLM_F_MATCH support. Let's pick an iproute2
version without strict checking support (< 5.0), that sets NLM_F_MATCH
though. Then we need this check:

	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm)))

and to set filter parameters not just based on flags (i.e. RTM_F_CLONED),
but also on table, protocol, etc.

For example one might want to: 'ip route list cache table main', and this
is then taken into account in fn_trie_dump_leaf() and rt6_dump_route().

Reusing this function avoids a nice amount of duplicated code and allows
to have an almost common path with strict checking.

-- 
Stefano

* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-15  3:13     ` Stefano Brivio
@ 2019-06-15  3:16       ` David Ahern
  2019-06-15  3:27         ` Stefano Brivio
  0 siblings, 1 reply; 22+ messages in thread
From: David Ahern @ 2019-06-15  3:16 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On 6/14/19 9:13 PM, Stefano Brivio wrote:
> On Fri, 14 Jun 2019 20:54:49 -0600
> David Ahern <dsahern@gmail.com> wrote:
> 
>> On 6/14/19 7:32 PM, Stefano Brivio wrote:
>>> ip_valid_fib_dump_req() does two things: performs strict checking on
>>> netlink attributes for dump requests, and sets a dump filter if netlink
>>> attributes require it.
>>>
>>> We might want to just set a filter, without performing strict validation.
>>>
>>> Rename it to ip_filter_fib_dump_req(), and add a 'strict' boolean
>>> argument that must be set if strict validation is requested.
>>>
>>> This patch doesn't introduce any functional changes.
>>>
>>> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
>>> ---
>>> v4: New patch
>>>   
>>
>> Can you explain why this patch is needed? The existing function requires
>> strict mode and is needed to enable any of the kernel side filtering
>> beyond the RTM_F_CLONED setting in rtm_flags.
> 
> It's mostly to have proper NLM_F_MATCH support. Let's pick an iproute2
> version without strict checking support (< 5.0), that sets NLM_F_MATCH
> though. Then we need this check:
> 
> 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm)))

but that check existed long before any of the strict checking and kernel
side filtering was added.

> 
> and to set filter parameters not just based on flags (i.e. RTM_F_CLONED),
> but also on table, protocol, etc.

and to do that you *must* have strict checking. There is no way to trust
userspace without that strict flag set because iproute2 for the longest
time sent the wrong header for almost all dump requests.

> 
> For example one might want to: 'ip route list cache table main', and this
> is then taken into account in fn_trie_dump_leaf() and rt6_dump_route().
> 
> Reusing this function avoids a nice amount of duplicated code and allows
> to have an almost common path with strict checking.
> 


* Re: [PATCH net v4 2/8] ipv4: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent
  2019-06-15  3:13   ` David Ahern
@ 2019-06-15  3:23     ` Stefano Brivio
  2019-06-17 13:29       ` David Ahern
  0 siblings, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  3:23 UTC (permalink / raw)
  To: David Ahern
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On Fri, 14 Jun 2019 21:13:38 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/14/19 7:32 PM, Stefano Brivio wrote:
> > Socket option NETLINK_GET_STRICT_CHK, quoting from commit 89d35528d17d
> > ("netlink: Add new socket option to enable strict checking on dumps"),
> > is used to "request strict checking of headers and attributes on dump
> > requests".
> > 
> > If some attributes are set (including flags), setting this option causes
> > dump functions to filter results according to these attributes, via the
> > filter_set flag. However, if strict checking is requested, this should
> > imply that we also filter results based on flags that are *not* set.  
> 
> I don't agree with that comment. If a request does not specify a bit or
> specify an attribute on the request, it is a wildcard in the sense of
> nothing to be considered when matching records to be returned.

This is what I had in v1. Then:

On Thu, 6 Jun 2019 16:47:00 -0600
David Ahern <dsahern@gmail.com> wrote:

> That's the use case I was targeting:
> 1. fib dumps - RTM_F_CLONED not set
> 2. exception dump - RTM_F_CLONED set

On Mon, 10 Jun 2019 15:38:06 -0600
David Ahern <dsahern@gmail.com> wrote:

> By that I mean without the CLONED flag, no exceptions are returned
> (default FIB dump). With the CLONED flag only exceptions are returned.

and this looks to me like a sensible way (if strict checking is
requested, or if NLM_F_MATCH is passed) to filter the results.

> > This is currently not the case, at least for IPv4 FIB dumps: if the
> > RTM_F_CLONED flag is not set, and strict checking is required, we should
> > not return routes with the RTM_F_CLONED flag set.  
> 
> IPv4 currently ignores the CLONED flag and just returns - regardless of
> whether strict checking is enabled. This is the original short cut added
> many years ago.

Sure, and I'm removing that, because there's no way to fetch cached
routes otherwise.

> > Set the filter_set flag whenever strict checking is requested, limiting
> > the scope to IPv4 FIB dumps for the moment being, as other users of the
> > flag might not present this inconsistency.
> > 
> > Note that this partially duplicates the semantics of NLM_F_MATCH as
> > described by RFC 3549, par. 3.1.1. Instead of setting a filter based on
> > the size of the netlink message, properly support NLM_F_MATCH, by
> > setting a filter via ip_filter_fib_dump_req() and setting the filter_set
> > flag.
> >   
> 
> your commit description is very confusing given the end goal. can you
> explain again?

1. we need a way to filter on cached routes

2. RTM_F_CLONED, by itself, doesn't specify a filter

3. how do we turn that into a filter? NLM_F_MATCH, says RFC 3549

4. but if strict checking is requested, you also turn some attributes
   and flags into filters -- so let's make that apply to RTM_F_CLONED
   too, I don't see any reason why that should be special

-- 
Stefano

* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-15  3:16       ` David Ahern
@ 2019-06-15  3:27         ` Stefano Brivio
  2019-06-16 20:04           ` Stefano Brivio
  2019-06-17 13:18           ` David Ahern
  0 siblings, 2 replies; 22+ messages in thread
From: Stefano Brivio @ 2019-06-15  3:27 UTC (permalink / raw)
  To: David Ahern
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On Fri, 14 Jun 2019 21:16:54 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/14/19 9:13 PM, Stefano Brivio wrote:
> > On Fri, 14 Jun 2019 20:54:49 -0600
> > David Ahern <dsahern@gmail.com> wrote:
> >   
> >> On 6/14/19 7:32 PM, Stefano Brivio wrote:  
> >>> ip_valid_fib_dump_req() does two things: performs strict checking on
> >>> netlink attributes for dump requests, and sets a dump filter if netlink
> >>> attributes require it.
> >>>
> >>> We might want to just set a filter, without performing strict validation.
> >>>
> >>> Rename it to ip_filter_fib_dump_req(), and add a 'strict' boolean
> >>> argument that must be set if strict validation is requested.
> >>>
> >>> This patch doesn't introduce any functional changes.
> >>>
> >>> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> >>> ---
> >>> v4: New patch
> >>>     
> >>
> >> Can you explain why this patch is needed? The existing function requires
> >> strict mode and is needed to enable any of the kernel side filtering
> >> beyond the RTM_F_CLONED setting in rtm_flags.  
> > 
> > It's mostly to have proper NLM_F_MATCH support. Let's pick an iproute2
> > version without strict checking support (< 5.0), that sets NLM_F_MATCH
> > though. Then we need this check:
> > 
> > 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm)))  
> 
> but that check existed long before any of the strict checking and kernel
> side filtering was added.

Indeed. And now I'm recycling it, even if strict checking is not
requested.

> > and to set filter parameters not just based on flags (i.e. RTM_F_CLONED),
> > but also on table, protocol, etc.  
> 
> and to do that you *must* have strict checking. There is no way to trust
> userspace without that strict flag set because iproute2 for the longest
> time sent the wrong header for almost all dump requests.

So you're implying that:

- we shouldn't support NLM_F_MATCH

- we should keep this broken for iproute2 < 5.0.0?

I guess this might be acceptable, but please state it clearly.

By the way, if really needed, we can do strict checking even if not
requested. But this might add more and more userspace breakage, I guess.

-- 
Stefano

* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-15  3:27         ` Stefano Brivio
@ 2019-06-16 20:04           ` Stefano Brivio
  2019-06-17 13:38             ` David Ahern
  2019-06-17 13:18           ` David Ahern
  1 sibling, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2019-06-16 20:04 UTC (permalink / raw)
  To: David Ahern
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On Sat, 15 Jun 2019 05:27:05 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Fri, 14 Jun 2019 21:16:54 -0600
> David Ahern <dsahern@gmail.com> wrote:
> 
> > On 6/14/19 9:13 PM, Stefano Brivio wrote:  
> > > On Fri, 14 Jun 2019 20:54:49 -0600
> > > David Ahern <dsahern@gmail.com> wrote:
> > >     
> > >> On 6/14/19 7:32 PM, Stefano Brivio wrote:    
> > >>> ip_valid_fib_dump_req() does two things: performs strict checking on
> > >>> netlink attributes for dump requests, and sets a dump filter if netlink
> > >>> attributes require it.
> > >>>
> > >>> We might want to just set a filter, without performing strict validation.
> > >>>
> > >>> Rename it to ip_filter_fib_dump_req(), and add a 'strict' boolean
> > >>> argument that must be set if strict validation is requested.
> > >>>
> > >>> This patch doesn't introduce any functional changes.
> > >>>
> > >>> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > >>> ---
> > >>> v4: New patch
> > >>>       
> > >>
> > >> Can you explain why this patch is needed? The existing function requires
> > >> strict mode and is needed to enable any of the kernel side filtering
> > >> beyond the RTM_F_CLONED setting in rtm_flags.    
> > > 
> > > It's mostly to have proper NLM_F_MATCH support. Let's pick an iproute2
> > > version without strict checking support (< 5.0), that sets NLM_F_MATCH
> > > though. Then we need this check:
> > > 
> > > 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm)))    
> > 
> > but that check existed long before any of the strict checking and kernel
> > side filtering was added.  
> 
> Indeed. And now I'm recycling it, even if strict checking is not
> requested.
> 
> > > and to set filter parameters not just based on flags (i.e. RTM_F_CLONED),
> > > but also on table, protocol, etc.    
> > 
> > and to do that you *must* have strict checking. There is no way to trust
> > userspace without that strict flag set because iproute2 for the longest
> > time sent the wrong header for almost all dump requests.  
> 
> So you're implying that:
> 
> - we shouldn't support NLM_F_MATCH
> 
> - we should keep this broken for iproute2 < 5.0.0?
> 
> I guess this might be acceptable, but please state it clearly.
> 
> By the way, if really needed, we can do strict checking even if not
> requested. But this might add more and more userspace breakage, I guess.

Maybe I have a simpler alternative that doesn't allow filters without
strict checking (your concern above) and fixes the issue for most
iproute2 versions (except for 'ip -6 route flush cache' from 5.0.0 to the
current, unpatched version). I would also like to avoid introducing
this bug:

- 'ip route list cache table main' currently returns nothing (bug)

- 'ip route list cache table main' with v1-v3 would return all cached
  routes (new bug)

and retain this feature from v4:

- if neither NLM_F_MATCH nor other filters are set, dump all cached and
  uncached routes. There's no way to get cached and uncached ones with
  a single request, otherwise. This would also fit RFC 3549.

We could do this:

- strict checking enabled (iproute2 >= 5.0.0):
  - in inet{,6}_dump_fib(): if NLM_F_MATCH is set, set
    filter->filter_set in any case

  - in fn_trie_dump_leaf() and rt6_dump_route(): use filter->filter_set
    to decide if we want to filter depending on RTM_F_CLONED being
    set/unset. If other filters (rt_type, dev, protocol) are not set,
    they are still wildcards (existing implementation)

- no strict checking (iproute2 < 5.0.0):
  - we can't filter consistently, so apply no filters at all: dump all
    the routes (filter->filter_set not set), cached and uncached. That
    means more netlink messages, but no spam as iproute2 filters them
    anyway, and list/flush cache commands work again.

I would drop 1/8, turn 2/8 and 6/8 into a straightforward:

 	if (cb->strict_check) {
 		err = ip_valid_fib_dump_req(net, nlh, &filter, cb);
 		if (err < 0)
 			return err;
+		if (nlh->nlmsg_flags & NLM_F_MATCH)
+			filter.filter_set = 1;
 	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
 		struct rtmsg *rtm = nlmsg_data(nlh);

and other patches remain the same.
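The decision this proposal describes can be sketched as a predicate telling whether results get filtered (and thus whether RTM_F_CLONED selects cached or uncached routes). This is a hypothetical model: 'attrs_set' stands for a filter already derived from validated attributes by ip_valid_fib_dump_req().

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/netlink.h>	/* NLM_F_MATCH, NLM_F_DUMP */

/* Filter only with strict checking: either validated attributes set a
 * filter, or NLM_F_MATCH is present. Without strict checking, nothing is
 * filtered and both cached and uncached routes are dumped.
 */
static bool want_filter(bool strict_check, unsigned short nlmsg_flags,
			bool attrs_set)
{
	if (!strict_check)
		return false;
	return attrs_set || (nlmsg_flags & NLM_F_MATCH);
}
```

Note that <linux/netlink.h> defines NLM_F_DUMP as (NLM_F_ROOT | NLM_F_MATCH), so a request built with NLM_F_DUMP already carries NLM_F_MATCH.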

What do you think?

-- 
Stefano

* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-15  3:27         ` Stefano Brivio
  2019-06-16 20:04           ` Stefano Brivio
@ 2019-06-17 13:18           ` David Ahern
  1 sibling, 0 replies; 22+ messages in thread
From: David Ahern @ 2019-06-17 13:18 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On 6/14/19 9:27 PM, Stefano Brivio wrote:
>>>> Can you explain why this patch is needed? The existing function requires
>>>> strict mode and is needed to enable any of the kernel side filtering
>>>> beyond the RTM_F_CLONED setting in rtm_flags.  
>>>
>>> It's mostly to have proper NLM_F_MATCH support. Let's pick an iproute2
>>> version without strict checking support (< 5.0), that sets NLM_F_MATCH
>>> though. Then we need this check:
>>>
>>> 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm)))  
>>
>> but that check existed long before any of the strict checking and kernel
>> side filtering was added.
> 
> Indeed. And now I'm recycling it, even if strict checking is not
> requested.
> 
>>> and to set filter parameters not just based on flags (i.e. RTM_F_CLONED),
>>> but also on table, protocol, etc.  
>>
>> and to do that you *must* have strict checking. There is no way to trust
>> userspace without that strict flag set because iproute2 for the longest
>> time sent the wrong header for almost all dump requests.
> 
> So you're implying that:
> 
> - we shouldn't support NLM_F_MATCH
> 
> - we should keep this broken for iproute2 < 5.0.0?
> 
> I guess this might be acceptable, but please state it clearly.
> 
> By the way, if really needed, we can do strict checking even if not
> requested. But this might add more and more userspace breakage, I guess.
> 

Prior to 5.0 and strict checking, iproute2 was sending ifinfomsg as the
header struct - which is wrong for routes. ifi_flags just happens to
have the same offset as rtm_flags so the check for RTM_F_CLONED is ok,
but nothing else sent in the get request (e.g., potentially appended
attributes) can be trusted, so the !strict path you are adding with
nlmsg_parse_deprecated is wrong. The kernel side filter argument can be
used and treating RTM_F_CLONED as a filter is ok, but not the new
parsing code.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net v4 2/8] ipv4: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent
  2019-06-15  3:23     ` Stefano Brivio
@ 2019-06-17 13:29       ` David Ahern
  0 siblings, 0 replies; 22+ messages in thread
From: David Ahern @ 2019-06-17 13:29 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On 6/14/19 9:23 PM, Stefano Brivio wrote:
> 
> 1. we need a way to filter on cached routes
> 
> 2. RTM_F_CLONED, by itself, doesn't specify a filter
> 
> 3. how do we turn that into a filter? NLM_F_MATCH, says RFC 3549
> 
> 4. but if strict checking is requested, you also turn some attributes
>    and flags into filters -- so let's make that apply to RTM_F_CLONED
>    too, I don't see any reason why that should be special
> 

I guess I am arguing (and Martin seems to agree with the end goal) that
RTM_F_CLONED is special. There are really 2 "databases" to be dumped
here: FIB entries and exceptions. Which one to dump is controlled by
RTM_F_CLONED.

* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-16 20:04           ` Stefano Brivio
@ 2019-06-17 13:38             ` David Ahern
  2019-06-17 14:13               ` Stefano Brivio
  0 siblings, 1 reply; 22+ messages in thread
From: David Ahern @ 2019-06-17 13:38 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On 6/16/19 2:04 PM, Stefano Brivio wrote:
> We could do this:
> 
> - strict checking enabled (iproute2 >= 5.0.0):
>   - in inet{,6}_dump_fib(): if NLM_F_MATCH is set, set
>     filter->filter_set in any case
> 
>   - in fn_trie_dump_leaf() and rt6_dump_route(): use filter->filter_set
>     to decide if we want to filter depending on RTM_F_CLONED being
>     set/unset. If other filters (rt_type, dev, protocol) are not set,
>     they are still wildcards (existing implementation)
> 
> - no strict checking (iproute2 < 5.0.0):
>   - we can't filter consistently, so apply no filters at all: dump all
>     the routes (filter->filter_set not set), cached and uncached. That
>     means more netlink messages, but no spam as iproute2 filters them
>     anyway, and list/flush cache commands work again.
> 
> I would drop 1/8, turn 2/8 and 6/8 into a straightforward:
> 
>  	if (cb->strict_check) {
>  		err = ip_valid_fib_dump_req(net, nlh, &filter, cb);
>  		if (err < 0)
>  			return err;
> +		if (nlh->nlmsg_flags & NLM_F_MATCH)
> +			filter.filter_set = 1;
>  	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
>  		struct rtmsg *rtm = nlmsg_data(nlh);
> 
> and other patches remain the same.
> 
> What do you think?
> 

With strict checking (5.0 and forward):
- RTM_F_CLONED NOT set means dump only FIB entries
- RTM_F_CLONED set means dump only exceptions

Without strict checking (old iproute2 on any kernel):
- dump all, userspace has to sort

Kernel side this can be handled with new field, dump_exceptions, in the
filter that defaults to true and then is reset in the strict path if the
flag is not set.

* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-17 13:38             ` David Ahern
@ 2019-06-17 14:13               ` Stefano Brivio
  2019-06-17 17:06                 ` David Ahern
  0 siblings, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2019-06-17 14:13 UTC (permalink / raw)
  To: David Ahern
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On Mon, 17 Jun 2019 07:38:54 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/16/19 2:04 PM, Stefano Brivio wrote:
> > We could do this:
> > 
> > - strict checking enabled (iproute2 >= 5.0.0):
> >   - in inet{,6}_dump_fib(): if NLM_F_MATCH is set, set
> >     filter->filter_set in any case
> > 
> >   - in fn_trie_dump_leaf() and rt6_dump_route(): use filter->filter_set
> >     to decide if we want to filter depending on RTM_F_CLONED being
> >     set/unset. If other filters (rt_type, dev, protocol) are not set,
> >     they are still wildcards (existing implementation)
> > 
> > - no strict checking (iproute2 < 5.0.0):
> >   - we can't filter consistently, so apply no filters at all: dump all
> >     the routes (filter->filter_set not set), cached and uncached. That
> >     means more netlink messages, but no spam as iproute2 filters them
> >     anyway, and list/flush cache commands work again.
> > 
> > I would drop 1/8, turn 2/8 and 6/8 into a straightforward:
> > 
> >  	if (cb->strict_check) {
> >  		err = ip_valid_fib_dump_req(net, nlh, &filter, cb);
> >  		if (err < 0)
> >  			return err;
> > +		if (nlh->nlmsg_flags & NLM_F_MATCH)
> > +			filter.filter_set = 1;
> >  	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
> >  		struct rtmsg *rtm = nlmsg_data(nlh);
> > 
> > and other patches remain the same.
> > 
> > What do you think?
> >   
> 
> With strict checking (5.0 and forward):
> - RTM_F_CLONED NOT set means dump only FIB entries
> - RTM_F_CLONED set means dump only exceptions

Okay. Should we really ignore the RFC and NLM_F_MATCH though? If we add
field(s) to the filter, it comes almost for free, something like:

	if (nlh->nlmsg_flags & NLM_F_MATCH)
		filter->dump_exceptions = rtm->rtm_flags & RTM_F_CLONED;

instead of:

	filter->dump_exceptions = rtm->rtm_flags & RTM_F_CLONED;

> Without strict checking (old iproute2 on any kernel):
> - dump all, userspace has to sort
> 
> Kernel side this can be handled with new field, dump_exceptions, in the
> filter that defaults to true and then is reset in the strict path if the
> flag is not set.

I guess we need to add two fields, we'll need a 'dump_routes' too.

Otherwise, the dump functions can't distinguish between the three cases
('no strict checking', 'strict checking and RTM_F_CLONED', 'strict
checking and no RTM_F_CLONED'). How would you do this with a single
additional field?

-- 
Stefano

* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-17 14:13               ` Stefano Brivio
@ 2019-06-17 17:06                 ` David Ahern
  2019-06-17 18:28                   ` Stefano Brivio
  0 siblings, 1 reply; 22+ messages in thread
From: David Ahern @ 2019-06-17 17:06 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On 6/17/19 8:13 AM, Stefano Brivio wrote:
>>
>> With strict checking (5.0 and forward):
>> - RTM_F_CLONED NOT set means dump only FIB entries
>> - RTM_F_CLONED set means dump only exceptions
> 
> Okay. Should we really ignore the RFC and NLM_F_MATCH though? If we add
> field(s) to the filter, it comes almost for free, something like:
> 
> 	if (nlh->nlmsg_flags & NLM_F_MATCH)
> 		filter->dump_exceptions = rtm->rtm_flags & RTM_F_CLONED;
> 
> instead of:
> 
> 	filter->dump_exceptions = rtm->rtm_flags & RTM_F_CLONED;

This is where you keep losing me. iproute2 has always set NLM_F_MATCH on
dump requests, so that flag cannot be used as a discriminator here.

> 
>> Without strict checking (old iproute2 on any kernel):
>> - dump all, userspace has to sort
>>
>> Kernel side this can be handled with new field, dump_exceptions, in the
>> filter that defaults to true and then is reset in the strict path if the
>> flag is not set.
> 
> I guess we need to add two fields, we'll need a 'dump_routes' too.
> 
> Otherwise, the dump functions can't distinguish between the three cases
> ('no strict checking', 'strict checking and RTM_F_CLONED', 'strict
> checking and no RTM_F_CLONED'). How would you do this with a single
> additional field?
> 

sure, separate fields are needed for the pre-strict mode use case. So, I
take it we are converging on this:

1. non-strict mode, dump both (FIB entries and exceptions). Userspace
has to filter. This is the legacy behavior you are trying to restore.

2. strict mode:
   a. dump only FIB entries if RTM_F_CLONED is not set
   b. dump only exception entries if RTM_F_CLONED is set

Agreed?

Martin, others, ok with this?

* Re: [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version
  2019-06-17 17:06                 ` David Ahern
@ 2019-06-17 18:28                   ` Stefano Brivio
  0 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2019-06-17 18:28 UTC (permalink / raw)
  To: David Ahern
  Cc: David Miller, Martin KaFai Lau, Jianlin Shi, Wei Wang,
	Eric Dumazet, Matti Vaittinen, netdev

On Mon, 17 Jun 2019 11:06:51 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/17/19 8:13 AM, Stefano Brivio wrote:
> >>
> >> With strict checking (5.0 and forward):
> >> - RTM_F_CLONED NOT set means dump only FIB entries
> >> - RTM_F_CLONED set means dump only exceptions  
> > 
> > Okay. Should we really ignore the RFC and NLM_F_MATCH though? If we add
> > field(s) to the filter, it comes almost for free, something like:
> > 
> > 	if (nlh->nlmsg_flags & NLM_F_MATCH)
> > 		filter->dump_exceptions = rtm->rtm_flags & RTM_F_CLONED;
> > 
> > instead of:
> > 
> > 	filter->dump_exceptions = rtm->rtm_flags & RTM_F_CLONED;  
> 
> This is where you keep losing me. iproute2 has always set NLM_F_MATCH on
> dump requests, so that flag cannot be used as a discriminator here.

iproute2, yes, but some other users (I'm not aware of any, so I have no
examples) might *very* vaguely follow the RFC and expect consistent
results. That was my only point here. Most likely just a theoretical
one.

> >   
> >> Without strict checking (old iproute2 on any kernel):
> >> - dump all, userspace has to sort
> >>
> >> Kernel side this can be handled with new field, dump_exceptions, in the
> >> filter that defaults to true and then is reset in the strict path if the
> >> flag is not set.  
> > 
> > I guess we need to add two fields, we'll need a 'dump_routes' too.
> > 
> > Otherwise, the dump functions can't distinguish between the three cases
> > ('no strict checking', 'strict checking and RTM_F_CLONED', 'strict
> > checking and no RTM_F_CLONED'). How would you do this with a single
> > additional field?
> >   
> 
> sure, separate fields are needed for the pre-strict mode use case.

Well, they are needed, in general. They both start as true, non-strict
mode doesn't clear them, strict mode clears one. That's how I would do
it.

> So, I take it we are converging on this:
> 
> 1. non-strict mode, dump both (FIB entries and exceptions). Userspace
> has to filter. This is the legacy behavior you are trying to restore.
> 
> 2. strict mode:
>    a. dump only FIB entries if RTM_F_CLONED is not set
>    b. dump only exception entries if RTM_F_CLONED is set
> 
> Agreed?

Agreed in general, maybe let me know what you think about the
NLM_F_MATCH point above though.

-- 
Stefano

end of thread, other threads:[~2019-06-17 18:28 UTC | newest]

Thread overview: 22+ messages
2019-06-15  1:32 [PATCH net v4 0/8] Fix listing (IPv4, IPv6) and flushing (IPv6) of cached route exceptions Stefano Brivio
2019-06-15  1:32 ` [PATCH net v4 1/8] ipv4/fib_frontend: Rename ip_valid_fib_dump_req, provide non-strict version Stefano Brivio
2019-06-15  2:54   ` David Ahern
2019-06-15  3:13     ` Stefano Brivio
2019-06-15  3:16       ` David Ahern
2019-06-15  3:27         ` Stefano Brivio
2019-06-16 20:04           ` Stefano Brivio
2019-06-17 13:38             ` David Ahern
2019-06-17 14:13               ` Stefano Brivio
2019-06-17 17:06                 ` David Ahern
2019-06-17 18:28                   ` Stefano Brivio
2019-06-17 13:18           ` David Ahern
2019-06-15  1:32 ` [PATCH net v4 2/8] ipv4: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent Stefano Brivio
2019-06-15  3:13   ` David Ahern
2019-06-15  3:23     ` Stefano Brivio
2019-06-17 13:29       ` David Ahern
2019-06-15  1:32 ` [PATCH net v4 3/8] ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering Stefano Brivio
2019-06-15  1:32 ` [PATCH 4/8] ipv4: Dump routed caches if requested Stefano Brivio
2019-06-15  1:32 ` [PATCH 5/8] Revert "net/ipv6: Bail early if user only wants cloned entries" Stefano Brivio
2019-06-15  1:32 ` [PATCH 6/8] ipv6: Honour NLM_F_MATCH, make semantics of NETLINK_GET_STRICT_CHK consistent Stefano Brivio
2019-06-15  1:32 ` [PATCH 7/8] ipv6: Dump route exceptions too in rt6_dump_route() Stefano Brivio
2019-06-15  1:32 ` [PATCH 8/8] ip6_fib: Don't discard nodes with valid routing information in fib6_locate_1() Stefano Brivio
