netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
@ 2013-07-07 17:30 Hannes Frederic Sowa
  2013-07-09 21:57 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-07 17:30 UTC (permalink / raw)
  To: netdev; +Cc: yoshfuji, petrus.lt, davem

Hello!

This patch fixes the last (I currently know of) known problem with nexthop
selection and rt->n removal for me. I did some preliminary testing and it
looks good so far. I wanted to post it here early to gather further feedback,
reviews or even testing. I am not sure if I overlooked something, yet.

Things I have not checked, yet:
a) switching a router to host (I don't know if we handle this correctly,
   currently)
b) router disappearing (but should be fine)
c) ...

No signed-off by me, for now.

Thanks,

  Hannes

-- >8 --

This is a follow-up patch to 3630d40067a21d4dfbadc6002bb469ce26ac5d52
("ipv6: rt6_check_neigh should successfully verify neigh if no NUD
information are available").

Since the removal of rt->n in rt6_info we can end up with a dst ==
NULL in rt6_check_neigh. In case the kernel is not compiled with
CONFIG_IPV6_ROUTER_PREF we should also select a route with unkown
NUD state but we must not avoid doing round robin selection on routes
with the same target. So introduce and pass down a boolean ``do_rr'' to
indicate when we should update rt->rr_ptr. As soon as no route is valid
we do backtracking and do a lookup on a higher level in the fib trie.

To hold correct state on the NUD selection we need to create a neighbour
entry as soon as we try to validate a nexthop.

I changed the return value of rt6_check_neigh to:
   1 in case of the dst entry validated
  -1 in case of we had no dst_entry and we need to do rr now
  -2 in case a we had a dst_entry and it did not validate

In case of CONFIG_IPV6_ROUTER_PREF, rt6_probe does not allocate an
neighbour entry (!CONFIG_IPV6_ROUTER_PREF: rt6_probe is a nop). Because
of this, we have to create a neighbour entry on nexthop validation to
track earlier validation errors. We recheck NUD state here to shortcurcuit
NUD_NOARP neighbours.

This seems to be the least complex fix for stable and net. I'll introduce
a new route lookup flag 'idempotent' as soon as next opens to not let
ip route get trigger active NUD validation if CONFIG_IPV6_ROUTER_PREF
is enabled.

It also seems advantageous to make CONFIG_IPV6_ROUTER_PREF a runtime
switch and just select the default operation on compile-time.

Reported-by: Pierre Emeriaud <petrus.lt@gmail.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David Miller <davem@davemloft.net>
---
 net/ipv6/route.c | 67 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 44 insertions(+), 23 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index bd5fd70..6a87ced 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -531,28 +531,39 @@ static inline int rt6_check_dev(struct rt6_info *rt, int oif)
 	return 0;
 }
 
-static inline bool rt6_check_neigh(struct rt6_info *rt)
+static inline int rt6_check_neigh(struct rt6_info *rt)
 {
 	struct neighbour *neigh;
-	bool ret = false;
+	int ret = -2;
 
 	if (rt->rt6i_flags & RTF_NONEXTHOP ||
 	    !(rt->rt6i_flags & RTF_GATEWAY))
-		return true;
+		return 1;
 
 	rcu_read_lock_bh();
 	neigh = __ipv6_neigh_lookup_noref(rt->dst.dev, &rt->rt6i_gateway);
 	if (neigh) {
 		read_lock(&neigh->lock);
 		if (neigh->nud_state & NUD_VALID)
-			ret = true;
+			ret = 1;
 #ifdef CONFIG_IPV6_ROUTER_PREF
 		else if (!(neigh->nud_state & NUD_FAILED))
-			ret = true;
+			ret = 1;
 #endif
 		read_unlock(&neigh->lock);
-	} else if (IS_ENABLED(CONFIG_IPV6_ROUTER_PREF)) {
-		ret = true;
+	} else {
+		/* track state for next check */
+		neigh = __neigh_create(&nd_tbl, &rt->rt6i_gateway, rt->dst.dev,
+				       false);
+		if (!IS_ERR(neigh)) {
+			if (neigh->nud_state & NUD_VALID)
+				ret = 1;
+			else
+				ret = IS_ENABLED(CONFIG_IPV6_ROUTER_PREF) ?
+				      1 : -1;
+		} else {
+			ret = -2;
+		}
 	}
 	rcu_read_unlock_bh();
 
@@ -566,43 +577,52 @@ static int rt6_score_route(struct rt6_info *rt, int oif,
 
 	m = rt6_check_dev(rt, oif);
 	if (!m && (strict & RT6_LOOKUP_F_IFACE))
-		return -1;
+		return -2;
 #ifdef CONFIG_IPV6_ROUTER_PREF
 	m |= IPV6_DECODE_PREF(IPV6_EXTRACT_PREF(rt->rt6i_flags)) << 2;
 #endif
-	if (!rt6_check_neigh(rt) && (strict & RT6_LOOKUP_F_REACHABLE))
-		return -1;
+	if (strict & RT6_LOOKUP_F_REACHABLE) {
+		int n = rt6_check_neigh(rt);
+		if (n < 0)
+			return n;
+	}
 	return m;
 }
 
 static struct rt6_info *find_match(struct rt6_info *rt, int oif, int strict,
-				   int *mpri, struct rt6_info *match)
+				   int *mpri, struct rt6_info *match,
+				   bool *do_rr)
 {
 	int m;
+	bool match_do_rr = false;
 
 	if (rt6_check_expired(rt))
 		goto out;
 
 	m = rt6_score_route(rt, oif, strict);
-	if (m < 0)
+	if (m == -1 && !IS_ENABLED(CONFIG_IPV6_ROUTER_PREF)) {
+		match_do_rr = true;
+		m = 0; /* lowest valid score */
+	} else if (m < 0) {
 		goto out;
+	}
+
+	if (strict & RT6_LOOKUP_F_REACHABLE)
+		rt6_probe(rt);
 
 	if (m > *mpri) {
-		if (strict & RT6_LOOKUP_F_REACHABLE)
-			rt6_probe(match);
+		*do_rr = match_do_rr;
 		*mpri = m;
 		match = rt;
-	} else if (strict & RT6_LOOKUP_F_REACHABLE) {
-		rt6_probe(rt);
 	}
-
 out:
 	return match;
 }
 
 static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
 				     struct rt6_info *rr_head,
-				     u32 metric, int oif, int strict)
+				     u32 metric, int oif, int strict,
+				     bool *do_rr)
 {
 	struct rt6_info *rt, *match;
 	int mpri = -1;
@@ -610,10 +630,10 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
 	match = NULL;
 	for (rt = rr_head; rt && rt->rt6i_metric == metric;
 	     rt = rt->dst.rt6_next)
-		match = find_match(rt, oif, strict, &mpri, match);
+		match = find_match(rt, oif, strict, &mpri, match, do_rr);
 	for (rt = fn->leaf; rt && rt != rr_head && rt->rt6i_metric == metric;
 	     rt = rt->dst.rt6_next)
-		match = find_match(rt, oif, strict, &mpri, match);
+		match = find_match(rt, oif, strict, &mpri, match, do_rr);
 
 	return match;
 }
@@ -622,15 +642,16 @@ static struct rt6_info *rt6_select(struct fib6_node *fn, int oif, int strict)
 {
 	struct rt6_info *match, *rt0;
 	struct net *net;
+	bool do_rr = false;
 
 	rt0 = fn->rr_ptr;
 	if (!rt0)
 		fn->rr_ptr = rt0 = fn->leaf;
 
-	match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict);
+	match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict,
+			     &do_rr);
 
-	if (!match &&
-	    (strict & RT6_LOOKUP_F_REACHABLE)) {
+	if (do_rr) {
 		struct rt6_info *next = rt0->dst.rt6_next;
 
 		/* no entries matched; do round-robin */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-07 17:30 [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF Hannes Frederic Sowa
@ 2013-07-09 21:57 ` Hannes Frederic Sowa
  2013-07-10  7:54   ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-09 21:57 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

Hello Nicolas!

I am currently trying to fix the nexthop selection for kernels compiled
without support for CONFIG_IPV6_ROUTER_PREF. I attached my current patch
below. While testing I got the following kernel panic in ecmp code:

[   80.144667] ------------[ cut here ]------------
[   80.145172] kernel BUG at net/ipv6/ip6_fib.c:733!
[   80.145172] invalid opcode: 0000 [#1] SMP 
[   80.145172] Modules linked in: 8021q nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer virtio_balloon snd soundcore i2c_piix4 i2c_core virtio_net virtio_blk
[   80.145172] CPU: 1 PID: 786 Comm: ping6 Not tainted 3.10.0+ #118
[   80.145172] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   80.145172] task: ffff880117fa0000 ti: ffff880118770000 task.ti: ffff880118770000
[   80.145172] RIP: 0010:[<ffffffff815f3b5d>]  [<ffffffff815f3b5d>] fib6_add+0x75d/0x830
[   80.145172] RSP: 0018:ffff880118771798  EFLAGS: 00010202
[   80.145172] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011350e480
[   80.145172] RDX: ffff88011350e238 RSI: 0000000000000004 RDI: ffff88011350f738
[   80.145172] RBP: ffff880118771848 R08: ffff880117903280 R09: 0000000000000001
[   80.145172] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011350f680
[   80.145172] R13: ffff880117903280 R14: ffff880118771890 R15: ffff88011350ef90
[   80.145172] FS:  00007f02b5127740(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
[   80.145172] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   80.145172] CR2: 00007f981322a000 CR3: 00000001181b1000 CR4: 00000000000006e0
[   80.145172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   80.145172] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   80.145172] Stack:
[   80.145172]  0000000000000001 ffff880100000000 ffff880100000000 ffff880117903280
[   80.145172]  0000000000000000 ffff880119a4cf00 0000000000000400 00000000000007fa
[   80.145172]  0000000000000000 0000000000000000 0000000000000000 ffff88011350f680
[   80.145172] Call Trace:
[   80.145172]  [<ffffffff815eeceb>] ? rt6_bind_peer+0x4b/0x90
[   80.145172]  [<ffffffff815ed985>] __ip6_ins_rt+0x45/0x70
[   80.145172]  [<ffffffff815eee35>] ip6_ins_rt+0x35/0x40
[   80.145172]  [<ffffffff815ef1e4>] ip6_pol_route.isra.44+0x3a4/0x4b0
[   80.145172]  [<ffffffff815ef34a>] ip6_pol_route_output+0x2a/0x30
[   80.145172]  [<ffffffff81616077>] fib6_rule_action+0xd7/0x210
[   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
[   80.145172]  [<ffffffff81553026>] fib_rules_lookup+0xc6/0x140
[   80.145172]  [<ffffffff81616374>] fib6_rule_lookup+0x44/0x80
[   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
[   80.145172]  [<ffffffff815edea3>] ip6_route_output+0x73/0xb0
[   80.145172]  [<ffffffff815dfdf3>] ip6_dst_lookup_tail+0x2c3/0x2e0
[   80.145172]  [<ffffffff813007b1>] ? list_del+0x11/0x40
[   80.145172]  [<ffffffff81082a4c>] ? remove_wait_queue+0x3c/0x50
[   80.145172]  [<ffffffff815dfe4d>] ip6_dst_lookup_flow+0x3d/0xa0
[   80.145172]  [<ffffffff815fda77>] rawv6_sendmsg+0x267/0xc20
[   80.145172]  [<ffffffff815a8a83>] inet_sendmsg+0x63/0xb0
[   80.145172]  [<ffffffff8128eb93>] ? selinux_socket_sendmsg+0x23/0x30
[   80.145172]  [<ffffffff815218d6>] sock_sendmsg+0xa6/0xd0
[   80.145172]  [<ffffffff81524a68>] SYSC_sendto+0x128/0x180
[   80.145172]  [<ffffffff8109825c>] ? update_curr+0xec/0x170
[   80.145172]  [<ffffffff81041d09>] ? kvm_clock_get_cycles+0x9/0x10
[   80.145172]  [<ffffffff810afd1e>] ? __getnstimeofday+0x3e/0xd0
[   80.145172]  [<ffffffff8152509e>] SyS_sendto+0xe/0x10
[   80.145172]  [<ffffffff8164efd9>] system_call_fastpath+0x16/0x1b
[   80.145172] Code: fe ff ff 41 f6 45 2a 06 0f 85 ca fe ff ff 49 8b 7e 08 4c 89 ee e8 94 ef ff ff e9 b9 fe ff ff 48 8b 82 28 05 00 00 e9 01 ff ff ff <0f> 0b 49 8b 54 24 30 0d 00 00 40 00 89 83 14 01 00 00 48 89 53 
[   80.145172] RIP  [<ffffffff815f3b5d>] fib6_add+0x75d/0x830
[   80.145172]  RSP <ffff880118771798>
[   80.387413] ---[ end trace 02f20b7a8b81ed95 ]---
[   80.390154] Kernel panic - not syncing: Fatal exception in interrupt

The relevant code is:

    725                 /* For each sibling in the list, increment the counter of
    726                  * siblings. BUG() if counters does not match, list of siblings
    727                  * is broken!
    728                  */
    729                 rt6i_nsiblings = 0;
    730                 list_for_each_entry_safe(sibling, temp_sibling,
    731                                          &rt->rt6i_siblings, rt6i_siblings) {
    732                         sibling->rt6i_nsiblings++;
    733                         BUG_ON(sibling->rt6i_nsiblings != rt->rt6i_nsiblings);
    734                         rt6i_nsiblings++;
    735                 }
    736                 BUG_ON(rt6i_nsiblings != rt->rt6i_nsiblings);
    737         }

Currently, I don't know if I my patch is to blame or something in the ecmp logic.

Another thing I just noticed is:

   1240         /* Remove this entry from other siblings */
   1241         if (rt->rt6i_nsiblings) {
   1242                 struct rt6_info *sibling, *next_sibling;
   1243 
   1244                 list_for_each_entry_safe(sibling, next_sibling,
   1245                                          &rt->rt6i_siblings, rt6i_siblings)
   1246                         sibling->rt6i_nsiblings--;
   1247                 rt->rt6i_nsiblings = 0;
   1248                 list_del_init(&rt->rt6i_siblings);
   1249         }

Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
start iterating from fn->leaf? But this does not seem to cause it,
because my trace does not report any calls to fib6_del_route.

You could try reproduce it by having an interface autoconfigured with
a default router with NUD_VALID neighbour. I then added an unused vlan
interface (vid 100 in my case) and added the following ip addresses:

ip -6 a a 2001:ffff::1/64 dev eth0.100
ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31 nexthop via 2001:ffff::32 nexthop via 2001:ffff::33

(all nexthops should not be reachable)

After starting a ping6 2000::1 the box should panic soon, after the
first nexthop entry times out.

Perhaps you could give me a hint?

Thanks a lot,

  Hannes


[PATCH RFC] ipv6: fix route selection if kernel not compiled with CONFIG_IPV6_ROUTER_PREF

This is a follow-up patch to 3630d40067a21d4dfbadc6002bb469ce26ac5d52
("ipv6: rt6_check_neigh should successfully verify neigh if no NUD
information are available").

Since the removal of rt->n in rt6_info we can end up with a dst ==
NULL in rt6_check_neigh. In case the kernel is not compiled with
CONFIG_IPV6_ROUTER_PREF we should also select a route with unkown
NUD state but we must not avoid doing round robin selection on routes
with the same target. So introduce and pass down a boolean ``do_rr'' to
indicate when we should update rt->rr_ptr. As soon as no route is valid
we do backtracking and do a lookup on a higher level in the fib trie.

To hold correct state on the NUD selection we need to create a neighbour
entry as soon as we tried to validate a nexthop.

I changed the return value of rt6_check_neigh to:
   1 in case of the dst entry validated
  -1 in case of we had no dst_entry and we need to do rr now
  -2 in case a we had a dst_entry and it did not validate

In case of CONFIG_IPV6_ROUTER_PREF, rt6_probe does not allocate an
neighbour entry (!CONFIG_IPV6_ROUTER_PREF: rt6_probe is a nop). Because
of this, we have to create a neighbour entry on nexthop validation to
track earlier validation errors. We recheck NUD state here to shortcurcuit
NUD_NOARP neighbours.

This seems to be the least complex fix for stable and net. I'll introduce
a new route lookup flag 'idempotent' as soon as next opens to not let
ip route get trigger active NUD validation if CONFIG_IPV6_ROUTER_PREF
is enabled. Currently we trigger active NUD validation if compiled with
CONFIG_IPV6_ROUTER_PREF.

It also seems advantageous to make CONFIG_IPV6_ROUTER_PREF a runtime
switch and just select the default operation on compile-time.

v2:
a) improved rt6_check_neigh logic and documented return values

Reported-by: Pierre Emeriaud <petrus.lt@gmail.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
---
 net/ipv6/route.c | 62 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 39 insertions(+), 23 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index bd5fd70..c5d9e68 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -531,28 +531,34 @@ static inline int rt6_check_dev(struct rt6_info *rt, int oif)
 	return 0;
 }
 
-static inline bool rt6_check_neigh(struct rt6_info *rt)
+/* This function checks if a neighbour is reachable for routing
+ * purposes. It returns -2 in case the neighbour should not get
+ * selected as a viable router, -1 in case it should get selected with
+ * lowest score and afterwards trying roundrobin. 1 indicates a
+ * successfull verification.
+ */
+static inline int rt6_check_neigh(struct rt6_info *rt)
 {
 	struct neighbour *neigh;
-	bool ret = false;
+	int ret = -2;
 
 	if (rt->rt6i_flags & RTF_NONEXTHOP ||
 	    !(rt->rt6i_flags & RTF_GATEWAY))
-		return true;
+		return 1;
 
 	rcu_read_lock_bh();
 	neigh = __ipv6_neigh_lookup_noref(rt->dst.dev, &rt->rt6i_gateway);
 	if (neigh) {
 		read_lock(&neigh->lock);
 		if (neigh->nud_state & NUD_VALID)
-			ret = true;
+			ret = 1;
 #ifdef CONFIG_IPV6_ROUTER_PREF
 		else if (!(neigh->nud_state & NUD_FAILED))
-			ret = true;
+			ret = 1;
 #endif
 		read_unlock(&neigh->lock);
-	} else if (IS_ENABLED(CONFIG_IPV6_ROUTER_PREF)) {
-		ret = true;
+	} else {
+		ret = IS_ENABLED(CONFIG_IPV6_ROUTER_PREF) ? 1 : -1;
 	}
 	rcu_read_unlock_bh();
 
@@ -566,43 +572,52 @@ static int rt6_score_route(struct rt6_info *rt, int oif,
 
 	m = rt6_check_dev(rt, oif);
 	if (!m && (strict & RT6_LOOKUP_F_IFACE))
-		return -1;
+		return -2;
 #ifdef CONFIG_IPV6_ROUTER_PREF
 	m |= IPV6_DECODE_PREF(IPV6_EXTRACT_PREF(rt->rt6i_flags)) << 2;
 #endif
-	if (!rt6_check_neigh(rt) && (strict & RT6_LOOKUP_F_REACHABLE))
-		return -1;
+	if (strict & RT6_LOOKUP_F_REACHABLE) {
+		int n = rt6_check_neigh(rt);
+		if (n < 0)
+			return n;
+	}
 	return m;
 }
 
 static struct rt6_info *find_match(struct rt6_info *rt, int oif, int strict,
-				   int *mpri, struct rt6_info *match)
+				   int *mpri, struct rt6_info *match,
+				   bool *do_rr)
 {
 	int m;
+	bool match_do_rr = false;
 
 	if (rt6_check_expired(rt))
 		goto out;
 
 	m = rt6_score_route(rt, oif, strict);
-	if (m < 0)
+	if (m == -1 && !IS_ENABLED(CONFIG_IPV6_ROUTER_PREF)) {
+		match_do_rr = true;
+		m = 0; /* lowest valid score */
+	} else if (m < 0) {
 		goto out;
+	}
+
+	if (strict & RT6_LOOKUP_F_REACHABLE)
+		rt6_probe(rt);
 
 	if (m > *mpri) {
-		if (strict & RT6_LOOKUP_F_REACHABLE)
-			rt6_probe(match);
+		*do_rr = match_do_rr;
 		*mpri = m;
 		match = rt;
-	} else if (strict & RT6_LOOKUP_F_REACHABLE) {
-		rt6_probe(rt);
 	}
-
 out:
 	return match;
 }
 
 static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
 				     struct rt6_info *rr_head,
-				     u32 metric, int oif, int strict)
+				     u32 metric, int oif, int strict,
+				     bool *do_rr)
 {
 	struct rt6_info *rt, *match;
 	int mpri = -1;
@@ -610,10 +625,10 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
 	match = NULL;
 	for (rt = rr_head; rt && rt->rt6i_metric == metric;
 	     rt = rt->dst.rt6_next)
-		match = find_match(rt, oif, strict, &mpri, match);
+		match = find_match(rt, oif, strict, &mpri, match, do_rr);
 	for (rt = fn->leaf; rt && rt != rr_head && rt->rt6i_metric == metric;
 	     rt = rt->dst.rt6_next)
-		match = find_match(rt, oif, strict, &mpri, match);
+		match = find_match(rt, oif, strict, &mpri, match, do_rr);
 
 	return match;
 }
@@ -622,15 +637,16 @@ static struct rt6_info *rt6_select(struct fib6_node *fn, int oif, int strict)
 {
 	struct rt6_info *match, *rt0;
 	struct net *net;
+	bool do_rr = false;
 
 	rt0 = fn->rr_ptr;
 	if (!rt0)
 		fn->rr_ptr = rt0 = fn->leaf;
 
-	match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict);
+	match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict,
+			     &do_rr);
 
-	if (!match &&
-	    (strict & RT6_LOOKUP_F_REACHABLE)) {
+	if (do_rr) {
 		struct rt6_info *next = rt0->dst.rt6_next;
 
 		/* no entries matched; do round-robin */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-09 21:57 ` Hannes Frederic Sowa
@ 2013-07-10  7:54   ` Nicolas Dichtel
  2013-07-10  9:28     ` Nicolas Dichtel
  2013-07-10 11:15     ` Hannes Frederic Sowa
  0 siblings, 2 replies; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-10  7:54 UTC (permalink / raw)
  To: netdev, yoshfuji, petrus.lt, davem

Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
> Hello Nicolas!
>
> I am currently trying to fix the nexthop selection for kernels compiled
> without support for CONFIG_IPV6_ROUTER_PREF. I attached my current patch
> below. While testing I got the following kernel panic in ecmp code:
>
> [   80.144667] ------------[ cut here ]------------
> [   80.145172] kernel BUG at net/ipv6/ip6_fib.c:733!
> [   80.145172] invalid opcode: 0000 [#1] SMP
> [   80.145172] Modules linked in: 8021q nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer virtio_balloon snd soundcore i2c_piix4 i2c_core virtio_net virtio_blk
> [   80.145172] CPU: 1 PID: 786 Comm: ping6 Not tainted 3.10.0+ #118
> [   80.145172] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   80.145172] task: ffff880117fa0000 ti: ffff880118770000 task.ti: ffff880118770000
> [   80.145172] RIP: 0010:[<ffffffff815f3b5d>]  [<ffffffff815f3b5d>] fib6_add+0x75d/0x830
> [   80.145172] RSP: 0018:ffff880118771798  EFLAGS: 00010202
> [   80.145172] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011350e480
> [   80.145172] RDX: ffff88011350e238 RSI: 0000000000000004 RDI: ffff88011350f738
> [   80.145172] RBP: ffff880118771848 R08: ffff880117903280 R09: 0000000000000001
> [   80.145172] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011350f680
> [   80.145172] R13: ffff880117903280 R14: ffff880118771890 R15: ffff88011350ef90
> [   80.145172] FS:  00007f02b5127740(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
> [   80.145172] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [   80.145172] CR2: 00007f981322a000 CR3: 00000001181b1000 CR4: 00000000000006e0
> [   80.145172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   80.145172] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [   80.145172] Stack:
> [   80.145172]  0000000000000001 ffff880100000000 ffff880100000000 ffff880117903280
> [   80.145172]  0000000000000000 ffff880119a4cf00 0000000000000400 00000000000007fa
> [   80.145172]  0000000000000000 0000000000000000 0000000000000000 ffff88011350f680
> [   80.145172] Call Trace:
> [   80.145172]  [<ffffffff815eeceb>] ? rt6_bind_peer+0x4b/0x90
> [   80.145172]  [<ffffffff815ed985>] __ip6_ins_rt+0x45/0x70
> [   80.145172]  [<ffffffff815eee35>] ip6_ins_rt+0x35/0x40
> [   80.145172]  [<ffffffff815ef1e4>] ip6_pol_route.isra.44+0x3a4/0x4b0
> [   80.145172]  [<ffffffff815ef34a>] ip6_pol_route_output+0x2a/0x30
> [   80.145172]  [<ffffffff81616077>] fib6_rule_action+0xd7/0x210
> [   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
> [   80.145172]  [<ffffffff81553026>] fib_rules_lookup+0xc6/0x140
> [   80.145172]  [<ffffffff81616374>] fib6_rule_lookup+0x44/0x80
> [   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
> [   80.145172]  [<ffffffff815edea3>] ip6_route_output+0x73/0xb0
> [   80.145172]  [<ffffffff815dfdf3>] ip6_dst_lookup_tail+0x2c3/0x2e0
> [   80.145172]  [<ffffffff813007b1>] ? list_del+0x11/0x40
> [   80.145172]  [<ffffffff81082a4c>] ? remove_wait_queue+0x3c/0x50
> [   80.145172]  [<ffffffff815dfe4d>] ip6_dst_lookup_flow+0x3d/0xa0
> [   80.145172]  [<ffffffff815fda77>] rawv6_sendmsg+0x267/0xc20
> [   80.145172]  [<ffffffff815a8a83>] inet_sendmsg+0x63/0xb0
> [   80.145172]  [<ffffffff8128eb93>] ? selinux_socket_sendmsg+0x23/0x30
> [   80.145172]  [<ffffffff815218d6>] sock_sendmsg+0xa6/0xd0
> [   80.145172]  [<ffffffff81524a68>] SYSC_sendto+0x128/0x180
> [   80.145172]  [<ffffffff8109825c>] ? update_curr+0xec/0x170
> [   80.145172]  [<ffffffff81041d09>] ? kvm_clock_get_cycles+0x9/0x10
> [   80.145172]  [<ffffffff810afd1e>] ? __getnstimeofday+0x3e/0xd0
> [   80.145172]  [<ffffffff8152509e>] SyS_sendto+0xe/0x10
> [   80.145172]  [<ffffffff8164efd9>] system_call_fastpath+0x16/0x1b
> [   80.145172] Code: fe ff ff 41 f6 45 2a 06 0f 85 ca fe ff ff 49 8b 7e 08 4c 89 ee e8 94 ef ff ff e9 b9 fe ff ff 48 8b 82 28 05 00 00 e9 01 ff ff ff <0f> 0b 49 8b 54 24 30 0d 00 00 40 00 89 83 14 01 00 00 48 89 53
> [   80.145172] RIP  [<ffffffff815f3b5d>] fib6_add+0x75d/0x830
> [   80.145172]  RSP <ffff880118771798>
> [   80.387413] ---[ end trace 02f20b7a8b81ed95 ]---
> [   80.390154] Kernel panic - not syncing: Fatal exception in interrupt
>
> The relevant code is:
>
>      725                 /* For each sibling in the list, increment the counter of
>      726                  * siblings. BUG() if counters does not match, list of siblings
>      727                  * is broken!
>      728                  */
>      729                 rt6i_nsiblings = 0;
>      730                 list_for_each_entry_safe(sibling, temp_sibling,
>      731                                          &rt->rt6i_siblings, rt6i_siblings) {
>      732                         sibling->rt6i_nsiblings++;
>      733                         BUG_ON(sibling->rt6i_nsiblings != rt->rt6i_nsiblings);
>      734                         rt6i_nsiblings++;
>      735                 }
>      736                 BUG_ON(rt6i_nsiblings != rt->rt6i_nsiblings);
>      737         }
>
> Currently, I don't know if I my patch is to blame or something in the ecmp logic.
>
> Another thing I just noticed is:
>
>     1240         /* Remove this entry from other siblings */
>     1241         if (rt->rt6i_nsiblings) {
>     1242                 struct rt6_info *sibling, *next_sibling;
>     1243
>     1244                 list_for_each_entry_safe(sibling, next_sibling,
>     1245                                          &rt->rt6i_siblings, rt6i_siblings)
>     1246                         sibling->rt6i_nsiblings--;
>     1247                 rt->rt6i_nsiblings = 0;
>     1248                 list_del_init(&rt->rt6i_siblings);
>     1249         }
>
> Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
> start iterating from fn->leaf? But this does not seem to cause it,
> because my trace does not report any calls to fib6_del_route.
Note sure to follow you, but all siblings are listed in rt6i_siblings, so it 
must be enough.

>
> You could try reproduce it by having an interface autoconfigured with
> a default router with NUD_VALID neighbour. I then added an unused vlan
> interface (vid 100 in my case) and added the following ip addresses:
>
> ip -6 a a 2001:ffff::1/64 dev eth0.100
> ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31 nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
>
> (all nexthops should not be reachable)
>
> After starting a ping6 2000::1 the box should panic soon, after the
> first nexthop entry times out.
>
> Perhaps you could give me a hint?
I will run some tests with your patch. Will see.

I assume you didn't reproduce this without your patch.


Regards,
Nicolas

>
> Thanks a lot,
>
>    Hannes
>
>
> [PATCH RFC] ipv6: fix route selection if kernel not compiled with CONFIG_IPV6_ROUTER_PREF
>
> This is a follow-up patch to 3630d40067a21d4dfbadc6002bb469ce26ac5d52
> ("ipv6: rt6_check_neigh should successfully verify neigh if no NUD
> information are available").
>
> Since the removal of rt->n in rt6_info we can end up with a dst ==
> NULL in rt6_check_neigh. In case the kernel is not compiled with
> CONFIG_IPV6_ROUTER_PREF we should also select a route with unkown
> NUD state but we must not avoid doing round robin selection on routes
> with the same target. So introduce and pass down a boolean ``do_rr'' to
> indicate when we should update rt->rr_ptr. As soon as no route is valid
> we do backtracking and do a lookup on a higher level in the fib trie.
>
> To hold correct state on the NUD selection we need to create a neighbour
> entry as soon as we tried to validate a nexthop.
>
> I changed the return value of rt6_check_neigh to:
>     1 in case of the dst entry validated
>    -1 in case of we had no dst_entry and we need to do rr now
>    -2 in case a we had a dst_entry and it did not validate
>
> In case of CONFIG_IPV6_ROUTER_PREF, rt6_probe does not allocate an
> neighbour entry (!CONFIG_IPV6_ROUTER_PREF: rt6_probe is a nop). Because
> of this, we have to create a neighbour entry on nexthop validation to
> track earlier validation errors. We recheck NUD state here to shortcurcuit
> NUD_NOARP neighbours.
>
> This seems to be the least complex fix for stable and net. I'll introduce
> a new route lookup flag 'idempotent' as soon as next opens to not let
> ip route get trigger active NUD validation if CONFIG_IPV6_ROUTER_PREF
> is enabled. Currently we trigger active NUD validation if compiled with
> CONFIG_IPV6_ROUTER_PREF.
>
> It also seems advantageous to make CONFIG_IPV6_ROUTER_PREF a runtime
> switch and just select the default operation on compile-time.
>
> v2:
> a) improved rt6_check_neigh logic and documented return values
>
> Reported-by: Pierre Emeriaud <petrus.lt@gmail.com>
> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
> ---
>   net/ipv6/route.c | 62 +++++++++++++++++++++++++++++++++++---------------------
>   1 file changed, 39 insertions(+), 23 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index bd5fd70..c5d9e68 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -531,28 +531,34 @@ static inline int rt6_check_dev(struct rt6_info *rt, int oif)
>   	return 0;
>   }
>
> -static inline bool rt6_check_neigh(struct rt6_info *rt)
> +/* This function checks if a neighbour is reachable for routing
> + * purposes. It returns -2 in case the neighbour should not get
> + * selected as a viable router, -1 in case it should get selected with
> + * lowest score and afterwards trying roundrobin. 1 indicates a
> + * successfull verification.
> + */
> +static inline int rt6_check_neigh(struct rt6_info *rt)
>   {
>   	struct neighbour *neigh;
> -	bool ret = false;
> +	int ret = -2;
>
>   	if (rt->rt6i_flags & RTF_NONEXTHOP ||
>   	    !(rt->rt6i_flags & RTF_GATEWAY))
> -		return true;
> +		return 1;
>
>   	rcu_read_lock_bh();
>   	neigh = __ipv6_neigh_lookup_noref(rt->dst.dev, &rt->rt6i_gateway);
>   	if (neigh) {
>   		read_lock(&neigh->lock);
>   		if (neigh->nud_state & NUD_VALID)
> -			ret = true;
> +			ret = 1;
>   #ifdef CONFIG_IPV6_ROUTER_PREF
>   		else if (!(neigh->nud_state & NUD_FAILED))
> -			ret = true;
> +			ret = 1;
>   #endif
>   		read_unlock(&neigh->lock);
> -	} else if (IS_ENABLED(CONFIG_IPV6_ROUTER_PREF)) {
> -		ret = true;
> +	} else {
> +		ret = IS_ENABLED(CONFIG_IPV6_ROUTER_PREF) ? 1 : -1;
>   	}
>   	rcu_read_unlock_bh();
>
> @@ -566,43 +572,52 @@ static int rt6_score_route(struct rt6_info *rt, int oif,
>
>   	m = rt6_check_dev(rt, oif);
>   	if (!m && (strict & RT6_LOOKUP_F_IFACE))
> -		return -1;
> +		return -2;
>   #ifdef CONFIG_IPV6_ROUTER_PREF
>   	m |= IPV6_DECODE_PREF(IPV6_EXTRACT_PREF(rt->rt6i_flags)) << 2;
>   #endif
> -	if (!rt6_check_neigh(rt) && (strict & RT6_LOOKUP_F_REACHABLE))
> -		return -1;
> +	if (strict & RT6_LOOKUP_F_REACHABLE) {
> +		int n = rt6_check_neigh(rt);
> +		if (n < 0)
> +			return n;
> +	}
>   	return m;
>   }
>
>   static struct rt6_info *find_match(struct rt6_info *rt, int oif, int strict,
> -				   int *mpri, struct rt6_info *match)
> +				   int *mpri, struct rt6_info *match,
> +				   bool *do_rr)
>   {
>   	int m;
> +	bool match_do_rr = false;
>
>   	if (rt6_check_expired(rt))
>   		goto out;
>
>   	m = rt6_score_route(rt, oif, strict);
> -	if (m < 0)
> +	if (m == -1 && !IS_ENABLED(CONFIG_IPV6_ROUTER_PREF)) {
> +		match_do_rr = true;
> +		m = 0; /* lowest valid score */
> +	} else if (m < 0) {
>   		goto out;
> +	}
> +
> +	if (strict & RT6_LOOKUP_F_REACHABLE)
> +		rt6_probe(rt);
>
>   	if (m > *mpri) {
> -		if (strict & RT6_LOOKUP_F_REACHABLE)
> -			rt6_probe(match);
> +		*do_rr = match_do_rr;
>   		*mpri = m;
>   		match = rt;
> -	} else if (strict & RT6_LOOKUP_F_REACHABLE) {
> -		rt6_probe(rt);
>   	}
> -
>   out:
>   	return match;
>   }
>
>   static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
>   				     struct rt6_info *rr_head,
> -				     u32 metric, int oif, int strict)
> +				     u32 metric, int oif, int strict,
> +				     bool *do_rr)
>   {
>   	struct rt6_info *rt, *match;
>   	int mpri = -1;
> @@ -610,10 +625,10 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
>   	match = NULL;
>   	for (rt = rr_head; rt && rt->rt6i_metric == metric;
>   	     rt = rt->dst.rt6_next)
> -		match = find_match(rt, oif, strict, &mpri, match);
> +		match = find_match(rt, oif, strict, &mpri, match, do_rr);
>   	for (rt = fn->leaf; rt && rt != rr_head && rt->rt6i_metric == metric;
>   	     rt = rt->dst.rt6_next)
> -		match = find_match(rt, oif, strict, &mpri, match);
> +		match = find_match(rt, oif, strict, &mpri, match, do_rr);
>
>   	return match;
>   }
> @@ -622,15 +637,16 @@ static struct rt6_info *rt6_select(struct fib6_node *fn, int oif, int strict)
>   {
>   	struct rt6_info *match, *rt0;
>   	struct net *net;
> +	bool do_rr = false;
>
>   	rt0 = fn->rr_ptr;
>   	if (!rt0)
>   		fn->rr_ptr = rt0 = fn->leaf;
>
> -	match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict);
> +	match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict,
> +			     &do_rr);
>
> -	if (!match &&
> -	    (strict & RT6_LOOKUP_F_REACHABLE)) {
> +	if (do_rr) {
>   		struct rt6_info *next = rt0->dst.rt6_next;
>
>   		/* no entries matched; do round-robin */
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10  7:54   ` Nicolas Dichtel
@ 2013-07-10  9:28     ` Nicolas Dichtel
  2013-07-10 10:53       ` Hannes Frederic Sowa
  2013-07-10 11:15     ` Hannes Frederic Sowa
  1 sibling, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-10  9:28 UTC (permalink / raw)
  To: hannes; +Cc: netdev, yoshfuji, petrus.lt, davem

Le 10/07/2013 09:54, Nicolas Dichtel a écrit :
> Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
>> Hello Nicolas!
>>
>> I am currently trying to fix the nexthop selection for kernels compiled
>> without support for CONFIG_IPV6_ROUTER_PREF. I attached my current patch
>> below. While testing I got the following kernel panic in ecmp code:
>>
>> [   80.144667] ------------[ cut here ]------------
>> [   80.145172] kernel BUG at net/ipv6/ip6_fib.c:733!
>> [   80.145172] invalid opcode: 0000 [#1] SMP
>> [   80.145172] Modules linked in: 8021q nf_conntrack_netbios_ns
>> nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT
>> nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter
>> ebtables ip6table_filter ip6_tables snd_hda_intel snd_hda_codec snd_hwdep
>> snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer virtio_balloon snd
>> soundcore i2c_piix4 i2c_core virtio_net virtio_blk
>> [   80.145172] CPU: 1 PID: 786 Comm: ping6 Not tainted 3.10.0+ #118
>> [   80.145172] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> [   80.145172] task: ffff880117fa0000 ti: ffff880118770000 task.ti:
>> ffff880118770000
>> [   80.145172] RIP: 0010:[<ffffffff815f3b5d>]  [<ffffffff815f3b5d>]
>> fib6_add+0x75d/0x830
>> [   80.145172] RSP: 0018:ffff880118771798  EFLAGS: 00010202
>> [   80.145172] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011350e480
>> [   80.145172] RDX: ffff88011350e238 RSI: 0000000000000004 RDI: ffff88011350f738
>> [   80.145172] RBP: ffff880118771848 R08: ffff880117903280 R09: 0000000000000001
>> [   80.145172] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011350f680
>> [   80.145172] R13: ffff880117903280 R14: ffff880118771890 R15: ffff88011350ef90
>> [   80.145172] FS:  00007f02b5127740(0000) GS:ffff88011fd00000(0000)
>> knlGS:0000000000000000
>> [   80.145172] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [   80.145172] CR2: 00007f981322a000 CR3: 00000001181b1000 CR4: 00000000000006e0
>> [   80.145172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [   80.145172] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [   80.145172] Stack:
>> [   80.145172]  0000000000000001 ffff880100000000 ffff880100000000
>> ffff880117903280
>> [   80.145172]  0000000000000000 ffff880119a4cf00 0000000000000400
>> 00000000000007fa
>> [   80.145172]  0000000000000000 0000000000000000 0000000000000000
>> ffff88011350f680
>> [   80.145172] Call Trace:
>> [   80.145172]  [<ffffffff815eeceb>] ? rt6_bind_peer+0x4b/0x90
>> [   80.145172]  [<ffffffff815ed985>] __ip6_ins_rt+0x45/0x70
>> [   80.145172]  [<ffffffff815eee35>] ip6_ins_rt+0x35/0x40
>> [   80.145172]  [<ffffffff815ef1e4>] ip6_pol_route.isra.44+0x3a4/0x4b0
>> [   80.145172]  [<ffffffff815ef34a>] ip6_pol_route_output+0x2a/0x30
>> [   80.145172]  [<ffffffff81616077>] fib6_rule_action+0xd7/0x210
>> [   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
>> [   80.145172]  [<ffffffff81553026>] fib_rules_lookup+0xc6/0x140
>> [   80.145172]  [<ffffffff81616374>] fib6_rule_lookup+0x44/0x80
>> [   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
>> [   80.145172]  [<ffffffff815edea3>] ip6_route_output+0x73/0xb0
>> [   80.145172]  [<ffffffff815dfdf3>] ip6_dst_lookup_tail+0x2c3/0x2e0
>> [   80.145172]  [<ffffffff813007b1>] ? list_del+0x11/0x40
>> [   80.145172]  [<ffffffff81082a4c>] ? remove_wait_queue+0x3c/0x50
>> [   80.145172]  [<ffffffff815dfe4d>] ip6_dst_lookup_flow+0x3d/0xa0
>> [   80.145172]  [<ffffffff815fda77>] rawv6_sendmsg+0x267/0xc20
>> [   80.145172]  [<ffffffff815a8a83>] inet_sendmsg+0x63/0xb0
>> [   80.145172]  [<ffffffff8128eb93>] ? selinux_socket_sendmsg+0x23/0x30
>> [   80.145172]  [<ffffffff815218d6>] sock_sendmsg+0xa6/0xd0
>> [   80.145172]  [<ffffffff81524a68>] SYSC_sendto+0x128/0x180
>> [   80.145172]  [<ffffffff8109825c>] ? update_curr+0xec/0x170
>> [   80.145172]  [<ffffffff81041d09>] ? kvm_clock_get_cycles+0x9/0x10
>> [   80.145172]  [<ffffffff810afd1e>] ? __getnstimeofday+0x3e/0xd0
>> [   80.145172]  [<ffffffff8152509e>] SyS_sendto+0xe/0x10
>> [   80.145172]  [<ffffffff8164efd9>] system_call_fastpath+0x16/0x1b
>> [   80.145172] Code: fe ff ff 41 f6 45 2a 06 0f 85 ca fe ff ff 49 8b 7e 08 4c
>> 89 ee e8 94 ef ff ff e9 b9 fe ff ff 48 8b 82 28 05 00 00 e9 01 ff ff ff <0f>
>> 0b 49 8b 54 24 30 0d 00 00 40 00 89 83 14 01 00 00 48 89 53
>> [   80.145172] RIP  [<ffffffff815f3b5d>] fib6_add+0x75d/0x830
>> [   80.145172]  RSP <ffff880118771798>
>> [   80.387413] ---[ end trace 02f20b7a8b81ed95 ]---
>> [   80.390154] Kernel panic - not syncing: Fatal exception in interrupt
>>
>> The relevant code is:
>>
>>      725                 /* For each sibling in the list, increment the
>> counter of
>>      726                  * siblings. BUG() if counters does not match, list
>> of siblings
>>      727                  * is broken!
>>      728                  */
>>      729                 rt6i_nsiblings = 0;
>>      730                 list_for_each_entry_safe(sibling, temp_sibling,
>>      731                                          &rt->rt6i_siblings,
>> rt6i_siblings) {
>>      732                         sibling->rt6i_nsiblings++;
>>      733                         BUG_ON(sibling->rt6i_nsiblings !=
>> rt->rt6i_nsiblings);
>>      734                         rt6i_nsiblings++;
>>      735                 }
>>      736                 BUG_ON(rt6i_nsiblings != rt->rt6i_nsiblings);
>>      737         }
>>
>> Currently, I don't know if I my patch is to blame or something in the ecmp logic.
>>
>> Another thing I just noticed is:
>>
>>     1240         /* Remove this entry from other siblings */
>>     1241         if (rt->rt6i_nsiblings) {
>>     1242                 struct rt6_info *sibling, *next_sibling;
>>     1243
>>     1244                 list_for_each_entry_safe(sibling, next_sibling,
>>     1245                                          &rt->rt6i_siblings,
>> rt6i_siblings)
>>     1246                         sibling->rt6i_nsiblings--;
>>     1247                 rt->rt6i_nsiblings = 0;
>>     1248                 list_del_init(&rt->rt6i_siblings);
>>     1249         }
>>
>> Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
>> start iterating from fn->leaf? But this does not seem to cause it,
>> because my trace does not report any calls to fib6_del_route.
> Note sure to follow you, but all siblings are listed in rt6i_siblings, so it
> must be enough.
>
>>
>> You could try reproduce it by having an interface autoconfigured with
>> a default router with NUD_VALID neighbour. I then added an unused vlan
>> interface (vid 100 in my case) and added the following ip addresses:
>>
>> ip -6 a a 2001:ffff::1/64 dev eth0.100
>> ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31 nexthop
>> via 2001:ffff::32 nexthop via 2001:ffff::33
>>
>> (all nexthops should not be reachable)
>>
>> After starting a ping6 2000::1 the box should panic soon, after the
>> first nexthop entry times out.
>>
>> Perhaps you could give me a hint?
> I will run some tests with your patch. Will see.
I don't reproduce this panic.

Maybe you can try to revert this patch:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/ipv6/route.c?id=52bd4c0c1551daa2efa7bb9e01a2f4ea6d1311bb


Regards,
Nicolas

>
> I assume you didn't reproduce this without your patch.
>
>
> Regards,
> Nicolas
>
>>
>> Thanks a lot,
>>
>>    Hannes
>>
>>
>> [PATCH RFC] ipv6: fix route selection if kernel not compiled with
>> CONFIG_IPV6_ROUTER_PREF
>>
>> This is a follow-up patch to 3630d40067a21d4dfbadc6002bb469ce26ac5d52
>> ("ipv6: rt6_check_neigh should successfully verify neigh if no NUD
>> information are available").
>>
>> Since the removal of rt->n in rt6_info we can end up with a dst ==
>> NULL in rt6_check_neigh. In case the kernel is not compiled with
>> CONFIG_IPV6_ROUTER_PREF we should also select a route with unkown
>> NUD state but we must not avoid doing round robin selection on routes
>> with the same target. So introduce and pass down a boolean ``do_rr'' to
>> indicate when we should update rt->rr_ptr. As soon as no route is valid
>> we do backtracking and do a lookup on a higher level in the fib trie.
>>
>> To hold correct state on the NUD selection we need to create a neighbour
>> entry as soon as we tried to validate a nexthop.
>>
>> I changed the return value of rt6_check_neigh to:
>>     1 in case of the dst entry validated
>>    -1 in case of we had no dst_entry and we need to do rr now
>>    -2 in case a we had a dst_entry and it did not validate
>>
>> In case of CONFIG_IPV6_ROUTER_PREF, rt6_probe does not allocate an
>> neighbour entry (!CONFIG_IPV6_ROUTER_PREF: rt6_probe is a nop). Because
>> of this, we have to create a neighbour entry on nexthop validation to
>> track earlier validation errors. We recheck NUD state here to shortcurcuit
>> NUD_NOARP neighbours.
>>
>> This seems to be the least complex fix for stable and net. I'll introduce
>> a new route lookup flag 'idempotent' as soon as next opens to not let
>> ip route get trigger active NUD validation if CONFIG_IPV6_ROUTER_PREF
>> is enabled. Currently we trigger active NUD validation if compiled with
>> CONFIG_IPV6_ROUTER_PREF.
>>
>> It also seems advantageous to make CONFIG_IPV6_ROUTER_PREF a runtime
>> switch and just select the default operation on compile-time.
>>
>> v2:
>> a) improved rt6_check_neigh logic and documented return values
>>
>> Reported-by: Pierre Emeriaud <petrus.lt@gmail.com>
>> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
>> ---
>>   net/ipv6/route.c | 62 +++++++++++++++++++++++++++++++++++---------------------
>>   1 file changed, 39 insertions(+), 23 deletions(-)
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index bd5fd70..c5d9e68 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -531,28 +531,34 @@ static inline int rt6_check_dev(struct rt6_info *rt, int
>> oif)
>>       return 0;
>>   }
>>
>> -static inline bool rt6_check_neigh(struct rt6_info *rt)
>> +/* This function checks if a neighbour is reachable for routing
>> + * purposes. It returns -2 in case the neighbour should not get
>> + * selected as a viable router, -1 in case it should get selected with
>> + * lowest score and afterwards trying roundrobin. 1 indicates a
>> + * successfull verification.
>> + */
>> +static inline int rt6_check_neigh(struct rt6_info *rt)
>>   {
>>       struct neighbour *neigh;
>> -    bool ret = false;
>> +    int ret = -2;
>>
>>       if (rt->rt6i_flags & RTF_NONEXTHOP ||
>>           !(rt->rt6i_flags & RTF_GATEWAY))
>> -        return true;
>> +        return 1;
>>
>>       rcu_read_lock_bh();
>>       neigh = __ipv6_neigh_lookup_noref(rt->dst.dev, &rt->rt6i_gateway);
>>       if (neigh) {
>>           read_lock(&neigh->lock);
>>           if (neigh->nud_state & NUD_VALID)
>> -            ret = true;
>> +            ret = 1;
>>   #ifdef CONFIG_IPV6_ROUTER_PREF
>>           else if (!(neigh->nud_state & NUD_FAILED))
>> -            ret = true;
>> +            ret = 1;
>>   #endif
>>           read_unlock(&neigh->lock);
>> -    } else if (IS_ENABLED(CONFIG_IPV6_ROUTER_PREF)) {
>> -        ret = true;
>> +    } else {
>> +        ret = IS_ENABLED(CONFIG_IPV6_ROUTER_PREF) ? 1 : -1;
>>       }
>>       rcu_read_unlock_bh();
>>
>> @@ -566,43 +572,52 @@ static int rt6_score_route(struct rt6_info *rt, int oif,
>>
>>       m = rt6_check_dev(rt, oif);
>>       if (!m && (strict & RT6_LOOKUP_F_IFACE))
>> -        return -1;
>> +        return -2;
>>   #ifdef CONFIG_IPV6_ROUTER_PREF
>>       m |= IPV6_DECODE_PREF(IPV6_EXTRACT_PREF(rt->rt6i_flags)) << 2;
>>   #endif
>> -    if (!rt6_check_neigh(rt) && (strict & RT6_LOOKUP_F_REACHABLE))
>> -        return -1;
>> +    if (strict & RT6_LOOKUP_F_REACHABLE) {
>> +        int n = rt6_check_neigh(rt);
>> +        if (n < 0)
>> +            return n;
>> +    }
>>       return m;
>>   }
>>
>>   static struct rt6_info *find_match(struct rt6_info *rt, int oif, int strict,
>> -                   int *mpri, struct rt6_info *match)
>> +                   int *mpri, struct rt6_info *match,
>> +                   bool *do_rr)
>>   {
>>       int m;
>> +    bool match_do_rr = false;
>>
>>       if (rt6_check_expired(rt))
>>           goto out;
>>
>>       m = rt6_score_route(rt, oif, strict);
>> -    if (m < 0)
>> +    if (m == -1 && !IS_ENABLED(CONFIG_IPV6_ROUTER_PREF)) {
>> +        match_do_rr = true;
>> +        m = 0; /* lowest valid score */
>> +    } else if (m < 0) {
>>           goto out;
>> +    }
>> +
>> +    if (strict & RT6_LOOKUP_F_REACHABLE)
>> +        rt6_probe(rt);
>>
>>       if (m > *mpri) {
>> -        if (strict & RT6_LOOKUP_F_REACHABLE)
>> -            rt6_probe(match);
>> +        *do_rr = match_do_rr;
>>           *mpri = m;
>>           match = rt;
>> -    } else if (strict & RT6_LOOKUP_F_REACHABLE) {
>> -        rt6_probe(rt);
>>       }
>> -
>>   out:
>>       return match;
>>   }
>>
>>   static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
>>                        struct rt6_info *rr_head,
>> -                     u32 metric, int oif, int strict)
>> +                     u32 metric, int oif, int strict,
>> +                     bool *do_rr)
>>   {
>>       struct rt6_info *rt, *match;
>>       int mpri = -1;
>> @@ -610,10 +625,10 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
>>       match = NULL;
>>       for (rt = rr_head; rt && rt->rt6i_metric == metric;
>>            rt = rt->dst.rt6_next)
>> -        match = find_match(rt, oif, strict, &mpri, match);
>> +        match = find_match(rt, oif, strict, &mpri, match, do_rr);
>>       for (rt = fn->leaf; rt && rt != rr_head && rt->rt6i_metric == metric;
>>            rt = rt->dst.rt6_next)
>> -        match = find_match(rt, oif, strict, &mpri, match);
>> +        match = find_match(rt, oif, strict, &mpri, match, do_rr);
>>
>>       return match;
>>   }
>> @@ -622,15 +637,16 @@ static struct rt6_info *rt6_select(struct fib6_node *fn,
>> int oif, int strict)
>>   {
>>       struct rt6_info *match, *rt0;
>>       struct net *net;
>> +    bool do_rr = false;
>>
>>       rt0 = fn->rr_ptr;
>>       if (!rt0)
>>           fn->rr_ptr = rt0 = fn->leaf;
>>
>> -    match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict);
>> +    match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict,
>> +                 &do_rr);
>>
>> -    if (!match &&
>> -        (strict & RT6_LOOKUP_F_REACHABLE)) {
>> +    if (do_rr) {
>>           struct rt6_info *next = rt0->dst.rt6_next;
>>
>>           /* no entries matched; do round-robin */
>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10  9:28     ` Nicolas Dichtel
@ 2013-07-10 10:53       ` Hannes Frederic Sowa
  2013-07-10 12:22         ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 10:53 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 11:28:57AM +0200, Nicolas Dichtel wrote:
> Le 10/07/2013 09:54, Nicolas Dichtel a écrit :
> >Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
> >>After starting a ping6 2000::1 the box should panic soon, after the
> >>first nexthop entry times out.
> >>
> >>Perhaps you could give me a hint?
> >I will run some tests with your patch. Will see.
> I don't reproduce this panic.

I just dumped the routes for which it does increase the rt6i_nsiblings
counter in this condition:

                        /* If we have the same destination and the same metric,                                                                                                                                                                                                                                                
                         * but not the same gateway, then the route we try to                                                                                                                                                                                                                                                  
                         * add is sibling to this route, increment our counter                                                                                                                                                                                                                                                 
                         * of siblings, and later we will add our route to the                                                                                                                                                                                                                                                 
                         * list.                                                                                                                                                                                                                                                                                               
                         * Only static routes (which don't have flag                                                                                                                                                                                                                                                           
                         * RTF_EXPIRES) are used for ECMPv6.                                                                                                                                                                                                                                                                   
                         *                                                                                                                                                                                                                                                                                                     
                         * To avoid long list, we only had siblings if the                                                                                                                                                                                                                                                     
                         * route have a gateway.                                                                                                                                                                                                                                                                               
                         */
                        if (rt->rt6i_flags & RTF_GATEWAY &&
                            !(rt->rt6i_flags & RTF_EXPIRES) &&
                            !(iter->rt6i_flags & RTF_EXPIRES))
                                rt->rt6i_nsiblings++;
                                dump_route(iter, "(iter)");
                                dump_route(rt, "(rt)");
			}



Here:

[   42.497470] (iter): ffff88011796cc00 dst 2000::1 plen 128 gateway 2001:db8::32, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801139ddc00 dev ffff880117e83000
[   42.505912] (rt): ffff88011796d800 dst 2000::1 plen 128 gateway fe80::5054:ff:fe82:e153, siblings 1, metric 0, expires 0 gateway 2 idev6 ffff880117edc400 dev ffff8801185cb000
[   42.527241] (iter): ffff88011796d380 dst 2000::1 plen 128 gateway 2001:db8::33, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801139ddc00 dev ffff880117e83000
[   42.536440] (rt): ffff88011796d800 dst 2000::1 plen 128 gateway fe80::5054:ff:fe82:e153, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff880117edc400 dev ffff8801185cb000

From my understanding these two routes should not be aggregated in one ecmp
route set. Am I seeing this correct? (My configuration is like in the mail
before.)

I wonder why the '(rt)' route does not have the expires flag, but it seems we
have to special-case RTF_CACHE routes here which derive from different
levels of the fib6_tree. Does that make sense?

Greetings,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10  7:54   ` Nicolas Dichtel
  2013-07-10  9:28     ` Nicolas Dichtel
@ 2013-07-10 11:15     ` Hannes Frederic Sowa
  2013-07-10 11:40       ` Hannes Frederic Sowa
  2013-07-10 12:08       ` Nicolas Dichtel
  1 sibling, 2 replies; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 11:15 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 09:54:58AM +0200, Nicolas Dichtel wrote:
> Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
> >Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
> >start iterating from fn->leaf? But this does not seem to cause it,
> >because my trace does not report any calls to fib6_del_route.
> Note sure to follow you, but all siblings are listed in rt6i_siblings, so 
> it must be enough.

My hunch was to iterate over fn->leaf->rt_next and compare the metrics like we
do when adding a new route. Then take that rt6_info->rt6i_siblings list_head
to iterate over the remaining siblings. But I did not review that part
carefully, need to check later.

> >You could try reproduce it by having an interface autoconfigured with
> >a default router with NUD_VALID neighbour. I then added an unused vlan
> >interface (vid 100 in my case) and added the following ip addresses:
> >
> >ip -6 a a 2001:ffff::1/64 dev eth0.100
> >ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31 
> >nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
> >
> >(all nexthops should not be reachable)
> >
> >After starting a ping6 2000::1 the box should panic soon, after the
> >first nexthop entry times out.
> >
> >Perhaps you could give me a hint?
> I will run some tests with your patch. Will see.
> 
> I assume you didn't reproduce this without your patch.

Current kernel does not correctly select more specific routes, so these routes
are not even tried and the logic should not be excercised.

Ah, sorry, you should also compile your kernel without
CONFIG_IPV6_ROUTER_PREF, too, if you try to reproduce it.

Thanks for looking into this,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 11:15     ` Hannes Frederic Sowa
@ 2013-07-10 11:40       ` Hannes Frederic Sowa
  2013-07-10 12:08       ` Nicolas Dichtel
  1 sibling, 0 replies; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 11:40 UTC (permalink / raw)
  To: Nicolas Dichtel, netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 01:15:04PM +0200, Hannes Frederic Sowa wrote:
> On Wed, Jul 10, 2013 at 09:54:58AM +0200, Nicolas Dichtel wrote:
> > Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
> > >Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
> > >start iterating from fn->leaf? But this does not seem to cause it,
> > >because my trace does not report any calls to fib6_del_route.
> > Note sure to follow you, but all siblings are listed in rt6i_siblings, so 
> > it must be enough.
> 
> My hunch was to iterate over fn->leaf->rt_next and compare the metrics like we
> do when adding a new route. Then take that rt6_info->rt6i_siblings list_head
> to iterate over the remaining siblings. But I did not review that part
> carefully, need to check later.

Just checked, it is fine.

I was distracted by the way the initial rt6_sibling list was constructed.

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 11:15     ` Hannes Frederic Sowa
  2013-07-10 11:40       ` Hannes Frederic Sowa
@ 2013-07-10 12:08       ` Nicolas Dichtel
  2013-07-10 13:17         ` Hannes Frederic Sowa
  1 sibling, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-10 12:08 UTC (permalink / raw)
  To: hannes; +Cc: netdev, yoshfuji, petrus.lt, davem

Le 10/07/2013 13:15, Hannes Frederic Sowa a écrit :
> On Wed, Jul 10, 2013 at 09:54:58AM +0200, Nicolas Dichtel wrote:
>> Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
>>> Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
>>> start iterating from fn->leaf? But this does not seem to cause it,
>>> because my trace does not report any calls to fib6_del_route.
>> Note sure to follow you, but all siblings are listed in rt6i_siblings, so
>> it must be enough.
>
> My hunch was to iterate over fn->leaf->rt_next and compare the metrics like we
> do when adding a new route. Then take that rt6_info->rt6i_siblings list_head
> to iterate over the remaining siblings. But I did not review that part
> carefully, need to check later.
>
>>> You could try reproduce it by having an interface autoconfigured with
>>> a default router with NUD_VALID neighbour. I then added an unused vlan
>>> interface (vid 100 in my case) and added the following ip addresses:
>>>
>>> ip -6 a a 2001:ffff::1/64 dev eth0.100
>>> ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31
>>> nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
>>>
>>> (all nexthops should not be reachable)
>>>
>>> After starting a ping6 2000::1 the box should panic soon, after the
>>> first nexthop entry times out.
>>>
>>> Perhaps you could give me a hint?
>> I will run some tests with your patch. Will see.
>>
>> I assume you didn't reproduce this without your patch.
>
> Current kernel does not correctly select more specific routes, so these routes
> are not even tried and the logic should not be excercised.
>
> Ah, sorry, you should also compile your kernel without
> CONFIG_IPV6_ROUTER_PREF, too, if you try to reproduce it.
I've done this.

My conf (eth1 autoconfigured, I use net-next + your patch):
vconfig add eth1 100
ifconfig eth1.100 up
ip -6 a a 2001:ffff::1/64 dev eth1.100
ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31 nexthop 
via 2001:ffff::32 nexthop via 2001:ffff::33
ping6 2000::1

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 10:53       ` Hannes Frederic Sowa
@ 2013-07-10 12:22         ` Nicolas Dichtel
  2013-07-10 13:21           ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-10 12:22 UTC (permalink / raw)
  To: hannes; +Cc: netdev, yoshfuji, petrus.lt, davem

Le 10/07/2013 12:53, Hannes Frederic Sowa a écrit :
> On Wed, Jul 10, 2013 at 11:28:57AM +0200, Nicolas Dichtel wrote:
>> Le 10/07/2013 09:54, Nicolas Dichtel a écrit :
>>> Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
>>>> After starting a ping6 2000::1 the box should panic soon, after the
>>>> first nexthop entry times out.
>>>>
>>>> Perhaps you could give me a hint?
>>> I will run some tests with your patch. Will see.
>> I don't reproduce this panic.
>
> I just dumped the routes for which it does increase the rt6i_nsiblings
> counter in this condition:
>
>                          /* If we have the same destination and the same metric,
>                           * but not the same gateway, then the route we try to
>                           * add is sibling to this route, increment our counter
>                           * of siblings, and later we will add our route to the
>                           * list.
>                           * Only static routes (which don't have flag
>                           * RTF_EXPIRES) are used for ECMPv6.
>                           *
>                           * To avoid long list, we only had siblings if the
>                           * route have a gateway.
>                           */
>                          if (rt->rt6i_flags & RTF_GATEWAY &&
>                              !(rt->rt6i_flags & RTF_EXPIRES) &&
>                              !(iter->rt6i_flags & RTF_EXPIRES))
>                                  rt->rt6i_nsiblings++;
>                                  dump_route(iter, "(iter)");
>                                  dump_route(rt, "(rt)");
> 			}
>
>
>
> Here:
>
> [   42.497470] (iter): ffff88011796cc00 dst 2000::1 plen 128 gateway 2001:db8::32, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801139ddc00 dev ffff880117e83000
> [   42.505912] (rt): ffff88011796d800 dst 2000::1 plen 128 gateway fe80::5054:ff:fe82:e153, siblings 1, metric 0, expires 0 gateway 2 idev6 ffff880117edc400 dev ffff8801185cb000
> [   42.527241] (iter): ffff88011796d380 dst 2000::1 plen 128 gateway 2001:db8::33, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801139ddc00 dev ffff880117e83000
> [   42.536440] (rt): ffff88011796d800 dst 2000::1 plen 128 gateway fe80::5054:ff:fe82:e153, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff880117edc400 dev ffff8801185cb000
>
>  From my understanding these two routes should not be aggregated in one ecmp
> route set. Am I seeing this correct? (My configuration is like in the mail
> before.)
Hmm, why?
Routes have the same destination, same metric, are static (expires == 0) and 
have a gateway.

nsiblings counts the number of siblings and does not contains ourself, hence 
both iter should be 1, not 2.

Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 12:08       ` Nicolas Dichtel
@ 2013-07-10 13:17         ` Hannes Frederic Sowa
  2013-07-10 13:49           ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 13:17 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 02:08:42PM +0200, Nicolas Dichtel wrote:
> Le 10/07/2013 13:15, Hannes Frederic Sowa a écrit :
> >On Wed, Jul 10, 2013 at 09:54:58AM +0200, Nicolas Dichtel wrote:
> >>Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
> >>>Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
> >>>start iterating from fn->leaf? But this does not seem to cause it,
> >>>because my trace does not report any calls to fib6_del_route.
> >>Note sure to follow you, but all siblings are listed in rt6i_siblings, so
> >>it must be enough.
> >
> >My hunch was to iterate over fn->leaf->rt_next and compare the metrics 
> >like we
> >do when adding a new route. Then take that rt6_info->rt6i_siblings 
> >list_head
> >to iterate over the remaining siblings. But I did not review that part
> >carefully, need to check later.
> >
> >>>You could try reproduce it by having an interface autoconfigured with
> >>>a default router with NUD_VALID neighbour. I then added an unused vlan
> >>>interface (vid 100 in my case) and added the following ip addresses:
> >>>
> >>>ip -6 a a 2001:ffff::1/64 dev eth0.100
> >>>ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31
> >>>nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
> >>>
> >>>(all nexthops should not be reachable)
> >>>
> >>>After starting a ping6 2000::1 the box should panic soon, after the
> >>>first nexthop entry times out.
> >>>
> >>>Perhaps you could give me a hint?
> >>I will run some tests with your patch. Will see.
> >>
> >>I assume you didn't reproduce this without your patch.
> >
> >Current kernel does not correctly select more specific routes, so these 
> >routes
> >are not even tried and the logic should not be excercised.
> >
> >Ah, sorry, you should also compile your kernel without
> >CONFIG_IPV6_ROUTER_PREF, too, if you try to reproduce it.
> I've done this.
> 
> My conf (eth1 autoconfigured, I use net-next + your patch):
> vconfig add eth1 100
> ifconfig eth1.100 up
> ip -6 a a 2001:ffff::1/64 dev eth1.100
> ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31 
> nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
> ping6 2000::1

Hm, I see. I suspect something with timing. I, too, use a net-next and have
one function dump_route added and sprinkeld it at some points.

When I copy&pasted your calls I could not reproduce it. After a reboot when
just applying the commands from my history (which I did a lot faster), I got
the panic again.

I'll remove the dump_routes and recheck later.

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 12:22         ` Nicolas Dichtel
@ 2013-07-10 13:21           ` Hannes Frederic Sowa
  2013-07-10 14:10             ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 13:21 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 02:22:55PM +0200, Nicolas Dichtel wrote:
> Le 10/07/2013 12:53, Hannes Frederic Sowa a écrit :
> >On Wed, Jul 10, 2013 at 11:28:57AM +0200, Nicolas Dichtel wrote:
> >>Le 10/07/2013 09:54, Nicolas Dichtel a écrit :
> >>>Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
> >>>>After starting a ping6 2000::1 the box should panic soon, after the
> >>>>first nexthop entry times out.
> >>>>
> >>>>Perhaps you could give me a hint?
> >>>I will run some tests with your patch. Will see.
> >>I don't reproduce this panic.
> >
> >I just dumped the routes for which it does increase the rt6i_nsiblings
> >counter in this condition:
> >
> >                         /* If we have the same destination and the same 
> >                         metric,
> >                          * but not the same gateway, then the route we 
> >                          try to
> >                          * add is sibling to this route, increment our 
> >                          counter
> >                          * of siblings, and later we will add our route 
> >                          to the
> >                          * list.
> >                          * Only static routes (which don't have flag
> >                          * RTF_EXPIRES) are used for ECMPv6.
> >                          *
> >                          * To avoid long list, we only had siblings if the
> >                          * route have a gateway.
> >                          */
> >                         if (rt->rt6i_flags & RTF_GATEWAY &&
> >                             !(rt->rt6i_flags & RTF_EXPIRES) &&
> >                             !(iter->rt6i_flags & RTF_EXPIRES))
> >                                 rt->rt6i_nsiblings++;
> >                                 dump_route(iter, "(iter)");
> >                                 dump_route(rt, "(rt)");
> >			}
> >
> >
> >
> >Here:
> >
> >[   42.497470] (iter): ffff88011796cc00 dst 2000::1 plen 128 gateway 
> >2001:db8::32, siblings 2, metric 0, expires 0 gateway 2 idev6 
> >ffff8801139ddc00 dev ffff880117e83000
> >[   42.505912] (rt): ffff88011796d800 dst 2000::1 plen 128 gateway 
> >fe80::5054:ff:fe82:e153, siblings 1, metric 0, expires 0 gateway 2 idev6 
> >ffff880117edc400 dev ffff8801185cb000
> >[   42.527241] (iter): ffff88011796d380 dst 2000::1 plen 128 gateway 
> >2001:db8::33, siblings 2, metric 0, expires 0 gateway 2 idev6 
> >ffff8801139ddc00 dev ffff880117e83000
> >[   42.536440] (rt): ffff88011796d800 dst 2000::1 plen 128 gateway 
> >fe80::5054:ff:fe82:e153, siblings 2, metric 0, expires 0 gateway 2 idev6 
> >ffff880117edc400 dev ffff8801185cb000
> >
> > From my understanding these two routes should not be aggregated in one 
> > ecmp
> >route set. Am I seeing this correct? (My configuration is like in the mail
> >before.)
> Hmm, why?
> Routes have the same destination, same metric, are static (expires == 0) 
> and have a gateway.

The route with rt6i_gateway does actually expire because I got it from
autoconf and ip -6 r l confirms this, too. It seems this is only the cached
route (I will confirm shortly). Is this still ok?

> nsiblings counts the number of siblings and does not contains ourself, 
> hence both iter should be 1, not 2.

Ok.

Thanks for helping,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 13:17         ` Hannes Frederic Sowa
@ 2013-07-10 13:49           ` Hannes Frederic Sowa
  2013-07-10 14:30             ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 13:49 UTC (permalink / raw)
  To: Nicolas Dichtel, netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 03:17:41PM +0200, Hannes Frederic Sowa wrote:
> On Wed, Jul 10, 2013 at 02:08:42PM +0200, Nicolas Dichtel wrote:
> > Le 10/07/2013 13:15, Hannes Frederic Sowa a écrit :
> > >On Wed, Jul 10, 2013 at 09:54:58AM +0200, Nicolas Dichtel wrote:
> > >>Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
> > >>>Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
> > >>>start iterating from fn->leaf? But this does not seem to cause it,
> > >>>because my trace does not report any calls to fib6_del_route.
> > >>Note sure to follow you, but all siblings are listed in rt6i_siblings, so
> > >>it must be enough.
> > >
> > >My hunch was to iterate over fn->leaf->rt_next and compare the metrics 
> > >like we
> > >do when adding a new route. Then take that rt6_info->rt6i_siblings 
> > >list_head
> > >to iterate over the remaining siblings. But I did not review that part
> > >carefully, need to check later.
> > >
> > >>>You could try reproduce it by having an interface autoconfigured with
> > >>>a default router with NUD_VALID neighbour. I then added an unused vlan
> > >>>interface (vid 100 in my case) and added the following ip addresses:
> > >>>
> > >>>ip -6 a a 2001:ffff::1/64 dev eth0.100
> > >>>ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31
> > >>>nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
> > >>>
> > >>>(all nexthops should not be reachable)
> > >>>
> > >>>After starting a ping6 2000::1 the box should panic soon, after the
> > >>>first nexthop entry times out.
> > >>>
> > >>>Perhaps you could give me a hint?
> > >>I will run some tests with your patch. Will see.
> > >>
> > >>I assume you didn't reproduce this without your patch.
> > >
> > >Current kernel does not correctly select more specific routes, so these 
> > >routes
> > >are not even tried and the logic should not be excercised.
> > >
> > >Ah, sorry, you should also compile your kernel without
> > >CONFIG_IPV6_ROUTER_PREF, too, if you try to reproduce it.
> > I've done this.
> > 
> > My conf (eth1 autoconfigured, I use net-next + your patch):
> > vconfig add eth1 100
> > ifconfig eth1.100 up
> > ip -6 a a 2001:ffff::1/64 dev eth1.100
> > ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31 
> > nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
> > ping6 2000::1
> 
> Hm, I see. I suspect something with timing. I, too, use a net-next and have
> one function dump_route added and sprinkeld it at some points.
> 
> When I copy&pasted your calls I could not reproduce it. After a reboot when
> just applying the commands from my history (which I did a lot faster), I got
> the panic again.
> 
> I'll remove the dump_routes and recheck later.

This patch ontop

--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -46,6 +46,16 @@
 #define RT6_TRACE(x...) do { ; } while (0)
 #endif
 
+static void dump_route(struct rt6_info *rt, const char *prefix)
+{
+       u32 f = rt->rt6i_flags;
+       struct rt6key *k = &rt->rt6i_dst;
+       printk(KERN_INFO "%s: %p dst %pI6c plen %d gateway %pI6c, siblings %d, metric %d, expires %d gateway %d idev6 %p dev %p\n", prefix,
+              rt, &k->addr, k->plen, &rt->rt6i_gateway, rt->rt6i_nsiblings, rt->rt6i_metric, f&RTF_EXPIRES, f&RTF_GATEWAY, rt->rt6i_idev, rt->dst.dev);
+}
+
+
+
 static struct kmem_cache * fib6_node_kmem __read_mostly;
 
 enum fib_walk_state_t
@@ -693,8 +703,11 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
                         */
                        if (rt->rt6i_flags & RTF_GATEWAY &&
                            !(rt->rt6i_flags & RTF_EXPIRES) &&
-                           !(iter->rt6i_flags & RTF_EXPIRES))
+                           !(iter->rt6i_flags & RTF_EXPIRES)) {
                                rt->rt6i_nsiblings++;
+                               dump_route(rt, "(rt)");
+                               dump_route(iter, "(iter)");
+                       }
                }
 
                if (iter->rt6i_metric > rt->rt6i_metric)
@@ -718,6 +731,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
                        if (sibling->rt6i_metric == rt->rt6i_metric) {
                                list_add_tail(&rt->rt6i_siblings,
                                              &sibling->rt6i_siblings);
+                               dump_route(sibling, "(sibling)");
                                break;
                        }
                        sibling = sibling->dst.rt6_next;
@@ -730,6 +744,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
                list_for_each_entry_safe(sibling, temp_sibling,
                                         &rt->rt6i_siblings, rt6i_siblings) {
                        sibling->rt6i_nsiblings++;
+                       dump_route(sibling, "(sibling increment)");
                        BUG_ON(sibling->rt6i_nsiblings != rt->rt6i_nsiblings);
                        rt6i_nsiblings++;
                }

produces this panic:

[   59.234779] (rt): ffff880113242000 dst 2000::1 plen 128 gateway 2001:ffff::33, siblings 1, metric 0, expires 0 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
[   59.243794] (iter): ffff880117e7b680 dst 2000::1 plen 128 gateway 2001:ffff::31, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
[   59.261383] (rt): ffff880113242000 dst 2000::1 plen 128 gateway 2001:ffff::33, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
[   59.270030] (iter): ffff880117e7bb00 dst 2000::1 plen 128 gateway 2001:ffff::32, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
[   59.291933] (sibling): ffff880117e62480 dst 2000::1 plen 128 gateway 2001:ffff::30, siblings 2, metric 0, expires 4194304 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
[   59.306893] (sibling increment): ffff880117e62480 dst 2000::1 plen 128 gateway 2001:ffff::30, siblings 3, metric 0, expires 4194304 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
[   59.318840] ------------[ cut here ]------------
[   59.319780] kernel BUG at net/ipv6/ip6_fib.c:748!
[   59.319780] invalid opcode: 0000 [#1] SMP 
[   59.319780] Modules linked in: 8021q nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc virtio_balloon snd_timer virtio_net snd i2c_piix4 soundcore i2c_core virtio_blk
[   59.319780] CPU: 0 PID: 784 Comm: ping6 Not tainted 3.10.0+ #154
[   59.319780] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   59.319780] task: ffff880117e98000 ti: ffff8801131e4000 task.ti: ffff8801131e4000
[   59.319780] RIP: 0010:[<ffffffff815f3ca6>]  [<ffffffff815f3ca6>] fib6_add+0x826/0x900
[   59.319780] RSP: 0018:ffff8801131e5788  EFLAGS: 00010202
[   59.319780] RAX: 00000000000000ad RBX: ffff880117e7b680 RCX: ffff88011fc0faa8
[   59.319780] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000246
[   59.319780] RBP: ffff8801131e5848 R08: ffffffff81ce7f00 R09: 0000000000000244
[   59.319780] R10: 0000000000000000 R11: 0000000000000243 R12: ffff880117e62480
[   59.319780] R13: ffff880117e7bb90 R14: 0000000000000000 R15: ffff880113242000
[   59.319780] FS:  00007f83b6958740(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[   59.319780] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   59.319780] CR2: 00007f83b59e5580 CR3: 00000001187d1000 CR4: 00000000000006f0
[   59.319780] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   59.319780] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   59.319780] Stack:
[   59.319780]  0000000000000000 0000000100000000 ffff880113035840 ffff880113035840
[   59.319780]  ffff8801131e5890 0000000000000000 ffff880119b80a00 0000000000000000
[   59.319780]  ffff8801131e5818 ffffffff81180ca3 ffff8801131e58e8 ffffffff8109bf04
[   59.319780] Call Trace:
[   59.319780]  [<ffffffff81180ca3>] ? kmem_cache_alloc+0x1a3/0x1f0
[   59.319780]  [<ffffffff8109bf04>] ? load_balance+0xf4/0x7f0
[   59.319780]  [<ffffffff815eeceb>] ? rt6_bind_peer+0x4b/0x90
[   59.319780]  [<ffffffff815ed985>] __ip6_ins_rt+0x45/0x70
[   59.319780]  [<ffffffff815eee35>] ip6_ins_rt+0x35/0x40
[   59.319780]  [<ffffffff815ef1e4>] ip6_pol_route.isra.44+0x3a4/0x4b0
[   59.319780]  [<ffffffff815ef34a>] ip6_pol_route_output+0x2a/0x30
[   59.319780]  [<ffffffff816161c7>] fib6_rule_action+0xd7/0x210
[   59.319780]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
[   59.319780]  [<ffffffff8106cd8f>] ? try_to_del_timer_sync+0x4f/0x70
[   59.319780]  [<ffffffff81553026>] fib_rules_lookup+0xc6/0x140
[   59.319780]  [<ffffffff816164c4>] fib6_rule_lookup+0x44/0x80
[   59.319780]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
[   59.319780]  [<ffffffff815edea3>] ip6_route_output+0x73/0xb0
[   59.319780]  [<ffffffff815dfdf3>] ip6_dst_lookup_tail+0x2c3/0x2e0
[   59.319780]  [<ffffffff81095838>] ? __enqueue_entity+0x78/0x80
[   59.319780]  [<ffffffff815dfe4d>] ip6_dst_lookup_flow+0x3d/0xa0
[   59.319780]  [<ffffffff815fdbc7>] rawv6_sendmsg+0x267/0xc20
[   59.319780]  [<ffffffff815a8a83>] inet_sendmsg+0x63/0xb0
[   59.319780]  [<ffffffff8128eb93>] ? selinux_socket_sendmsg+0x23/0x30
[   59.319780]  [<ffffffff815218d6>] sock_sendmsg+0xa6/0xd0
[   59.319780]  [<ffffffff81524a68>] SYSC_sendto+0x128/0x180
[   59.319780]  [<ffffffff8109825c>] ? update_curr+0xec/0x170
[   59.319780]  [<ffffffff81041d09>] ? kvm_clock_get_cycles+0x9/0x10
[   59.319780]  [<ffffffff810afd1e>] ? __getnstimeofday+0x3e/0xd0
[   59.319780]  [<ffffffff8152509e>] SyS_sendto+0xe/0x10
[   59.319780]  [<ffffffff8164f159>] system_call_fastpath+0x16/0x1b
[   59.319780] Code: 06 0f 85 c6 fe ff ff 48 8b 95 60 ff ff ff 48 89 c6 48 8b 7a 08 e8 cb ee ff ff e9 ae fe ff ff 48 8b 82 28 05 00 00 e9 fc fe ff ff <0f> 0b 49 8b 57 30 0d 00 00 40 00 bb ef ff ff ff 41 89 86 14 01 
[   59.319780] RIP  [<ffffffff815f3ca6>] fib6_add+0x826/0x900
[   59.319780]  RSP <ffff8801131e5788>
[   59.506210] ---[ end trace 3ade307f40880be9 ]---
[   59.507503] Kernel panic - not syncing: Fatal exception in interrupt

I can also reproduce it without this debugging diff:

git log --oneline HEAD^^^..
155b81f ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
c7e8e8a bridge: fix some kernel warning in multicast timer
734d4e1 sfc: Fix memory leak when discarding scattered packets

(net-next a few days old)

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 13:21           ` Hannes Frederic Sowa
@ 2013-07-10 14:10             ` Nicolas Dichtel
  2013-07-10 15:20               ` Hannes Frederic Sowa
  2013-07-10 21:21               ` Hannes Frederic Sowa
  0 siblings, 2 replies; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-10 14:10 UTC (permalink / raw)
  Cc: netdev, yoshfuji, petrus.lt, davem, hannes

Le 10/07/2013 15:21, Hannes Frederic Sowa a écrit :
> On Wed, Jul 10, 2013 at 02:22:55PM +0200, Nicolas Dichtel wrote:
>> Le 10/07/2013 12:53, Hannes Frederic Sowa a écrit :
>>> On Wed, Jul 10, 2013 at 11:28:57AM +0200, Nicolas Dichtel wrote:
>>>> Le 10/07/2013 09:54, Nicolas Dichtel a écrit :
>>>>> Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
>>>>>> After starting a ping6 2000::1 the box should panic soon, after the
>>>>>> first nexthop entry times out.
>>>>>>
>>>>>> Perhaps you could give me a hint?
>>>>> I will run some tests with your patch. Will see.
>>>> I don't reproduce this panic.
>>>
>>> I just dumped the routes for which it does increase the rt6i_nsiblings
>>> counter in this condition:
>>>
>>>                          /* If we have the same destination and the same
>>>                          metric,
>>>                           * but not the same gateway, then the route we
>>>                           try to
>>>                           * add is sibling to this route, increment our
>>>                           counter
>>>                           * of siblings, and later we will add our route
>>>                           to the
>>>                           * list.
>>>                           * Only static routes (which don't have flag
>>>                           * RTF_EXPIRES) are used for ECMPv6.
>>>                           *
>>>                           * To avoid long list, we only had siblings if the
>>>                           * route have a gateway.
>>>                           */
>>>                          if (rt->rt6i_flags & RTF_GATEWAY &&
>>>                              !(rt->rt6i_flags & RTF_EXPIRES) &&
>>>                              !(iter->rt6i_flags & RTF_EXPIRES))
>>>                                  rt->rt6i_nsiblings++;
>>>                                  dump_route(iter, "(iter)");
>>>                                  dump_route(rt, "(rt)");
>>> 			}
>>>
>>>
>>>
>>> Here:
>>>
>>> [   42.497470] (iter): ffff88011796cc00 dst 2000::1 plen 128 gateway
>>> 2001:db8::32, siblings 2, metric 0, expires 0 gateway 2 idev6
>>> ffff8801139ddc00 dev ffff880117e83000
>>> [   42.505912] (rt): ffff88011796d800 dst 2000::1 plen 128 gateway
>>> fe80::5054:ff:fe82:e153, siblings 1, metric 0, expires 0 gateway 2 idev6
>>> ffff880117edc400 dev ffff8801185cb000
>>> [   42.527241] (iter): ffff88011796d380 dst 2000::1 plen 128 gateway
>>> 2001:db8::33, siblings 2, metric 0, expires 0 gateway 2 idev6
>>> ffff8801139ddc00 dev ffff880117e83000
>>> [   42.536440] (rt): ffff88011796d800 dst 2000::1 plen 128 gateway
>>> fe80::5054:ff:fe82:e153, siblings 2, metric 0, expires 0 gateway 2 idev6
>>> ffff880117edc400 dev ffff8801185cb000
>>>
>>>  From my understanding these two routes should not be aggregated in one
>>> ecmp
>>> route set. Am I seeing this correct? (My configuration is like in the mail
>>> before.)
>> Hmm, why?
>> Routes have the same destination, same metric, are static (expires == 0)
>> and have a gateway.
>
> The route with rt6i_gateway does actually expire because I got it from
> autoconf and ip -6 r l confirms this, too. It seems this is only the cached
> route (I will confirm shortly). Is this still ok?
I wonder why expires is 0. Even if this route is cached, the flag RTF_EXPIRES 
should be set. Am I wrong?


Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 13:49           ` Hannes Frederic Sowa
@ 2013-07-10 14:30             ` Nicolas Dichtel
  2013-07-10 14:34               ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-10 14:30 UTC (permalink / raw)
  To: hannes; +Cc: netdev, yoshfuji, petrus.lt, davem

Le 10/07/2013 15:49, Hannes Frederic Sowa a écrit :
> On Wed, Jul 10, 2013 at 03:17:41PM +0200, Hannes Frederic Sowa wrote:
>> On Wed, Jul 10, 2013 at 02:08:42PM +0200, Nicolas Dichtel wrote:
>>> Le 10/07/2013 13:15, Hannes Frederic Sowa a écrit :
>>>> On Wed, Jul 10, 2013 at 09:54:58AM +0200, Nicolas Dichtel wrote:
>>>>> Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
>>>>>> Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
>>>>>> start iterating from fn->leaf? But this does not seem to cause it,
>>>>>> because my trace does not report any calls to fib6_del_route.
>>>>> Note sure to follow you, but all siblings are listed in rt6i_siblings, so
>>>>> it must be enough.
>>>>
>>>> My hunch was to iterate over fn->leaf->rt_next and compare the metrics
>>>> like we
>>>> do when adding a new route. Then take that rt6_info->rt6i_siblings
>>>> list_head
>>>> to iterate over the remaining siblings. But I did not review that part
>>>> carefully, need to check later.
>>>>
>>>>>> You could try reproduce it by having an interface autoconfigured with
>>>>>> a default router with NUD_VALID neighbour. I then added an unused vlan
>>>>>> interface (vid 100 in my case) and added the following ip addresses:
>>>>>>
>>>>>> ip -6 a a 2001:ffff::1/64 dev eth0.100
>>>>>> ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31
>>>>>> nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
>>>>>>
>>>>>> (all nexthops should not be reachable)
>>>>>>
>>>>>> After starting a ping6 2000::1 the box should panic soon, after the
>>>>>> first nexthop entry times out.
>>>>>>
>>>>>> Perhaps you could give me a hint?
>>>>> I will run some tests with your patch. Will see.
>>>>>
>>>>> I assume you didn't reproduce this without your patch.
>>>>
>>>> Current kernel does not correctly select more specific routes, so these
>>>> routes
>>>> are not even tried and the logic should not be excercised.
>>>>
>>>> Ah, sorry, you should also compile your kernel without
>>>> CONFIG_IPV6_ROUTER_PREF, too, if you try to reproduce it.
>>> I've done this.
>>>
>>> My conf (eth1 autoconfigured, I use net-next + your patch):
>>> vconfig add eth1 100
>>> ifconfig eth1.100 up
>>> ip -6 a a 2001:ffff::1/64 dev eth1.100
>>> ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31
>>> nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
>>> ping6 2000::1
>>
>> Hm, I see. I suspect something with timing. I, too, use a net-next and have
>> one function dump_route added and sprinkeld it at some points.
>>
>> When I copy&pasted your calls I could not reproduce it. After a reboot when
>> just applying the commands from my history (which I did a lot faster), I got
>> the panic again.
>>
>> I'll remove the dump_routes and recheck later.
>
> This patch ontop
>
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -46,6 +46,16 @@
>   #define RT6_TRACE(x...) do { ; } while (0)
>   #endif
>
> +static void dump_route(struct rt6_info *rt, const char *prefix)
> +{
> +       u32 f = rt->rt6i_flags;
> +       struct rt6key *k = &rt->rt6i_dst;
> +       printk(KERN_INFO "%s: %p dst %pI6c plen %d gateway %pI6c, siblings %d, metric %d, expires %d gateway %d idev6 %p dev %p\n", prefix,
> +              rt, &k->addr, k->plen, &rt->rt6i_gateway, rt->rt6i_nsiblings, rt->rt6i_metric, f&RTF_EXPIRES, f&RTF_GATEWAY, rt->rt6i_idev, rt->dst.dev);
> +}
> +
> +
> +
>   static struct kmem_cache * fib6_node_kmem __read_mostly;
>
>   enum fib_walk_state_t
> @@ -693,8 +703,11 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>                           */
>                          if (rt->rt6i_flags & RTF_GATEWAY &&
>                              !(rt->rt6i_flags & RTF_EXPIRES) &&
> -                           !(iter->rt6i_flags & RTF_EXPIRES))
> +                           !(iter->rt6i_flags & RTF_EXPIRES)) {
>                                  rt->rt6i_nsiblings++;
> +                               dump_route(rt, "(rt)");
> +                               dump_route(iter, "(iter)");
> +                       }
>                  }
>
>                  if (iter->rt6i_metric > rt->rt6i_metric)
> @@ -718,6 +731,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>                          if (sibling->rt6i_metric == rt->rt6i_metric) {
>                                  list_add_tail(&rt->rt6i_siblings,
>                                                &sibling->rt6i_siblings);
> +                               dump_route(sibling, "(sibling)");
>                                  break;
>                          }
>                          sibling = sibling->dst.rt6_next;
> @@ -730,6 +744,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>                  list_for_each_entry_safe(sibling, temp_sibling,
>                                           &rt->rt6i_siblings, rt6i_siblings) {
>                          sibling->rt6i_nsiblings++;
> +                       dump_route(sibling, "(sibling increment)");
>                          BUG_ON(sibling->rt6i_nsiblings != rt->rt6i_nsiblings);
>                          rt6i_nsiblings++;
>                  }
>
> produces this panic:
>
> [   59.234779] (rt): ffff880113242000 dst 2000::1 plen 128 gateway 2001:ffff::33, siblings 1, metric 0, expires 0 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
> [   59.243794] (iter): ffff880117e7b680 dst 2000::1 plen 128 gateway 2001:ffff::31, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
> [   59.261383] (rt): ffff880113242000 dst 2000::1 plen 128 gateway 2001:ffff::33, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
> [   59.270030] (iter): ffff880117e7bb00 dst 2000::1 plen 128 gateway 2001:ffff::32, siblings 2, metric 0, expires 0 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
> [   59.291933] (sibling): ffff880117e62480 dst 2000::1 plen 128 gateway 2001:ffff::30, siblings 2, metric 0, expires 4194304 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
> [   59.306893] (sibling increment): ffff880117e62480 dst 2000::1 plen 128 gateway 2001:ffff::30, siblings 3, metric 0, expires 4194304 gateway 2 idev6 ffff8801131ab000 dev ffff88011816d000
I don't have the same output:
[   97.945170] (rt): f1a02d80 dst 2000:: plen 3 gateway 2001:660:1234:5678::31, 
siblings 1, metric 1024, expires 0 gateway 2 idev6 f7507e00 dev f7793000
[   97.948117] (iter): f1a02e80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::30, siblings 0, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.951207] (sibling): f1a02e80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::30, siblings 0, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.954272] (sibling increment): f1a02e80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::30, siblings 1, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.957545] (rt): f1a02c80 dst 2000:: plen 3 gateway 2001:660:1234:5678::32, 
siblings 1, metric 1024, expires 0 gateway 2 idev6 f7507e00 dev f7793000
[   97.960376] (iter): f1a02e80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::30, siblings 1, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.961902] (rt): f1a02c80 dst 2000:: plen 3 gateway 2001:660:1234:5678::32, 
siblings 2, metric 1024, expires 0 gateway 2 idev6 f7507e00 dev f7793000
[   97.963095] (iter): f1a02d80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::31, siblings 1, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.964354] (sibling): f1a02e80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::30, siblings 1, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.965604] (sibling increment): f1a02e80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::30, siblings 2, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.966916] (sibling increment): f1a02d80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::31, siblings 2, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.968254] (rt): f1a02b80 dst 2000:: plen 3 gateway 2001:660:1234:5678::33, 
siblings 1, metric 1024, expires 0 gateway 2 idev6 f7507e00 dev f7793000
[   97.969467] (iter): f1a02e80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::30, siblings 2, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.970702] (rt): f1a02b80 dst 2000:: plen 3 gateway 2001:660:1234:5678::33, 
siblings 2, metric 1024, expires 0 gateway 2 idev6 f7507e00 dev f7793000
[   97.971895] (iter): f1a02d80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::31, siblings 2, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.973137] (rt): f1a02b80 dst 2000:: plen 3 gateway 2001:660:1234:5678::33, 
siblings 3, metric 1024, expires 0 gateway 2 idev6 f7507e00 dev f7793000
[   97.974331] (iter): f1a02c80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::32, siblings 2, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.975542] (sibling): f1a02e80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::30, siblings 2, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.976808] (sibling increment): f1a02e80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::30, siblings 3, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.978126] (sibling increment): f1a02d80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::31, siblings 3, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000
[   97.979453] (sibling increment): f1a02c80 dst 2000:: plen 3 gateway 
2001:660:1234:5678::32, siblings 3, metric 1024, expires 0 gateway 2 idev6 
f7507e00 dev f7793000

Can you send me the output of:
ip -6 r
ip -6 a

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 14:30             ` Nicolas Dichtel
@ 2013-07-10 14:34               ` Hannes Frederic Sowa
  0 siblings, 0 replies; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 14:34 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 04:30:35PM +0200, Nicolas Dichtel wrote:
> Le 10/07/2013 15:49, Hannes Frederic Sowa a écrit :
> >On Wed, Jul 10, 2013 at 03:17:41PM +0200, Hannes Frederic Sowa wrote:
> >>On Wed, Jul 10, 2013 at 02:08:42PM +0200, Nicolas Dichtel wrote:
> >>>Le 10/07/2013 13:15, Hannes Frederic Sowa a écrit :
> >>>>On Wed, Jul 10, 2013 at 09:54:58AM +0200, Nicolas Dichtel wrote:
> >>>>>Le 09/07/2013 23:57, Hannes Frederic Sowa a écrit :
> >>>>>>Are we sure we decrement all sibling's rt6i_nsiblings? Shouldn't we
> >>>>>>start iterating from fn->leaf? But this does not seem to cause it,
> >>>>>>because my trace does not report any calls to fib6_del_route.
> >>>>>Note sure to follow you, but all siblings are listed in rt6i_siblings, 
> >>>>>so
> >>>>>it must be enough.
> >>>>
> >>>>My hunch was to iterate over fn->leaf->rt_next and compare the metrics
> >>>>like we
> >>>>do when adding a new route. Then take that rt6_info->rt6i_siblings
> >>>>list_head
> >>>>to iterate over the remaining siblings. But I did not review that part
> >>>>carefully, need to check later.
> >>>>
> >>>>>>You could try reproduce it by having an interface autoconfigured with
> >>>>>>a default router with NUD_VALID neighbour. I then added an unused vlan
> >>>>>>interface (vid 100 in my case) and added the following ip addresses:
> >>>>>>
> >>>>>>ip -6 a a 2001:ffff::1/64 dev eth0.100
> >>>>>>ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31
> >>>>>>nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
> >>>>>>
> >>>>>>(all nexthops should not be reachable)
> >>>>>>
> >>>>>>After starting a ping6 2000::1 the box should panic soon, after the
> >>>>>>first nexthop entry times out.
> >>>>>>
> >>>>>>Perhaps you could give me a hint?
> >>>>>I will run some tests with your patch. Will see.
> >>>>>
> >>>>>I assume you didn't reproduce this without your patch.
> >>>>
> >>>>Current kernel does not correctly select more specific routes, so these
> >>>>routes
> >>>>are not even tried and the logic should not be excercised.
> >>>>
> >>>>Ah, sorry, you should also compile your kernel without
> >>>>CONFIG_IPV6_ROUTER_PREF, too, if you try to reproduce it.
> >>>I've done this.
> >>>
> >>>My conf (eth1 autoconfigured, I use net-next + your patch):
> >>>vconfig add eth1 100
> >>>ifconfig eth1.100 up
> >>>ip -6 a a 2001:ffff::1/64 dev eth1.100
> >>>ip -6 r a 2000::/3 nexthop via 2001:ffff::30 nexthop via 2001:ffff::31
> >>>nexthop via 2001:ffff::32 nexthop via 2001:ffff::33
> >>>ping6 2000::1
> >>
> >>Hm, I see. I suspect something with timing. I, too, use a net-next and 
> >>have
> >>one function dump_route added and sprinkeld it at some points.
> >>
> >>When I copy&pasted your calls I could not reproduce it. After a reboot 
> >>when
> >>just applying the commands from my history (which I did a lot faster), I 
> >>got
> >>the panic again.
> >>
> >>I'll remove the dump_routes and recheck later.
> >
> >This patch ontop
> >
> >--- a/net/ipv6/ip6_fib.c
> >+++ b/net/ipv6/ip6_fib.c
> >@@ -46,6 +46,16 @@
> >  #define RT6_TRACE(x...) do { ; } while (0)
> >  #endif
> >
> >+static void dump_route(struct rt6_info *rt, const char *prefix)
> >+{
> >+       u32 f = rt->rt6i_flags;
> >+       struct rt6key *k = &rt->rt6i_dst;
> >+       printk(KERN_INFO "%s: %p dst %pI6c plen %d gateway %pI6c, siblings 
> >%d, metric %d, expires %d gateway %d idev6 %p dev %p\n", prefix,
> >+              rt, &k->addr, k->plen, &rt->rt6i_gateway, 
> >rt->rt6i_nsiblings, rt->rt6i_metric, f&RTF_EXPIRES, f&RTF_GATEWAY, 
> >rt->rt6i_idev, rt->dst.dev);
> >+}
> >+
> >+
> >+
> >  static struct kmem_cache * fib6_node_kmem __read_mostly;
> >
> >  enum fib_walk_state_t
> >@@ -693,8 +703,11 @@ static int fib6_add_rt2node(struct fib6_node *fn, 
> >struct rt6_info *rt,
> >                          */
> >                         if (rt->rt6i_flags & RTF_GATEWAY &&
> >                             !(rt->rt6i_flags & RTF_EXPIRES) &&
> >-                           !(iter->rt6i_flags & RTF_EXPIRES))
> >+                           !(iter->rt6i_flags & RTF_EXPIRES)) {
> >                                 rt->rt6i_nsiblings++;
> >+                               dump_route(rt, "(rt)");
> >+                               dump_route(iter, "(iter)");
> >+                       }
> >                 }
> >
> >                 if (iter->rt6i_metric > rt->rt6i_metric)
> >@@ -718,6 +731,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, 
> >struct rt6_info *rt,
> >                         if (sibling->rt6i_metric == rt->rt6i_metric) {
> >                                 list_add_tail(&rt->rt6i_siblings,
> >                                               &sibling->rt6i_siblings);
> >+                               dump_route(sibling, "(sibling)");
> >                                 break;
> >                         }
> >                         sibling = sibling->dst.rt6_next;
> >@@ -730,6 +744,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, 
> >struct rt6_info *rt,
> >                 list_for_each_entry_safe(sibling, temp_sibling,
> >                                          &rt->rt6i_siblings, 
> >                                          rt6i_siblings) {
> >                         sibling->rt6i_nsiblings++;
> >+                       dump_route(sibling, "(sibling increment)");
> >                         BUG_ON(sibling->rt6i_nsiblings != 
> >                         rt->rt6i_nsiblings);
> >                         rt6i_nsiblings++;
> >                 }
> >
> >produces this panic:
> >
> >[   59.234779] (rt): ffff880113242000 dst 2000::1 plen 128 gateway 
> >2001:ffff::33, siblings 1, metric 0, expires 0 gateway 2 idev6 
> >ffff8801131ab000 dev ffff88011816d000
> >[   59.243794] (iter): ffff880117e7b680 dst 2000::1 plen 128 gateway 
> >2001:ffff::31, siblings 2, metric 0, expires 0 gateway 2 idev6 
> >ffff8801131ab000 dev ffff88011816d000
> >[   59.261383] (rt): ffff880113242000 dst 2000::1 plen 128 gateway 
> >2001:ffff::33, siblings 2, metric 0, expires 0 gateway 2 idev6 
> >ffff8801131ab000 dev ffff88011816d000
> >[   59.270030] (iter): ffff880117e7bb00 dst 2000::1 plen 128 gateway 
> >2001:ffff::32, siblings 2, metric 0, expires 0 gateway 2 idev6 
> >ffff8801131ab000 dev ffff88011816d000
> >[   59.291933] (sibling): ffff880117e62480 dst 2000::1 plen 128 gateway 
> >2001:ffff::30, siblings 2, metric 0, expires 4194304 gateway 2 idev6 
> >ffff8801131ab000 dev ffff88011816d000
> >[   59.306893] (sibling increment): ffff880117e62480 dst 2000::1 plen 128 
> >gateway 2001:ffff::30, siblings 3, metric 0, expires 4194304 gateway 2 
> >idev6 ffff8801131ab000 dev ffff88011816d000
> I don't have the same output:
> [   97.945170] (rt): f1a02d80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::31, siblings 1, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.948117] (iter): f1a02e80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::30, siblings 0, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.951207] (sibling): f1a02e80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::30, siblings 0, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.954272] (sibling increment): f1a02e80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::30, siblings 1, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.957545] (rt): f1a02c80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::32, siblings 1, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.960376] (iter): f1a02e80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::30, siblings 1, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.961902] (rt): f1a02c80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::32, siblings 2, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.963095] (iter): f1a02d80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::31, siblings 1, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.964354] (sibling): f1a02e80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::30, siblings 1, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.965604] (sibling increment): f1a02e80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::30, siblings 2, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.966916] (sibling increment): f1a02d80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::31, siblings 2, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.968254] (rt): f1a02b80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::33, siblings 1, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.969467] (iter): f1a02e80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::30, siblings 2, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.970702] (rt): f1a02b80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::33, siblings 2, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.971895] (iter): f1a02d80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::31, siblings 2, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.973137] (rt): f1a02b80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::33, siblings 3, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.974331] (iter): f1a02c80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::32, siblings 2, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.975542] (sibling): f1a02e80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::30, siblings 2, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.976808] (sibling increment): f1a02e80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::30, siblings 3, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.978126] (sibling increment): f1a02d80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::31, siblings 3, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> [   97.979453] (sibling increment): f1a02c80 dst 2000:: plen 3 gateway 
> 2001:660:1234:5678::32, siblings 3, metric 1024, expires 0 gateway 2 idev6 
> f7507e00 dev f7793000
> 
> Can you send me the output of:
> ip -6 r
> ip -6 a
> 

Of course:

ip -6 r:
2001:700:1300:feed::134 via 2001:ffff::33 dev eth0.100  metric 0 
    cache  expires -7sec
2001:700:1300:feed::134 via fe80::5054:ff:fe82:e153 dev eth0  metric 0 
    cache 
2001:db8:ee8c:180::/64 dev eth0  proto kernel  metric 256  expires 86353sec
2001:ffff::/64 dev eth0.100  proto kernel  metric 256 
2000::/3 via 2001:ffff::30 dev eth0.100  metric 1024 
2000::/3 via 2001:ffff::31 dev eth0.100  metric 1024 
2000::/3 via 2001:ffff::32 dev eth0.100  metric 1024 
2000::/3 via 2001:ffff::33 dev eth0.100  metric 1024 
fe80::/64 dev eth0  proto kernel  metric 256 
fe80::/64 dev eth0.100  proto kernel  metric 256 
default via fe80::5054:ff:fe82:e153 dev eth0  proto ra  metric 1024  expires 1753sec

ip -6 a:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
    inet6 2001:db8:ee8c:180:5054:ff:fe53:c6c9/64 scope global dynamic 
       valid_lft 86392sec preferred_lft 14392sec
    inet6 fe80::5054:ff:fe53:c6c9/64 scope link 
       valid_lft forever preferred_lft forever
3: eth0.100@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
    inet6 2001:ffff::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe53:c6c9/64 scope link 
       valid_lft forever preferred_lft forever

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 14:10             ` Nicolas Dichtel
@ 2013-07-10 15:20               ` Hannes Frederic Sowa
  2013-07-10 15:59                 ` Hannes Frederic Sowa
  2013-07-10 21:21               ` Hannes Frederic Sowa
  1 sibling, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 15:20 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 04:10:58PM +0200, Nicolas Dichtel wrote:
> Le 10/07/2013 15:21, Hannes Frederic Sowa a écrit :
> >On Wed, Jul 10, 2013 at 02:22:55PM +0200, Nicolas Dichtel wrote:
> >>Le 10/07/2013 12:53, Hannes Frederic Sowa a écrit :
> >The route with rt6i_gateway does actually expire because I got it from
> >autoconf and ip -6 r l confirms this, too. It seems this is only the cached
> >route (I will confirm shortly). Is this still ok?
> I wonder why expires is 0. Even if this route is cached, the flag 
> RTF_EXPIRES should be set. Am I wrong?

It seems it is possible cached route gets its expiration updated. As
such it is not counted in the iteration but it is found as a sibling and its
nsiblings count is updated again.

(sorry for the additional mail, forgot to send it to the list)

Greetings,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 15:20               ` Hannes Frederic Sowa
@ 2013-07-10 15:59                 ` Hannes Frederic Sowa
  2013-07-10 16:35                   ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 15:59 UTC (permalink / raw)
  To: Nicolas Dichtel, netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 05:20:01PM +0200, Hannes Frederic Sowa wrote:
> On Wed, Jul 10, 2013 at 04:10:58PM +0200, Nicolas Dichtel wrote:
> > Le 10/07/2013 15:21, Hannes Frederic Sowa a écrit :
> > >On Wed, Jul 10, 2013 at 02:22:55PM +0200, Nicolas Dichtel wrote:
> > >>Le 10/07/2013 12:53, Hannes Frederic Sowa a écrit :
> > >The route with rt6i_gateway does actually expire because I got it from
> > >autoconf and ip -6 r l confirms this, too. It seems this is only the cached
> > >route (I will confirm shortly). Is this still ok?
> > I wonder why expires is 0. Even if this route is cached, the flag 
> > RTF_EXPIRES should be set. Am I wrong?
> 
> It seems it is possible cached route gets its expiration updated. As
> such it is not counted in the iteration but it is found as a sibling and its
> nsiblings count is updated again.

ip6_link_failure is the problem. We need to remove the route directly instead
of calling rt6_update_expires:

static void ip6_link_failure(struct sk_buff *skb)
{
        struct rt6_info *rt;

        icmpv6_send(skb, ICMPV6_DEST_UNREACH, ICMPV6_ADDR_UNREACH, 0);

        rt = (struct rt6_info *) skb_dst(skb);
        if (rt) {
                if (rt->rt6i_flags & RTF_CACHE)
                        rt6_update_expires(rt, 0);
                else if (rt->rt6i_node && (rt->rt6i_flags & RTF_DEFAULT))
                        rt->rt6i_node->fn_sernum = -1;
        }
}

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 15:59                 ` Hannes Frederic Sowa
@ 2013-07-10 16:35                   ` Hannes Frederic Sowa
  2013-07-11  8:07                     ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 16:35 UTC (permalink / raw)
  To: Nicolas Dichtel, netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 05:59:04PM +0200, Hannes Frederic Sowa wrote:
> On Wed, Jul 10, 2013 at 05:20:01PM +0200, Hannes Frederic Sowa wrote:
> > On Wed, Jul 10, 2013 at 04:10:58PM +0200, Nicolas Dichtel wrote:
> > > Le 10/07/2013 15:21, Hannes Frederic Sowa a écrit :
> > > >On Wed, Jul 10, 2013 at 02:22:55PM +0200, Nicolas Dichtel wrote:
> > > >>Le 10/07/2013 12:53, Hannes Frederic Sowa a écrit :
> > > >The route with rt6i_gateway does actually expire because I got it from
> > > >autoconf and ip -6 r l confirms this, too. It seems this is only the cached
> > > >route (I will confirm shortly). Is this still ok?
> > > I wonder why expires is 0. Even if this route is cached, the flag 
> > > RTF_EXPIRES should be set. Am I wrong?
> > 
> > It seems it is possible cached route gets its expiration updated. As
> > such it is not counted in the iteration but it is found as a sibling and its
> > nsiblings count is updated again.
> 
> ip6_link_failure is the problem. We need to remove the route directly instead
> of calling rt6_update_expires:
> 
> static void ip6_link_failure(struct sk_buff *skb)
> {
>         struct rt6_info *rt;
> 
>         icmpv6_send(skb, ICMPV6_DEST_UNREACH, ICMPV6_ADDR_UNREACH, 0);
> 
>         rt = (struct rt6_info *) skb_dst(skb);
>         if (rt) {
>                 if (rt->rt6i_flags & RTF_CACHE)
>                         rt6_update_expires(rt, 0);
>                 else if (rt->rt6i_node && (rt->rt6i_flags & RTF_DEFAULT))
>                         rt->rt6i_node->fn_sernum = -1;
>         }
> }

With this patch I do not get the panic any more and the routing table
stabilizes and discarded all NUD nonavailable routes. I'll need to review
it myself, first. Thanks for your help!

--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1096,10 +1096,12 @@ static void ip6_link_failure(struct sk_buff *skb)
 
        rt = (struct rt6_info *) skb_dst(skb);
        if (rt) {
-               if (rt->rt6i_flags & RTF_CACHE)
-                       rt6_update_expires(rt, 0);
-               else if (rt->rt6i_node && (rt->rt6i_flags & RTF_DEFAULT))
+               if (rt->rt6i_flags & RTF_CACHE) {
+                       dst_hold(&rt->dst);
+                       ip6_del_rt(rt);
+               } else if (rt->rt6i_node && (rt->rt6i_flags & RTF_DEFAULT)) {
                        rt->rt6i_node->fn_sernum = -1;
+               }
        }
 }
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 14:10             ` Nicolas Dichtel
  2013-07-10 15:20               ` Hannes Frederic Sowa
@ 2013-07-10 21:21               ` Hannes Frederic Sowa
  2013-07-11  8:04                 ` Nicolas Dichtel
  1 sibling, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-10 21:21 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Wed, Jul 10, 2013 at 04:10:58PM +0200, Nicolas Dichtel wrote:
> I wonder why expires is 0. Even if this route is cached, the flag 
> RTF_EXPIRES should be set. Am I wrong?

rt6_set_from deliberately clears the RTF_EXPIRES when creating a cached copy
of the route if the route is an autoconfigured default route.

Maybe the criterion for exclusion of which routes can get into an ecmp route
set should be revisited? This could result in strange effects for users
working with two interfaces, both receiving a RA with default routes.

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 21:21               ` Hannes Frederic Sowa
@ 2013-07-11  8:04                 ` Nicolas Dichtel
  2013-07-11 10:24                   ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-11  8:04 UTC (permalink / raw)
  To: hannes; +Cc: netdev, yoshfuji, petrus.lt, davem

Le 10/07/2013 23:21, Hannes Frederic Sowa a écrit :
> On Wed, Jul 10, 2013 at 04:10:58PM +0200, Nicolas Dichtel wrote:
>> I wonder why expires is 0. Even if this route is cached, the flag
>> RTF_EXPIRES should be set. Am I wrong?
>
> rt6_set_from deliberately clears the RTF_EXPIRES when creating a cached copy
> of the route if the route is an autoconfigured default route.
>
> Maybe the criterion for exclusion of which routes can get into an ecmp route
> set should be revisited? This could result in strange effects for users
> working with two interfaces, both receiving a RA with default routes.
Agreed. Here is a proposal, what do you think?

[PATCH] ipv6: don't use autoconfigured route for ecmp

The intention was already there by checking the flag RTF_EXPIRES, but this flag
is removed from the default route by rt6_set_from() when this route is cached.

Let's add a check against RTF_ADDRCONF.

Spotted-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
  net/ipv6/ip6_fib.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 192dd1a..87e31c6 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -686,14 +686,14 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct 
rt6_info *rt,
  			 * of siblings, and later we will add our route to the
  			 * list.
  			 * Only static routes (which don't have flag
-			 * RTF_EXPIRES) are used for ECMPv6.
+			 * RTF_EXPIRES nor RTF_ADDRCONF) are used for ECMPv6.
  			 *
  			 * To avoid long list, we only had siblings if the
  			 * route have a gateway.
  			 */
  			if (rt->rt6i_flags & RTF_GATEWAY &&
-			    !(rt->rt6i_flags & RTF_EXPIRES) &&
-			    !(iter->rt6i_flags & RTF_EXPIRES))
+			    !(rt->rt6i_flags & (RTF_EXPIRES|RTF_ADDRCONF)) &&
+			    !(iter->rt6i_flags & (RTF_EXPIRES|RTF_ADDRCONF)))
  				rt->rt6i_nsiblings++;
  		}

-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-10 16:35                   ` Hannes Frederic Sowa
@ 2013-07-11  8:07                     ` Nicolas Dichtel
  0 siblings, 0 replies; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-11  8:07 UTC (permalink / raw)
  To: hannes; +Cc: netdev, yoshfuji, petrus.lt, davem

Le 10/07/2013 18:35, Hannes Frederic Sowa a écrit :
> On Wed, Jul 10, 2013 at 05:59:04PM +0200, Hannes Frederic Sowa wrote:
>> On Wed, Jul 10, 2013 at 05:20:01PM +0200, Hannes Frederic Sowa wrote:
>>> On Wed, Jul 10, 2013 at 04:10:58PM +0200, Nicolas Dichtel wrote:
>>>> Le 10/07/2013 15:21, Hannes Frederic Sowa a écrit :
>>>>> On Wed, Jul 10, 2013 at 02:22:55PM +0200, Nicolas Dichtel wrote:
>>>>>> Le 10/07/2013 12:53, Hannes Frederic Sowa a écrit :
>>>>> The route with rt6i_gateway does actually expire because I got it from
>>>>> autoconf and ip -6 r l confirms this, too. It seems this is only the cached
>>>>> route (I will confirm shortly). Is this still ok?
>>>> I wonder why expires is 0. Even if this route is cached, the flag
>>>> RTF_EXPIRES should be set. Am I wrong?
>>>
>>> It seems it is possible cached route gets its expiration updated. As
>>> such it is not counted in the iteration but it is found as a sibling and its
>>> nsiblings count is updated again.
>>
>> ip6_link_failure is the problem. We need to remove the route directly instead
>> of calling rt6_update_expires:
>>
>> static void ip6_link_failure(struct sk_buff *skb)
>> {
>>          struct rt6_info *rt;
>>
>>          icmpv6_send(skb, ICMPV6_DEST_UNREACH, ICMPV6_ADDR_UNREACH, 0);
>>
>>          rt = (struct rt6_info *) skb_dst(skb);
>>          if (rt) {
>>                  if (rt->rt6i_flags & RTF_CACHE)
>>                          rt6_update_expires(rt, 0);
>>                  else if (rt->rt6i_node && (rt->rt6i_flags & RTF_DEFAULT))
>>                          rt->rt6i_node->fn_sernum = -1;
>>          }
>> }
>
> With this patch I do not get the panic any more and the routing table
> stabilizes and discarded all NUD nonavailable routes. I'll need to review
> it myself, first. Thanks for your help!

Good catch!
Thank you.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-11  8:04                 ` Nicolas Dichtel
@ 2013-07-11 10:24                   ` Hannes Frederic Sowa
  2013-07-11 14:46                     ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-11 10:24 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Thu, Jul 11, 2013 at 10:04:47AM +0200, Nicolas Dichtel wrote:
> Le 10/07/2013 23:21, Hannes Frederic Sowa a écrit :
> >On Wed, Jul 10, 2013 at 04:10:58PM +0200, Nicolas Dichtel wrote:
> >>I wonder why expires is 0. Even if this route is cached, the flag
> >>RTF_EXPIRES should be set. Am I wrong?
> >
> >rt6_set_from deliberately clears the RTF_EXPIRES when creating a cached 
> >copy
> >of the route if the route is an autoconfigured default route.
> >
> >Maybe the criterion for exclusion of which routes can get into an ecmp 
> >route
> >set should be revisited? This could result in strange effects for users
> >working with two interfaces, both receiving a RA with default routes.
> Agreed. Here is a proposal, what do you think?
> 
> [PATCH] ipv6: don't use autoconfigured route for ecmp
> 
> The intention was already there by checking the flag RTF_EXPIRES, but this 
> flag
> is removed from the default route by rt6_set_from() when this route is 
> cached.
> 
> Let's add a check against RTF_ADDRCONF.
> 
> Spotted-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

I do think the patch is ok. I just wanted to show you a solution I did
on my laptop this night and did not have the time to send yesterday. It
is only compile-tested.

I did strengthen the RTF_EXPIRES check a bit. Also I am not sure what
did stop the search for the first route with the same metric to find a
RTF_EXPIRES route, so I also added the guard there, too.

I fear, I'll need to do a bit more research.

Thanks!

[PATCH RFC] ipv6: routes only qualify for ecmp if their original routes do not expire

Cloned routes get their RTF_EXPIRES flag reset if they are autoconfigured
default routes. Because these routes should not end up in the ecmp route
set, we check if the original route has the RTF_EXPIRES flag set.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/ip6_fib.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 192dd1a..3bd5fcd 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -632,6 +632,18 @@ insert_above:
 	return ln;
 }
 
+static bool rt6_qualify_for_ecmp(struct rt6_info *rt0)
+{
+	struct rt6_info *rt;
+
+	if (!(rt0->rt6i_flags & RTF_GATEWAY))
+		return false;
+
+	for (rt = rt0; rt && !(rt->rt6i_flags & RTF_EXPIRES);
+	     rt = (struct rt6_info *)rt->dst.from);
+	return !(rt && (rt->rt6i_flags & RTF_EXPIRES));
+}
+
 /*
  *	Insert routing information in a node.
  */
@@ -691,9 +703,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 			 * To avoid long list, we only had siblings if the
 			 * route have a gateway.
 			 */
-			if (rt->rt6i_flags & RTF_GATEWAY &&
-			    !(rt->rt6i_flags & RTF_EXPIRES) &&
-			    !(iter->rt6i_flags & RTF_EXPIRES))
+			if (rt6_qualify_for_ecmp(rt) &&
+			    rt6_qualify_for_ecmp(iter))
 				rt->rt6i_nsiblings++;
 		}
 
@@ -715,7 +726,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 		/* Find the first route that have the same metric */
 		sibling = fn->leaf;
 		while (sibling) {
-			if (sibling->rt6i_metric == rt->rt6i_metric) {
+			if (sibling->rt6i_metric == rt->rt6i_metric &&
+			    rt6_qualify_for_ecmp(sibling)) {
 				list_add_tail(&rt->rt6i_siblings,
 					      &sibling->rt6i_siblings);
 				break;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-11 10:24                   ` Hannes Frederic Sowa
@ 2013-07-11 14:46                     ` Hannes Frederic Sowa
  2013-07-11 14:57                       ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-11 14:46 UTC (permalink / raw)
  To: Nicolas Dichtel, netdev, yoshfuji, petrus.lt, davem

On Thu, Jul 11, 2013 at 12:24:41PM +0200, Hannes Frederic Sowa wrote:
> I fear, I'll need to do a bit more research.

My proposal is to take my patch and check for RTF_ADDRCONF plus RTF_DYNAMIC,
too. The RTF_DYNAMIC check would prevent routes created from icmpv6 redirects
entering an ecmp route set.

Do you agree?

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-11 14:46                     ` Hannes Frederic Sowa
@ 2013-07-11 14:57                       ` Nicolas Dichtel
  2013-07-12  8:51                         ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-11 14:57 UTC (permalink / raw)
  To: hannes; +Cc: netdev, yoshfuji, petrus.lt, davem

Le 11/07/2013 16:46, Hannes Frederic Sowa a écrit :
> On Thu, Jul 11, 2013 at 12:24:41PM +0200, Hannes Frederic Sowa wrote:
>> I fear, I'll need to do a bit more research.
>
> My proposal is to take my patch and check for RTF_ADDRCONF plus RTF_DYNAMIC,
> too. The RTF_DYNAMIC check would prevent routes created from icmpv6 redirects
> entering an ecmp route set.
>
> Do you agree?
Yes.


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-11 14:57                       ` Nicolas Dichtel
@ 2013-07-12  8:51                         ` Hannes Frederic Sowa
  2013-07-12 12:04                           ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-12  8:51 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

Hello Nicolas,

On Thu, Jul 11, 2013 at 04:57:42PM +0200, Nicolas Dichtel wrote:
> Le 11/07/2013 16:46, Hannes Frederic Sowa a écrit :
> >On Thu, Jul 11, 2013 at 12:24:41PM +0200, Hannes Frederic Sowa wrote:
> >>I fear, I'll need to do a bit more research.
> >
> >My proposal is to take my patch and check for RTF_ADDRCONF plus 
> >RTF_DYNAMIC,
> >too. The RTF_DYNAMIC check would prevent routes created from icmpv6 
> >redirects
> >entering an ecmp route set.
> >
> >Do you agree?
> Yes.

There is still some window where things go wrong now, I fear. If we have ecmp
routes active and we update the pmtu of that rt6_info, we might end up with a
route in the ecmp set, which might not get recountet if another ecmp route
joins the set. I will have to think how to deal with this. Do you have an
idea?

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-12  8:51                         ` Hannes Frederic Sowa
@ 2013-07-12 12:04                           ` Nicolas Dichtel
  2013-07-12 16:19                             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-12 12:04 UTC (permalink / raw)
  To: hannes; +Cc: netdev, yoshfuji, petrus.lt, davem

Le 12/07/2013 10:51, Hannes Frederic Sowa a écrit :
> Hello Nicolas,
>
> On Thu, Jul 11, 2013 at 04:57:42PM +0200, Nicolas Dichtel wrote:
>> Le 11/07/2013 16:46, Hannes Frederic Sowa a écrit :
>>> On Thu, Jul 11, 2013 at 12:24:41PM +0200, Hannes Frederic Sowa wrote:
>>>> I fear, I'll need to do a bit more research.
>>>
>>> My proposal is to take my patch and check for RTF_ADDRCONF plus
>>> RTF_DYNAMIC,
>>> too. The RTF_DYNAMIC check would prevent routes created from icmpv6
>>> redirects
>>> entering an ecmp route set.
>>>
>>> Do you agree?
>> Yes.
>
> There is still some window where things go wrong now, I fear. If we have ecmp
> routes active and we update the pmtu of that rt6_info, we might end up with a
> route in the ecmp set, which might not get recountet if another ecmp route
> joins the set. I will have to think how to deal with this. Do you have an
> idea?
It's possible to add a glue to check this counter when we play with these flags, 
but it's ugly.

Maybe the check against RTF_EXPIRES is fundamentally wrong. Checking 
RTF_ADDRCONF|RTF_DYNAMIC should be enough, what do you think?

In another hand, we can discuss about the initial assumption, that was "only 
static routes are part of ECMP routes". I'm thinking of what are the consequence 
if we accept to accept all routes, without checking any flags.


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-12 12:04                           ` Nicolas Dichtel
@ 2013-07-12 16:19                             ` Hannes Frederic Sowa
  2013-07-12 19:01                               ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-12 16:19 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Fri, Jul 12, 2013 at 02:04:45PM +0200, Nicolas Dichtel wrote:
> It's possible to add a glue to check this counter when we play with these 
> flags, but it's ugly.
> 
> Maybe the check against RTF_EXPIRES is fundamentally wrong. Checking 
> RTF_ADDRCONF|RTF_DYNAMIC should be enough, what do you think?

Yes, this seems to be the best option now. I will audit the source if
RTF_ADDRCONF and RTF_DYNAMIC are immutable after dst construction and
if other errors could arise for that and would go with this solution then.

What do you think about making ecmp routes explicit by adding RTF_ECMP
flag?

> In another hand, we can discuss about the initial assumption, that was 
> "only static routes are part of ECMP routes". I'm thinking of what are the 
> consequence if we accept to accept all routes, without checking any flags.

I don't have a good feeling about that. But I may be wrong.

Greetings,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-12 16:19                             ` Hannes Frederic Sowa
@ 2013-07-12 19:01                               ` Nicolas Dichtel
  2013-07-12 19:20                                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2013-07-12 19:01 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev, yoshfuji, petrus.lt, davem

Le 12/07/2013 18:19, Hannes Frederic Sowa a écrit :
> On Fri, Jul 12, 2013 at 02:04:45PM +0200, Nicolas Dichtel wrote:
>> It's possible to add a glue to check this counter when we play with these
>> flags, but it's ugly.
>>
>> Maybe the check against RTF_EXPIRES is fundamentally wrong. Checking
>> RTF_ADDRCONF|RTF_DYNAMIC should be enough, what do you think?
>
> Yes, this seems to be the best option now. I will audit the source if
> RTF_ADDRCONF and RTF_DYNAMIC are immutable after dst construction and
> if other errors could arise for that and would go with this solution then.
>
> What do you think about making ecmp routes explicit by adding RTF_ECMP
> flag?
Why not, but you will have to be careful on insertion and deletion. Next hop can 
be added one by one and deleted one by one.

>
>> In another hand, we can discuss about the initial assumption, that was
>> "only static routes are part of ECMP routes". I'm thinking of what are the
>> consequence if we accept to accept all routes, without checking any flags.
>
> I don't have a good feeling about that. But I may be wrong.
Same for me, but for now, I don't have any argument ;-) The above solution is 
probably the better way for now.


Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-12 19:01                               ` Nicolas Dichtel
@ 2013-07-12 19:20                                 ` Hannes Frederic Sowa
  2013-07-12 21:48                                   ` Hannes Frederic Sowa
  0 siblings, 1 reply; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-12 19:20 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, yoshfuji, petrus.lt, davem

On Fri, Jul 12, 2013 at 09:01:17PM +0200, Nicolas Dichtel wrote:
> Le 12/07/2013 18:19, Hannes Frederic Sowa a écrit :
> >On Fri, Jul 12, 2013 at 02:04:45PM +0200, Nicolas Dichtel wrote:
> >>It's possible to add a glue to check this counter when we play with these
> >>flags, but it's ugly.
> >>
> >>Maybe the check against RTF_EXPIRES is fundamentally wrong. Checking
> >>RTF_ADDRCONF|RTF_DYNAMIC should be enough, what do you think?
> >
> >Yes, this seems to be the best option now. I will audit the source if
> >RTF_ADDRCONF and RTF_DYNAMIC are immutable after dst construction and
> >if other errors could arise for that and would go with this solution then.
> >
> >What do you think about making ecmp routes explicit by adding RTF_ECMP
> >flag?
> Why not, but you will have to be careful on insertion and deletion. Next 
> hop can be added one by one and deleted one by one.

Ok, we can have a look to do so in -next.

> >
> >>In another hand, we can discuss about the initial assumption, that was
> >>"only static routes are part of ECMP routes". I'm thinking of what are the
> >>consequence if we accept to accept all routes, without checking any flags.
> >
> >I don't have a good feeling about that. But I may be wrong.
> Same for me, but for now, I don't have any argument ;-) The above solution 
> is probably the better way for now.

To go without RTF_EXPIRES seems the way to go. I still am unsure if we need to
propagate the RTF_DYNAMIC flag in case we update the expiration date on a
route.

--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -682,6 +682,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
                                        rt->rt6i_nsiblings = 0;
                                if (!(iter->rt6i_flags & RTF_EXPIRES))
                                        return -EEXIST;
+                               iter->rt6i_flags |= rt->rt6i_flags & RTF_DYNAMIC;
                                if (!(rt->rt6i_flags & RTF_EXPIRES))
                                        rt6_clean_expires(iter);
                                else

I hope to have identified all possible site-effects later today.

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
  2013-07-12 19:20                                 ` Hannes Frederic Sowa
@ 2013-07-12 21:48                                   ` Hannes Frederic Sowa
  0 siblings, 0 replies; 30+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-12 21:48 UTC (permalink / raw)
  To: Nicolas Dichtel, netdev, yoshfuji, petrus.lt, davem

On Fri, Jul 12, 2013 at 09:20:23PM +0200, Hannes Frederic Sowa wrote:
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -682,6 +682,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>                                         rt->rt6i_nsiblings = 0;
>                                 if (!(iter->rt6i_flags & RTF_EXPIRES))
>                                         return -EEXIST;
> +                               iter->rt6i_flags |= rt->rt6i_flags & RTF_DYNAMIC;
>                                 if (!(rt->rt6i_flags & RTF_EXPIRES))
>                                         rt6_clean_expires(iter);
>                                 else
> 

This was not necessary and just complicated things because we would have to
special-case the toggling of the RTF_DYNAMIC flag to also get removed from the
ecmp route set.

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2013-07-12 21:48 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-07 17:30 [PATCH RFC] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF Hannes Frederic Sowa
2013-07-09 21:57 ` Hannes Frederic Sowa
2013-07-10  7:54   ` Nicolas Dichtel
2013-07-10  9:28     ` Nicolas Dichtel
2013-07-10 10:53       ` Hannes Frederic Sowa
2013-07-10 12:22         ` Nicolas Dichtel
2013-07-10 13:21           ` Hannes Frederic Sowa
2013-07-10 14:10             ` Nicolas Dichtel
2013-07-10 15:20               ` Hannes Frederic Sowa
2013-07-10 15:59                 ` Hannes Frederic Sowa
2013-07-10 16:35                   ` Hannes Frederic Sowa
2013-07-11  8:07                     ` Nicolas Dichtel
2013-07-10 21:21               ` Hannes Frederic Sowa
2013-07-11  8:04                 ` Nicolas Dichtel
2013-07-11 10:24                   ` Hannes Frederic Sowa
2013-07-11 14:46                     ` Hannes Frederic Sowa
2013-07-11 14:57                       ` Nicolas Dichtel
2013-07-12  8:51                         ` Hannes Frederic Sowa
2013-07-12 12:04                           ` Nicolas Dichtel
2013-07-12 16:19                             ` Hannes Frederic Sowa
2013-07-12 19:01                               ` Nicolas Dichtel
2013-07-12 19:20                                 ` Hannes Frederic Sowa
2013-07-12 21:48                                   ` Hannes Frederic Sowa
2013-07-10 11:15     ` Hannes Frederic Sowa
2013-07-10 11:40       ` Hannes Frederic Sowa
2013-07-10 12:08       ` Nicolas Dichtel
2013-07-10 13:17         ` Hannes Frederic Sowa
2013-07-10 13:49           ` Hannes Frederic Sowa
2013-07-10 14:30             ` Nicolas Dichtel
2013-07-10 14:34               ` Hannes Frederic Sowa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).