All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH] Failed neighbors attached to routes are not released
@ 2011-08-19 18:53 Guy Yur
  2011-08-22 16:44 ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Guy Yur @ 2011-08-19 18:53 UTC (permalink / raw)
  To: netdev

Hi,

The issue I am seeing is with a neighbor used as a gateway in a route,
when the neighbor gets nud FAILED it will not be removed from the
neighbor cache.
The reference count for the neighbor remains > 1
when neigh_periodic_work() is called.

Issue noticed with IPv6 neighbors on Arch Linux with kernel 3.0.3
kernel config includes CONFIG_IPV6_ROUTER_PREF

The problem affects routing when the neighbor loses connectivity and returns.

Scenario: Using a default route and a static route through different interfaces.
When the neighbor gateway of the static route goes down the traffic
will move to the default gateway as expected.
Once the static route neighbor comes back up it is not asked for
neighbor solicitation
because the route is marked as FAILED and the traffic will continue to
pass through the default gateway.

Steps to reproduce the route not being removed:
1. add an IPv6 address on an interface
2. add a route to a network through a gateway on the interface's network
3. make sure the gateway address is not reachable
4. ping6 a host in the route network
5. "ip -6 nei" will show the gateway neighbor as FAILED and it won't be released

Steps to reproduce the routing problem:
1. client and two gateway machines (A and B)
2. on the client define a static route through A and a default route through B
3. disconnect eth on A
4. ping6 a host in the network that should go through A
   after a while the traffic will move through B which is the default route
5. reconnect eth on A
6. ping6 a host in the network again, the traffic will still go through B
   "ip -6 nei" on the client will still show A as FAILED


Patch to change the nud state to NONE if the reference count > 1
allowing the neighbor to be released in a future pass.

--- linux/net/core/neighbour.c.orig	2011-08-19 13:15:35.041524169 +0300
+++ linux/net/core/neighbour.c	2011-08-19 18:19:07.271276096 +0300
@@ -802,14 +802,16 @@ static void neigh_periodic_work(struct w
 			if (time_before(n->used, n->confirmed))
 				n->used = n->confirmed;

-			if (atomic_read(&n->refcnt) == 1 &&
-			    (state == NUD_FAILED ||
+			if ((state == NUD_FAILED ||
 			     time_after(jiffies, n->used + n->parms->gc_staletime))) {
-				*np = n->next;
-				n->dead = 1;
-				write_unlock(&n->lock);
-				neigh_cleanup_and_release(n);
-				continue;
+				if (atomic_read(&n->refcnt) == 1) {
+					*np = n->next;
+					n->dead = 1;
+					write_unlock(&n->lock);
+					neigh_cleanup_and_release(n);
+					continue;
+				} else
+					n->nud_state = NUD_NONE;
 			}
 			write_unlock(&n->lock);

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] Failed neighbors attached to routes are not released
  2011-08-19 18:53 [RFC][PATCH] Failed neighbors attached to routes are not released Guy Yur
@ 2011-08-22 16:44 ` Eric Dumazet
  2011-08-22 19:49   ` Guy Yur
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2011-08-22 16:44 UTC (permalink / raw)
  To: Guy Yur; +Cc: netdev

Le vendredi 19 août 2011 à 21:53 +0300, Guy Yur a écrit :
> Hi,
> 
> The issue I am seeing is with a neighbor used as a gateway in a route,
> when the neighbor gets nud FAILED it will not be removed from the
> neighbor cache.
> The reference count for the neighbor remains > 1
> when neigh_periodic_work() is called.
> 
> Issue noticed with IPv6 neighbors on Arch Linux with kernel 3.0.3
> kernel config includes CONFIG_IPV6_ROUTER_PREF
> 
> The problem affects routing when the neighbor loses connectivity and returns.
> 
> Scenario: Using a default route and a static route through different interfaces.
> When the neighbor gateway of the static route goes down the traffic
> will move to the default gateway as expected.
> Once the static route neighbor comes back up it is not asked for
> neighbor solicitation
> because the route is marked as FAILED and the traffic will continue to
> pass through the default gateway.
> 
> Steps to reproduce the route not being removed:
> 1. add an IPv6 address on an interface
> 2. add a route to a network through a gateway on the interface's network
> 3. make sure the gateway address is not reachable
> 4. ping6 a host in the route network
> 5. "ip -6 nei" will show the gateway neighbor as FAILED and it won't be released
> 
> Steps to reproduce the routing problem:
> 1. client and two gateway machines (A and B)
> 2. on the client define a static route through A and a default route through B
> 3. disconnect eth on A
> 4. ping6 a host in the network that should go through A
>    after a while the traffic will move through B which is the default route
> 5. reconnect eth on A
> 6. ping6 a host in the network again, the traffic will still go through B
>    "ip -6 nei" on the client will still show A as FAILED
> 
> 
> Patch to change the nud state to NONE if the reference count > 1
> allowing the neighbor to be released in a future pass.

I wonder why a 'future pass' is needed at all.

Shouldnt we immediately detect link becomes alive and force an immediate
flush at this point, before waiting a garbage collect timer ?


> 
> --- linux/net/core/neighbour.c.orig	2011-08-19 13:15:35.041524169 +0300
> +++ linux/net/core/neighbour.c	2011-08-19 18:19:07.271276096 +0300
> @@ -802,14 +802,16 @@ static void neigh_periodic_work(struct w
>  			if (time_before(n->used, n->confirmed))
>  				n->used = n->confirmed;
> 
> -			if (atomic_read(&n->refcnt) == 1 &&
> -			    (state == NUD_FAILED ||
> +			if ((state == NUD_FAILED ||
>  			     time_after(jiffies, n->used + n->parms->gc_staletime))) {
> -				*np = n->next;
> -				n->dead = 1;
> -				write_unlock(&n->lock);
> -				neigh_cleanup_and_release(n);
> -				continue;
> +				if (atomic_read(&n->refcnt) == 1) {
> +					*np = n->next;
> +					n->dead = 1;
> +					write_unlock(&n->lock);
> +					neigh_cleanup_and_release(n);
> +					continue;
> +				} else
> +					n->nud_state = NUD_NONE;
>  			}
>  			write_unlock(&n->lock);
> --



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] Failed neighbors attached to routes are not released
  2011-08-22 16:44 ` Eric Dumazet
@ 2011-08-22 19:49   ` Guy Yur
  2011-08-22 21:17     ` Guy Yur
  0 siblings, 1 reply; 6+ messages in thread
From: Guy Yur @ 2011-08-22 19:49 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet

On Mon, Aug 22, 2011 at 7:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 19 août 2011 à 21:53 +0300, Guy Yur a écrit :
>> Hi,
>>
>> The issue I am seeing is with a neighbor used as a gateway in a route,
>> when the neighbor gets nud FAILED it will not be removed from the
>> neighbor cache.
>> The reference count for the neighbor remains > 1
>> when neigh_periodic_work() is called.
>>
>> Issue noticed with IPv6 neighbors on Arch Linux with kernel 3.0.3
>> kernel config includes CONFIG_IPV6_ROUTER_PREF
>>
>> The problem affects routing when the neighbor loses connectivity and returns.
>>
>> Scenario: Using a default route and a static route through different interfaces.
>> When the neighbor gateway of the static route goes down the traffic
>> will move to the default gateway as expected.
>> Once the static route neighbor comes back up it is not asked for
>> neighbor solicitation
>> because the route is marked as FAILED and the traffic will continue to
>> pass through the default gateway.
>>
>> Steps to reproduce the route not being removed:
>> 1. add an IPv6 address on an interface
>> 2. add a route to a network through a gateway on the interface's network
>> 3. make sure the gateway address is not reachable
>> 4. ping6 a host in the route network
>> 5. "ip -6 nei" will show the gateway neighbor as FAILED and it won't be released
>>
>> Steps to reproduce the routing problem:
>> 1. client and two gateway machines (A and B)
>> 2. on the client define a static route through A and a default route through B
>> 3. disconnect eth on A
>> 4. ping6 a host in the network that should go through A
>>    after a while the traffic will move through B which is the default route
>> 5. reconnect eth on A
>> 6. ping6 a host in the network again, the traffic will still go through B
>>    "ip -6 nei" on the client will still show A as FAILED
>>
>>
>> Patch to change the nud state to NONE if the reference count > 1
>> allowing the neighbor to be released in a future pass.
>
> I wonder why a 'future pass' is needed at all.
>
> Shouldnt we immediately detect link becomes alive and force an immediate
> flush at this point, before waiting a garbage collect timer ?
>

I tested removing the neighbor as done in neigh_flush_dev() instead of
just setting to NUD_NONE,
it seems that the neighbor shouldn't be removed if it has routes attached to it.
If the neighbor is removed the route won't be considered at all.

The problem of the state remaining FAILED does need to be handled.
Like the case of no neighbor, a FAILED state means find_match() won't
try to send neighbor solicitation probes.
The neighbor will remain in a state of NUD_FAILED and the route won't
be used, only a direct communication with the neighbor will change its
state.

Do you think it is better to change the state from NUD_FAILED to
NUD_NONE in neigh_timer_handler() instead?

Updated patch to set the state to NUD_NOARP if the state was valid,
same as done in neigh_flush_dev().
Not sure if it is needed or the state can be set to NUD_NONE in all cases.

--- linux/net/core/neighbour.c.orig	2011-07-22 05:17:23.000000000 +0300
+++ linux/net/core/neighbour.c	2011-08-22 22:03:48.790689600 +0300
@@ -802,14 +802,20 @@
 			if (time_before(n->used, n->confirmed))
 				n->used = n->confirmed;

-			if (atomic_read(&n->refcnt) == 1 &&
-			    (state == NUD_FAILED ||
-			     time_after(jiffies, n->used + n->parms->gc_staletime))) {
+			if (state == NUD_FAILED ||
+			     time_after(jiffies, n->used + n->parms->gc_staletime)) {
 				*np = n->next;
-				n->dead = 1;
-				write_unlock(&n->lock);
-				neigh_cleanup_and_release(n);
-				continue;
+				if (atomic_read(&n->refcnt) == 1) {
+					n->dead = 1;
+					write_unlock(&n->lock);
+					neigh_cleanup_and_release(n);
+					continue;
+				} else {
+					if (n->nud_state & NUD_VALID)
+						n->nud_state = NUD_NOARP;
+					else
+						n->nud_state = NUD_NONE;
+				}
 			}
 			write_unlock(&n->lock);

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] Failed neighbors attached to routes are not released
  2011-08-22 19:49   ` Guy Yur
@ 2011-08-22 21:17     ` Guy Yur
  0 siblings, 0 replies; 6+ messages in thread
From: Guy Yur @ 2011-08-22 21:17 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet

On Mon, Aug 22, 2011 at 10:49 PM, Guy Yur <guyyur@gmail.com> wrote:
> On Mon, Aug 22, 2011 at 7:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Le vendredi 19 août 2011 à 21:53 +0300, Guy Yur a écrit :
>>> Hi,
>>>
>>> The issue I am seeing is with a neighbor used as a gateway in a route,
>>> when the neighbor gets nud FAILED it will not be removed from the
>>> neighbor cache.
>>> The reference count for the neighbor remains > 1
>>> when neigh_periodic_work() is called.
>>>
>>> Issue noticed with IPv6 neighbors on Arch Linux with kernel 3.0.3
>>> kernel config includes CONFIG_IPV6_ROUTER_PREF
>>>
>>> The problem affects routing when the neighbor loses connectivity and returns.
>>>
>>> Scenario: Using a default route and a static route through different interfaces.
>>> When the neighbor gateway of the static route goes down the traffic
>>> will move to the default gateway as expected.
>>> Once the static route neighbor comes back up it is not asked for
>>> neighbor solicitation
>>> because the route is marked as FAILED and the traffic will continue to
>>> pass through the default gateway.
>>>
>>> Steps to reproduce the route not being removed:
>>> 1. add an IPv6 address on an interface
>>> 2. add a route to a network through a gateway on the interface's network
>>> 3. make sure the gateway address is not reachable
>>> 4. ping6 a host in the route network
>>> 5. "ip -6 nei" will show the gateway neighbor as FAILED and it won't be released
>>>
>>> Steps to reproduce the routing problem:
>>> 1. client and two gateway machines (A and B)
>>> 2. on the client define a static route through A and a default route through B
>>> 3. disconnect eth on A
>>> 4. ping6 a host in the network that should go through A
>>>    after a while the traffic will move through B which is the default route
>>> 5. reconnect eth on A
>>> 6. ping6 a host in the network again, the traffic will still go through B
>>>    "ip -6 nei" on the client will still show A as FAILED
>>>
>>>
>>> Patch to change the nud state to NONE if the reference count > 1
>>> allowing the neighbor to be released in a future pass.
>>
>> I wonder why a 'future pass' is needed at all.
>>
>> Shouldnt we immediately detect link becomes alive and force an immediate
>> flush at this point, before waiting a garbage collect timer ?
>>
>
> I tested removing the neighbor as done in neigh_flush_dev() instead of
> just setting to NUD_NONE,
> it seems that the neighbor shouldn't be removed if it has routes attached to it.
> If the neighbor is removed the route won't be considered at all.
>
> The problem of the state remaining FAILED does need to be handled.
> Like the case of no neighbor, a FAILED state means find_match() won't
> try to send neighbor solicitation probes.
> The neighbor will remain in a state of NUD_FAILED and the route won't
> be used, only a direct communication with the neighbor will change its
> state.
>
> Do you think it is better to change the state from NUD_FAILED to
> NUD_NONE in neigh_timer_handler() instead?
>
> Updated patch to set the state to NUD_NOARP if the state was valid,
> same as done in neigh_flush_dev().
> Not sure if it is needed or the state can be set to NUD_NONE in all cases.
>

Previous patch with NOARP doesn't work.

New patch to only deal with FAILD state.

when using a neighbor through a route:
1. If no traffic is passed through the neighbor, it will move to STALE
state and remain there as before.
2. If the neighbor is unreachable, it will become FAILED.
    from then it will change periodically to NONE and the reachability
tests will resume.
    If the neighbor becomes reachable again, once the periodic handler
sets it to NONE and the routing code probes it, traffic will resume
through it.

--- linux/net/core/neighbour.c.orig	2011-07-22 05:17:23.000000000 +0300
+++ linux/net/core/neighbour.c	2011-08-22 23:53:21.448424900 +0300
@@ -802,15 +802,18 @@
 			if (time_before(n->used, n->confirmed))
 				n->used = n->confirmed;

-			if (atomic_read(&n->refcnt) == 1 &&
-			    (state == NUD_FAILED ||
-			     time_after(jiffies, n->used + n->parms->gc_staletime))) {
-				*np = n->next;
-				n->dead = 1;
-				write_unlock(&n->lock);
-				neigh_cleanup_and_release(n);
-				continue;
+			if (state == NUD_FAILED ||
+			     time_after(jiffies, n->used + n->parms->gc_staletime)) {
+				if (atomic_read(&n->refcnt) == 1) {
+					*np = n->next;
+					n->dead = 1;
+					write_unlock(&n->lock);
+					neigh_cleanup_and_release(n);
+					continue;
+				} else if (state == NUD_FAILED)
+					n->nud_state = NUD_NONE;
 			}
 			write_unlock(&n->lock);

 next_elt:

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] Failed neighbors attached to routes are not released
  2011-08-19 18:08 Guy Yur
@ 2011-08-19 18:41 ` David Miller
  0 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2011-08-19 18:41 UTC (permalink / raw)
  To: guyyur; +Cc: linux-kernel


This will get more traction if you post it to the networking development
list (netdev@vger.kernel.org), as most of the core networking developers
do not read linux-kernel.

Also, posting patches to netdev gets that patch queued up and tracked
in our networking patch tracking system at:

	http://patchwork.ozlabs.org/project/netdev/list/

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC][PATCH] Failed neighbors attached to routes are not released
@ 2011-08-19 18:08 Guy Yur
  2011-08-19 18:41 ` David Miller
  0 siblings, 1 reply; 6+ messages in thread
From: Guy Yur @ 2011-08-19 18:08 UTC (permalink / raw)
  To: linux-kernel

Hi,

The issue I am seeing is with a neighbor used as a gateway in a route,
when the neighbor gets nud FAILED it will not be removed from the
neighbor cache.
The reference count for the neighbor remains > 1 when
neigh_periodic_work() is called.

Issue noticed with IPv6 neighbors on Arch Linux with kernel 3.0.3
kernel config includes CONFIG_IPV6_ROUTER_PREF

The problem affects routing when the neighbor loses connectivity and returns.

Scenario: Using a default route and a static route through different interfaces.
When the neighbor gateway of the static route goes down the traffic
will move to the default gateway as expected.
Once the static route neighbor comes back up it is not asked for
neighbor solicitation
because the route is marked as FAILED and the traffic will continue to
pass through the default gateway.

Steps to reproduce the route not being removed:
1. add an IPv6 address on an interface
2. add a route to a network through a gateway on the interface's network
3. make sure the gateway address is not reachable
4. ping6 a host in the route network
5. "ip -6 nei" will show the gateway neighbor as FAILED and it won't be released

Steps to reproduce the routing problem:
1. client and two gateway machines (A and B)
2. on the client define a static route through A and a default route through B
3. disconnect eth on A
4. ping6 a host in the network that should go through A
   after a while the traffic will move through B which is the default route
5. reconnect eth on A
6. ping6 a host in the network again, the traffic will still go through B
   "ip -6 nei" on the client will still show A as FAILED

Patch to change the nud state to NONE if the reference count > 1 --
allowing the neighbor to be released in a future pass.

--- linux/net/core/neighbour.c.orig	2011-08-19 13:15:35.041524169 +0300
+++ linux/net/core/neighbour.c	2011-08-19 18:19:07.271276096 +0300
@@ -802,14 +802,16 @@ static void neigh_periodic_work(struct w
 			if (time_before(n->used, n->confirmed))
 				n->used = n->confirmed;

-			if (atomic_read(&n->refcnt) == 1 &&
-			    (state == NUD_FAILED ||
+			if ((state == NUD_FAILED ||
 			     time_after(jiffies, n->used + n->parms->gc_staletime))) {
-				*np = n->next;
-				n->dead = 1;
-				write_unlock(&n->lock);
-				neigh_cleanup_and_release(n);
-				continue;
+				if (atomic_read(&n->refcnt) == 1) {
+					*np = n->next;
+					n->dead = 1;
+					write_unlock(&n->lock);
+					neigh_cleanup_and_release(n);
+					continue;
+				} else
+					n->nud_state = NUD_NONE;
 			}
 			write_unlock(&n->lock);

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-08-22 21:17 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-19 18:53 [RFC][PATCH] Failed neighbors attached to routes are not released Guy Yur
2011-08-22 16:44 ` Eric Dumazet
2011-08-22 19:49   ` Guy Yur
2011-08-22 21:17     ` Guy Yur
  -- strict thread matches above, loose matches on Subject: below --
2011-08-19 18:08 Guy Yur
2011-08-19 18:41 ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.