All of lore.kernel.org
 help / color / mirror / Atom feed
* Stale entries in RT_TABLE_LOCAL
@ 2011-02-23 17:43 Alex Sidorenko
  2011-03-09 21:53 ` David Miller
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Sidorenko @ 2011-02-23 17:43 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 6094 bytes --]

Hello,

I have found several scenarios when after deleting IP-address from an 
interface there is a stale entry left in RT_TABLE_LOCAL.

All these scenarios use the fact that it is possible to add the same address 
multiple times to the same interface using different masks.

Let us do the following using dummy0 interface:

ifconfig dummy0 192.168.140.31 netmask 255.255.252.0
ip addr add 192.168.142.109/23 dev dummy0
ip addr add 192.168.142.109/22 dev dummy0
ip addr del 192.168.142.109/22 dev dummy0
ip addr del 192.168.142.109/23 dev dummy0

We add 192.168.142.109/23 and 192.168.142.109/22, then delete them (order is 
important). After that, 192.168.142.109 is not in 'ip addr ls' but there are 
entries using this addr in RT_TABLE_LOCAL.

An attached script demonstrates the problem:

{asid 14:00:57} sudo sh iptest.sh
Tables before the test
13: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether 5e:1a:fa:44:90:f6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.140.31/22 brd 192.168.143.255 scope global dummy0
    inet6 fe80::5c1a:faff:fe44:90f6/64 scope link 
       valid_lft forever preferred_lft forever
 
local 192.168.140.31 dev dummy0  proto kernel  scope host  src 192.168.140.31 
broadcast 192.168.140.0 dev dummy0  proto kernel  scope link  src 
192.168.140.31 
broadcast 192.168.143.255 dev dummy0  proto kernel  scope link  src
192.168.140.31 
----------------------
Tables after the test
13: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether 5e:1a:fa:44:90:f6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.140.31/22 brd 192.168.143.255 scope global dummy0
    inet6 fe80::5c1a:faff:fe44:90f6/64 scope link 
       valid_lft forever preferred_lft forever
 
local 192.168.140.31 dev dummy0  proto kernel  scope host  src 192.168.140.31 
local 192.168.142.109 dev dummy0  proto kernel  scope host  src 192.168.140.31 
broadcast 192.168.143.255 dev dummy0  proto kernel  scope link  src
192.168.140.31 
broadcast 192.168.143.255 dev dummy0  proto kernel  scope link  src
192.168.142.109 


As you see, even though there is no 192.168.142.109 on dummy0 address list, 
the entries referring to this addr are still present in RT_TABLE_LOCAL.

Another scenario (adding/deleting two addresses, each one twice with different 
mask) can lead to stale entries cross-referencing each other, like

local 192.168.5.8  proto kernel  scope host  src 192.168.5.9 
local 192.168.5.9  proto kernel  scope host  src 192.168.5.8 

Analysis
--------

Both scenarios use the fact that we can add the same address multiple times to 
the same interface, using different masks.

1. When we delete an IP addr, we remove it from the interface addr list and 
send a notifier to routing code (fib_del_ifaddr) asking to delete the 
associated routes.

2. When we enter fib_del_ifaddr(struct in_ifaddr *ifa), the address is already
deleted. But if we add the same IP twice (with different masks), the same
address (even though with different prefix) is present two times. So after the
first deletion we still have its 2nd instance on the list.

3. We do the following in fib_del_ifaddr():

         for (ifa1 = in_dev->ifa_list; ifa1; ifa1 = ifa1->ifa_next) {
                 if (ifa->ifa_local == ifa1->ifa_local)
                         ok |= LOCAL_OK;
                 if (ifa->ifa_broadcast == ifa1->ifa_broadcast)
                         ok |= BRD_OK;
                 if (brd == ifa1->ifa_broadcast)
                         ok |= BRD1_OK;
                 if (any == ifa1->ifa_broadcast)
                         ok |= BRD0_OK;
         }

That is, we loop on all addrs of the interface (in_dev->ifa_list) and compare
address we have just deleted (passed in 'ifa') with addresses on the list. 

As we compare them without taking prefix (mask) into account, the following 
will be true:

ifa->ifa_local == ifa1->ifa_local
ifa->ifa_broadcast == ifa1->ifa_broadcast

4. As a result, after deleting the first instance of IP (192.168.142.109/22) 
we still have  192.168.142.109/23 on the list. The routing code will find that 
this addr (and broadcast) are still present on the list and will not delete 
the routes.

5. When we delete the second time (192.168.142.109/23), there will be no
192.168.142.109 on the list anymore and the routing code will delete the route 
- but only one out of two entries.

How this can be fixed
---------------------

I am not sure what is the best way to fix this, I can think of several 
approaches:

  (a) change the sources so that it would be impossible to add the same IP 
      multiple times, even with different masks. I cannot think of any
      situation where adding the same IP (but with different mask) to the same
      interface could be useful. But maybe I am wrong?

  (b) improve the deletion algorithm in fib_del_ifaddr()

  (c) add a periodic cleanup that will purge all entries from 'local' table if
      there are no corresponding IPs on the interface list


Impact
------

Stale entries in RT_TABLE_LOCAL make ARP reply to requests for that IPs, even 
though these IPs do not belong to any interface.

These scenarios might seem a bit pathological, but in reality they are 
possible on clusters with multiple addresses on several interfaces, where 
addresses are added/deleted for service migration. Address migration can be 
done both by software and by system administrators and if by mistake a wrong 
mask is used, we can get this situation.

And yes, one of HP customers met exactly this problem. They saw a 'duplicate 
IP' issue after migrating some services and found that the host replies to 
ARP-request even though 'ip addr ls' did not show this address. It is not 
common knowledge that ARP implementation uses RT_TABLE_LOCAL to decide whether 
IP is local, so they were unable to understand what is wrong.

Regards,
Alex

------------------------------------------------------------------
Alexandre Sidorenko             email: asid@hp.com
WTEC Linux                      Hewlett-Packard (Canada)
------------------------------------------------------------------

[-- Attachment #2: iptest.sh --]
[-- Type: application/x-shellscript, Size: 597 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Stale entries in RT_TABLE_LOCAL
  2011-02-23 17:43 Stale entries in RT_TABLE_LOCAL Alex Sidorenko
@ 2011-03-09 21:53 ` David Miller
  2011-03-10  0:04   ` Julian Anastasov
  0 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2011-03-09 21:53 UTC (permalink / raw)
  To: alexandre.sidorenko; +Cc: netdev, ja

From: Alex Sidorenko <alexandre.sidorenko@hp.com>
Date: Wed, 23 Feb 2011 12:43:23 -0500

> I am not sure what is the best way to fix this, I can think of several 
> approaches:
> 
>   (a) change the sources so that it would be impossible to add the same IP 
>       multiple times, even with different masks. I cannot think of any
>       situation where adding the same IP (but with different mask) to the same
>       interface could be useful. But maybe I am wrong?

I'm leaning towards this solution if it's viable.  But I'm not so sure that
nobody uses this feature, maybe Julian knows?

Julian, the issue is that if you add the same IP address multiple times using
different subnet masks, we allow it.

But removal doesn't work correctly, we clear the IFA list on the device but we
leave stale entries in the local routing table.

The test case is:

ip addr add 192.168.142.109/23 dev dummy0
ip addr add 192.168.142.109/22 dev dummy0
ip addr del 192.168.142.109/22 dev dummy0
ip addr del 192.168.142.109/23 dev dummy0

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Stale entries in RT_TABLE_LOCAL
  2011-03-09 21:53 ` David Miller
@ 2011-03-10  0:04   ` Julian Anastasov
  2011-03-10  1:50     ` Julian Anastasov
  0 siblings, 1 reply; 8+ messages in thread
From: Julian Anastasov @ 2011-03-10  0:04 UTC (permalink / raw)
  To: David Miller; +Cc: alexandre.sidorenko, netdev


 	Hello,

On Wed, 9 Mar 2011, David Miller wrote:

> From: Alex Sidorenko <alexandre.sidorenko@hp.com>
> Date: Wed, 23 Feb 2011 12:43:23 -0500
>
>> I am not sure what is the best way to fix this, I can think of several
>> approaches:
>>
>>   (a) change the sources so that it would be impossible to add the same IP
>>       multiple times, even with different masks. I cannot think of any
>>       situation where adding the same IP (but with different mask) to the same
>>       interface could be useful. But maybe I am wrong?
>
> I'm leaning towards this solution if it's viable.  But I'm not so sure that
> nobody uses this feature, maybe Julian knows?
>
> Julian, the issue is that if you add the same IP address multiple times using
> different subnet masks, we allow it.
>
> But removal doesn't work correctly, we clear the IFA list on the device but we
> leave stale entries in the local routing table.
>
> The test case is:
>
> ip addr add 192.168.142.109/23 dev dummy0
> ip addr add 192.168.142.109/22 dev dummy0

 	Here I have just one local route.

> ip addr del 192.168.142.109/22 dev dummy0
> ip addr del 192.168.142.109/23 dev dummy0

 	Hm, my 2.6.34 kernel has no such problem. I have
to do some tests with recent tree tomorrow. But fib_magic uses
NLM_F_CREATE | NLM_F_APPEND but even that can not overcome
the check for equal alias, so I don't think it is possible
two equal FIB aliases to be added. I expect only to see
2 local routes if the IP addresses are added to different
devices. May be for some reason we can not remove the single
local route when last IP is deleted.

 	Alex, what is the kernel version and can you test
it on different kernels? Also, do you really see 2 equal
local routes after the two adds?

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Stale entries in RT_TABLE_LOCAL
  2011-03-10  0:04   ` Julian Anastasov
@ 2011-03-10  1:50     ` Julian Anastasov
  2011-03-10 19:41       ` Alex Sidorenko
  0 siblings, 1 reply; 8+ messages in thread
From: Julian Anastasov @ 2011-03-10  1:50 UTC (permalink / raw)
  To: David Miller; +Cc: alexandre.sidorenko, netdev


 	Hello,

On Thu, 10 Mar 2011, Julian Anastasov wrote:

> On Wed, 9 Mar 2011, David Miller wrote:
>
>> From: Alex Sidorenko <alexandre.sidorenko@hp.com>
>> Date: Wed, 23 Feb 2011 12:43:23 -0500
>> 
>>> I am not sure what is the best way to fix this, I can think of several
>>> approaches:
>>>
>>>   (a) change the sources so that it would be impossible to add the same IP
>>>       multiple times, even with different masks. I cannot think of any
>>>       situation where adding the same IP (but with different mask) to the 
>>> same
>>>       interface could be useful. But maybe I am wrong?
>> 
>> I'm leaning towards this solution if it's viable.  But I'm not so sure that
>> nobody uses this feature, maybe Julian knows?
>> 
>> Julian, the issue is that if you add the same IP address multiple times 
>> using
>> different subnet masks, we allow it.
>> 
>> But removal doesn't work correctly, we clear the IFA list on the device but 
>> we
>> leave stale entries in the local routing table.
>> 
>> The test case is:
>> 
>> ip addr add 192.168.142.109/23 dev dummy0
>> ip addr add 192.168.142.109/22 dev dummy0
>
> 	Here I have just one local route.

 	Aha, it seems the problem happens when the
both lines are executed while there is another address on
device and the last added address becomes secondary
for the 1st, eg:

IP1: 192.168.140.31/22, primary
IP2: 192.168.142.109/23, primary
IP3: 192.168.142.109/22, secondary for primary IP1

 	It is the route for IP3 that is leaked, with prefsrc=IP1.
We create local route for secondary IPs with prefsrc=ItsPrimaryIP.
Both local routes for 109 differ in prefsrc (fa_info). But on
deletion only one route is deleted due to last_ip check - the first
because on deletion prefsrc is not matched, fib_table_delete
does not work in symmetric way. So, the local route created
for IP3 remains no matter the deletion order.
If we decide to create one unique local route for this case,
there is a risk device unregistration to remove
it (fib_sync_down_dev with force > 0). I have to think more
on this issue...

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Stale entries in RT_TABLE_LOCAL
  2011-03-10  1:50     ` Julian Anastasov
@ 2011-03-10 19:41       ` Alex Sidorenko
  2011-03-15  8:47         ` Julian Anastasov
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Sidorenko @ 2011-03-10 19:41 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David Miller, netdev

On March 9, 2011 08:50:27 PM Julian Anastasov wrote:
>         Aha, it seems the problem happens when the
> both lines are executed while there is another address on
> device and the last added address becomes secondary
> for the 1st, eg:

Hi Julian,

this is correct - you need another address on device. I used iptest.sh script 
for testing as it is easier and less error-prone than adding addresses 
manually. In additon, it reinserts 'dummy' module so that every time we start 
with a clean slate.

>         Alex, what is the kernel version and can you test
> 
> it on different kernels?

I tested this on SLES10 (2.6.16 kernel), RHEL5 (2.6.18 kernel) and 
Ubuntu/Maverick (2.6.35 kernel)

> 
> IP1: 192.168.140.31/22, primary
> IP2: 192.168.142.109/23, primary
> IP3: 192.168.142.109/22, secondary for primary IP1
> 
>         It is the route for IP3 that is leaked, with prefsrc=IP1.
> We create local route for secondary IPs with prefsrc=ItsPrimaryIP.
> Both local routes for 109 differ in prefsrc (fa_info). But on
> deletion only one route is deleted due to last_ip check - the first
> because on deletion prefsrc is not matched, fib_table_delete
> does not work in symmetric way. So, the local route created
> for IP3 remains no matter the deletion order.
> If we decide to create one unique local route for this case,
> there is a risk device unregistration to remove
> it (fib_sync_down_dev with force > 0). 

Yes, but do we _really_ need to be able to add the same IP multiple times 
(with different masks)? If not, we can just return an error when someine tries 
to add the same IP again with different mask.

Regards,
Alex

-- 
------------------------------------------------------------------
Alexandre Sidorenko             email: asid@hp.com
WTEC Linux			Hewlett-Packard (Canada)
------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Stale entries in RT_TABLE_LOCAL
  2011-03-10 19:41       ` Alex Sidorenko
@ 2011-03-15  8:47         ` Julian Anastasov
  2011-03-17 16:11           ` Jiri Bohac
  0 siblings, 1 reply; 8+ messages in thread
From: Julian Anastasov @ 2011-03-15  8:47 UTC (permalink / raw)
  To: Alex Sidorenko; +Cc: David Miller, netdev


 	Hello,

On Thu, 10 Mar 2011, Alex Sidorenko wrote:

>> IP1: 192.168.140.31/22, primary
>> IP2: 192.168.142.109/23, primary
>> IP3: 192.168.142.109/22, secondary for primary IP1
>>
>>         It is the route for IP3 that is leaked, with prefsrc=IP1.
>> We create local route for secondary IPs with prefsrc=ItsPrimaryIP.
>> Both local routes for 109 differ in prefsrc (fa_info). But on
>> deletion only one route is deleted due to last_ip check - the first
>> because on deletion prefsrc is not matched, fib_table_delete
>> does not work in symmetric way. So, the local route created
>> for IP3 remains no matter the deletion order.
>> If we decide to create one unique local route for this case,
>> there is a risk device unregistration to remove
>> it (fib_sync_down_dev with force > 0).
>
> Yes, but do we _really_ need to be able to add the same IP multiple times
> (with different masks)? If not, we can just return an error when someine tries
> to add the same IP again with different mask.

 	Yes, it seems one IP can be part of different subnets,
even with different scope. We can create compatibility
problems for users that use it in this way. I found more
issues with the secondary address promotion and broadcast
address deletion, it will take 2-3 days to prepare some
patches and test script for all these problems.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Stale entries in RT_TABLE_LOCAL
  2011-03-15  8:47         ` Julian Anastasov
@ 2011-03-17 16:11           ` Jiri Bohac
  2011-03-18 22:57             ` Julian Anastasov
  0 siblings, 1 reply; 8+ messages in thread
From: Jiri Bohac @ 2011-03-17 16:11 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Alex Sidorenko, David Miller, netdev

Hi,

On Tue, Mar 15, 2011 at 10:47:20AM +0200, Julian Anastasov wrote:
> On Thu, 10 Mar 2011, Alex Sidorenko wrote:
> 
> >>IP1: 192.168.140.31/22, primary
> >>IP2: 192.168.142.109/23, primary
> >>IP3: 192.168.142.109/22, secondary for primary IP1
> >>
> >>        It is the route for IP3 that is leaked, with prefsrc=IP1.
> >>We create local route for secondary IPs with prefsrc=ItsPrimaryIP.
> >>Both local routes for 109 differ in prefsrc (fa_info)

Is there any reason to set the prefsrc of a local route to the
primary IP address of the subnet?

I tried the following patch:

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 1d2cdd4..2046b21 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -751,7 +751,7 @@ void fib_add_ifaddr(struct in_ifaddr *ifa)
 		}
 	}
 
-	fib_magic(RTM_NEWROUTE, RTN_LOCAL, addr, 32, prim);
+	fib_magic(RTM_NEWROUTE, RTN_LOCAL, addr, 32, ifa);
 
 	if (!(dev->flags & IFF_UP))
 		return;


The result with the teststcase mentioned previously is that only
one local route is created per IP address. The local routes are
correctly deleted after both identical IP addresses are removed
from the interface.

Furthemore, the testcase uncovers another weirdness with the
prefsrc of the local routes. When a primary IP address is deleted
and a secondary IP address is promoted to primary, its prefsrc is
not updated.

What is the prefrc of a local route good for?

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Stale entries in RT_TABLE_LOCAL
  2011-03-17 16:11           ` Jiri Bohac
@ 2011-03-18 22:57             ` Julian Anastasov
  0 siblings, 0 replies; 8+ messages in thread
From: Julian Anastasov @ 2011-03-18 22:57 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Alex Sidorenko, David Miller, netdev


 	Hello,

On Thu, 17 Mar 2011, Jiri Bohac wrote:

>>>> IP2: 192.168.142.109/23, primary
>>>> IP3: 192.168.142.109/22, secondary for primary IP1
>>>>
>>>>        It is the route for IP3 that is leaked, with prefsrc=IP1.
>>>> We create local route for secondary IPs with prefsrc=ItsPrimaryIP.
>>>> Both local routes for 109 differ in prefsrc (fa_info)
>
> Is there any reason to set the prefsrc of a local route to the
> primary IP address of the subnet?
>
> I tried the following patch:
>
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 1d2cdd4..2046b21 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -751,7 +751,7 @@ void fib_add_ifaddr(struct in_ifaddr *ifa)
> 		}
> 	}
>
> -	fib_magic(RTM_NEWROUTE, RTN_LOCAL, addr, 32, prim);
> +	fib_magic(RTM_NEWROUTE, RTN_LOCAL, addr, 32, ifa);
>
> 	if (!(dev->flags & IFF_UP))
> 		return;
>
>
> The result with the teststcase mentioned previously is that only
> one local route is created per IP address. The local routes are
> correctly deleted after both identical IP addresses are removed
> from the interface.

 	It is a problem also for the broadcast addresses.

> Furthemore, the testcase uncovers another weirdness with the
> prefsrc of the local routes. When a primary IP address is deleted
> and a secondary IP address is promoted to primary, its prefsrc is
> not updated.

 	Yes, I see the same when the deleted primary is
also in another subnet or device.

> What is the prefrc of a local route good for?

 	To prefer this src when talking to local IP :)
You can always add local IPs with /32 mask but currently
you must add it first, i.e. fib_magic does not remove
the NLM_F_APPEND flag for /32 local routes when dst is
same as ifa->ifa_local. So, the order of IP adding
will determine the order of local routes:

ip addr add 1.2.3.4/24 brd + dev eth1
# secondary
ip addr add 1.2.3.5/24 brd + dev eth1
# primary:
ip addr add 1.2.3.5/32 brd + dev eth1

when talking to 1.2.3.5 you will use 1.2.3.4 because
the secondary is added first.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-03-18 22:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-23 17:43 Stale entries in RT_TABLE_LOCAL Alex Sidorenko
2011-03-09 21:53 ` David Miller
2011-03-10  0:04   ` Julian Anastasov
2011-03-10  1:50     ` Julian Anastasov
2011-03-10 19:41       ` Alex Sidorenko
2011-03-15  8:47         ` Julian Anastasov
2011-03-17 16:11           ` Jiri Bohac
2011-03-18 22:57             ` Julian Anastasov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.