linux-kernel.vger.kernel.org archive mirror
* 2.6.29.1: unregister_netdevice problem
@ 2009-04-22  5:57 Alexander V. Lukyanov
  2009-04-23 23:30 ` Eric W. Biederman
  2009-04-27  5:41 ` Alexander V. Lukyanov
  0 siblings, 2 replies; 17+ messages in thread
From: Alexander V. Lukyanov @ 2009-04-22  5:57 UTC (permalink / raw)
  To: linux-kernel

From time to time the load average rises for no apparent reason.
When I reboot the server in such a case, I get endlessly repeating
messages on the console:

unregister_netdevice: waiting for eth0.2 to become free. Usage count = 4

eth0.2 is a vlan interface, eth0 is 02:00.0 Ethernet controller: Realtek
Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet
controller (rev 01)

-- 
   Alexander.


* Re: 2.6.29.1: unregister_netdevice problem
  2009-04-22  5:57 2.6.29.1: unregister_netdevice problem Alexander V. Lukyanov
@ 2009-04-23 23:30 ` Eric W. Biederman
  2009-04-24 21:16   ` Bruno Prémont
  2009-04-27  5:41 ` Alexander V. Lukyanov
  1 sibling, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2009-04-23 23:30 UTC (permalink / raw)
  To: Alexander V. Lukyanov; +Cc: linux-kernel, netdev

"Alexander V. Lukyanov" <lav@netis.ru> writes:

> Eventually I have an increased load average without apparent reason.
> When I reboot the server in such a case, I get infinitely repeating
> messages on the console:
>
> unregister_netdevice: waiting for eth0.2 to become free. Usage count = 4
>
> eth0.2 is a vlan interface, eth0 is 02:00.0 Ethernet controller: Realtek
> Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet
> controller (rev 01)

CC: netdev where someone might have a better clue.

Infinitely repeating unregister_netdevice messages mean that something
isn't releasing its reference to your network device.

There really isn't enough information in your email to figure out
what you were doing, or what piece of code triggered this.
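
For readers unfamiliar with the message, here is a minimal user-space
sketch of the idea, not the kernel code. The names toy_netdev,
toy_dev_hold, toy_dev_put and toy_wait_allrefs are hypothetical, and
the poll limit exists only so that the example terminates.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a network device; only the count matters. */
struct toy_netdev {
	const char *name;
	int refcnt;
};

/* Every user of the device takes a reference and must drop it later. */
static void toy_dev_hold(struct toy_netdev *dev) { dev->refcnt++; }
static void toy_dev_put(struct toy_netdev *dev)  { dev->refcnt--; }

/*
 * Poll until every reference is dropped or max_polls is reached.
 * Returns true if the device became free. The bound is an assumption
 * of this sketch; the thread reports that the real message repeats
 * without end when a reference has been leaked.
 */
static bool toy_wait_allrefs(const struct toy_netdev *dev, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (dev->refcnt == 0)
			return true;
		printf("unregister_netdevice: waiting for %s to become free. "
		       "Usage count = %d\n", dev->name, dev->refcnt);
	}
	return dev->refcnt == 0;
}
```

If some code path calls toy_dev_hold() without a matching
toy_dev_put(), the count never reaches zero and the wait cannot
complete, which is the situation described above.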

Eric


* Re: 2.6.29.1: unregister_netdevice problem
  2009-04-23 23:30 ` Eric W. Biederman
@ 2009-04-24 21:16   ` Bruno Prémont
  0 siblings, 0 replies; 17+ messages in thread
From: Bruno Prémont @ 2009-04-24 21:16 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Alexander V. Lukyanov, linux-kernel, netdev

On Thu, 23 April 2009 ebiederm@xmission.com (Eric W. Biederman) wrote:
> "Alexander V. Lukyanov" <lav@netis.ru> writes:
> 
> > Eventually I have an increased load average without apparent reason.
> > When I reboot the server in such a case, I get infinitely repeating
> > messages on the console:
> >
> > unregister_netdevice: waiting for eth0.2 to become free. Usage
> > count = 4
> >
> > eth0.2 is a vlan interface, eth0 is 02:00.0 Ethernet controller:
> > Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit
> > Ethernet controller (rev 01)
> 
> CC: netdev where someone might have a better clue.
> 
> Infinitely repeating unregister_netdevice messages means something
> isn't releasing it's reference count to your network device.
> 
> There really isn't enough information in your email to figure out
> what you were doing that, or what piece of code triggered this.


A few similar cases I have encountered are related to:
  vlan, netconsole

If you attempt to rmmod the driver of a network interface that has
a vlan or netconsole set up on top of it, you end up with this
kind of lock.

At least the two above do not react to removal notifications
and thus keep the network device referenced.

Bruno


* Re: 2.6.29.1: unregister_netdevice problem
  2009-04-22  5:57 2.6.29.1: unregister_netdevice problem Alexander V. Lukyanov
  2009-04-23 23:30 ` Eric W. Biederman
@ 2009-04-27  5:41 ` Alexander V. Lukyanov
  2009-04-28 12:57   ` Alexander V. Lukyanov
  1 sibling, 1 reply; 17+ messages in thread
From: Alexander V. Lukyanov @ 2009-04-27  5:41 UTC (permalink / raw)
  To: linux-kernel, linux-netdev

On Wed, Apr 22, 2009 at 09:57:35AM +0400, Alexander V. Lukyanov wrote:
> unregister_netdevice: waiting for eth0.2 to become free. Usage count = 4
> 
> eth0.2 is a vlan interface, eth0 is 02:00.0 Ethernet controller: Realtek
> Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet
> controller (rev 01)

OK, now I manually ran 'vconfig rem eth0.2' and I get these repeated messages.
How do I find out what exactly is still holding the interface?

I have killed most non-kernel processes, but eth0.2 is still in use. The load
average is 1, but the CPU is idle.

top - 09:34:26 up 4 days, 23:55,  2 users,  load average: 1.00, 1.12, 3.55
Tasks:  56 total,   1 running,  55 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3356308k total,  1165044k used,  2191264k free,   443536k buffers
Swap:  3212920k total,      712k used,  3212208k free,   493568k cached

-- 
   Alexander..


* Re: 2.6.29.1: unregister_netdevice problem
  2009-04-27  5:41 ` Alexander V. Lukyanov
@ 2009-04-28 12:57   ` Alexander V. Lukyanov
  2009-04-28 20:49     ` Jarek Poplawski
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander V. Lukyanov @ 2009-04-28 12:57 UTC (permalink / raw)
  To: linux-kernel, netdev

On Mon, Apr 27, 2009 at 09:41:03AM +0400, Alexander V. Lukyanov wrote:
> On Wed, Apr 22, 2009 at 09:57:35AM +0400, Alexander V. Lukyanov wrote:
> > unregister_netdevice: waiting for eth0.2 to become free. Usage count = 4
> > 
> > eth0.2 is a vlan interface, eth0 is 02:00.0 Ethernet controller: Realtek
> > Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet
> > controller (rev 01)
> 
> Ok, now I did manually 'vconfig rem eth0.2' and I get these repeated messages.
> How do I find out what exactly holds the interface being used?

BTW, it looks like all vlan interfaces (I have many of them) have a similar
problem when it happens, which is every few days:

unregister_netdevice: waiting for eth0.907 to become free. Usage count = 20

and while one `vconfig rem' is running, I cannot run another one. It s(t)ucks.

-- 
   Alexander..


* Re: 2.6.29.1: unregister_netdevice problem
  2009-04-28 12:57   ` Alexander V. Lukyanov
@ 2009-04-28 20:49     ` Jarek Poplawski
  2009-04-29  5:45       ` Alexander V. Lukyanov
  0 siblings, 1 reply; 17+ messages in thread
From: Jarek Poplawski @ 2009-04-28 20:49 UTC (permalink / raw)
  To: Alexander V. Lukyanov; +Cc: linux-kernel, netdev

Alexander V. Lukyanov wrote, On 04/28/2009 02:57 PM:

> On Mon, Apr 27, 2009 at 09:41:03AM +0400, Alexander V. Lukyanov wrote:
>> On Wed, Apr 22, 2009 at 09:57:35AM +0400, Alexander V. Lukyanov wrote:
>>> unregister_netdevice: waiting for eth0.2 to become free. Usage count = 4
>>>
>>> eth0.2 is a vlan interface, eth0 is 02:00.0 Ethernet controller: Realtek
>>> Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet
>>> controller (rev 01)
>> Ok, now I did manually 'vconfig rem eth0.2' and I get these repeated messages.
>> How do I find out what exactly holds the interface being used?
> 
> BTW, it looks like all vlan interfaces (I have many of them) have similar
> problem, when it happens - every few days:
> 
> unregister_netdevice: waiting for eth0.907 to become free. Usage count = 20
> 
> and when I run a `vconfig rem', I cannot run another one. It s(t)ucks.
> 


Do you mean if you wait a bit longer (until the first one is really removed)
before running another one, it doesn't s(t)uck? Is there a change e.g. wrt.
2.6.28?

Jarek P.


* Re: 2.6.29.1: unregister_netdevice problem
  2009-04-28 20:49     ` Jarek Poplawski
@ 2009-04-29  5:45       ` Alexander V. Lukyanov
  2009-04-29  9:08         ` Jarek Poplawski
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander V. Lukyanov @ 2009-04-29  5:45 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: linux-kernel, netdev

On Tue, Apr 28, 2009 at 10:49:58PM +0200, Jarek Poplawski wrote:
> Do you mean if you wait a bit longer (until the first one is really removed)

The problem is that it is never removed. I waited for at least 30 minutes.

> before running another one, it doesn't s(t)uck? Is there a change e.g. wrt.
> 2.6.28?

I mean, 'vconfig rem' on another interface gets stuck if the previous vconfig
has not finished, and it never finishes. So I cannot check if other vlan
interfaces have the same 'refcnt' problem. So far I have tried 'vconfig rem'
on two vlan interfaces, and I have a dozen.

Again, the problem only happens after I notice surprisingly high LA (30
instead of 2-6) and interactive slowness. When it happens, I check network
usage, disk usage, cpu usage, mem usage - they are normal or even lower than
usual.

The server is running transparent squid and named. The last kernel version
was 2.6.27.21 and it did not have this problem.

-- 
   Alexander.


* Re: 2.6.29.1: unregister_netdevice problem
  2009-04-29  5:45       ` Alexander V. Lukyanov
@ 2009-04-29  9:08         ` Jarek Poplawski
  2009-05-08  6:26           ` Alexander V. Lukyanov
  0 siblings, 1 reply; 17+ messages in thread
From: Jarek Poplawski @ 2009-04-29  9:08 UTC (permalink / raw)
  To: Alexander V. Lukyanov; +Cc: linux-kernel, netdev

On Wed, Apr 29, 2009 at 09:45:10AM +0400, Alexander V. Lukyanov wrote:
> On Tue, Apr 28, 2009 at 10:49:58PM +0200, Jarek Poplawski wrote:
> > Do you mean if you wait a bit longer (until the first one is really removed)
> 
> The problem is that it is never removed. I waited for at least 30 minutes.
> 
> > before running another one, it doesn't s(t)uck? Is there a change e.g. wrt.
> > 2.6.28?
> 
> I mean, 'vconfig rem' on another interface stucks if the previous vconfig
> has not finished, and it never finishes. So I cannot check if other vlan
> interfaces have the same 'refcnt' problem. So far I tried 'vconfig rem' on
> two vlan interfaces and I have a dozen.
> 
> Again, the problem only happens after I notice surprisingly high LA (30
> instead of 2-6) and interactive slowness. When it happens, I check network
> usage, disk usage, cpu usage, mem usage - they are normal or even lower than
> usual.
> 
> The server is running transparent squid and named. The last kernel version
> was 2.6.27.21 and it did not have this problem.

So it looks like a regression. Alas, this could be hard to debug and
more data is still needed. For a start, maybe: .config, dmesg,
and a few SysRq logs while this happens, e.g. Alt-PrtScr with t, d, w, q
(gzipped or as attachments to a bugzilla report). (If it's not a big
problem, trying 2.6.28.9 could be helpful too.)

Thanks,
Jarek P.


* Re: 2.6.29.1: unregister_netdevice problem
  2009-04-29  9:08         ` Jarek Poplawski
@ 2009-05-08  6:26           ` Alexander V. Lukyanov
  2009-05-08 10:46             ` Jarek Poplawski
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander V. Lukyanov @ 2009-05-08  6:26 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 484 bytes --]

On Wed, Apr 29, 2009 at 11:08:09AM +0200, Jarek Poplawski wrote:
> So looks like a regression. Alas this thing could be hard to debug and
> still more data is needed. For the beginning maybe: .config, dmesg,
> and a few SysRq logs while this happens e.g. Alt-PrtScr with t, d, w, q
> (gzipped or as attachments to a bugzilla report). (If it's not a big
> problem trying 2.6.28.9 could be helpful too.)

It happened again with 2.6.29.2. Here is the requested info.

-- 
   Alexander..

[-- Attachment #2: report.tar.lzma --]
[-- Type: application/octet-stream, Size: 49395 bytes --]


* Re: 2.6.29.1: unregister_netdevice problem
  2009-05-08  6:26           ` Alexander V. Lukyanov
@ 2009-05-08 10:46             ` Jarek Poplawski
  2009-05-10  7:35               ` Alexander V. Lukyanov
  0 siblings, 1 reply; 17+ messages in thread
From: Jarek Poplawski @ 2009-05-08 10:46 UTC (permalink / raw)
  To: Alexander V. Lukyanov; +Cc: linux-kernel, netdev

On Fri, May 08, 2009 at 10:26:40AM +0400, Alexander V. Lukyanov wrote:
> On Wed, Apr 29, 2009 at 11:08:09AM +0200, Jarek Poplawski wrote:
> > So looks like a regression. Alas this thing could be hard to debug and
> > still more data is needed. For the beginning maybe: .config, dmesg,
> > and a few SysRq logs while this happens e.g. Alt-PrtScr with t, d, w, q
> > (gzipped or as attachments to a bugzilla report). (If it's not a big
> > problem trying 2.6.28.9 could be helpful too.)
> 
> It happened again with 2.6.29.2. Here is the requested info.

I can't see anything suspicious for now, except these UDP and TCP
warnings. Did you see similar messages with 2.6.27? Btw., could this
eth0.987 be "connected" with any of this traffic? (IP# ?)

Thanks,
Jarek P.


* Re: 2.6.29.1: unregister_netdevice problem
  2009-05-08 10:46             ` Jarek Poplawski
@ 2009-05-10  7:35               ` Alexander V. Lukyanov
  2009-05-10 12:46                 ` Jarek Poplawski
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander V. Lukyanov @ 2009-05-10  7:35 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: linux-kernel, netdev

On Fri, May 08, 2009 at 10:46:28AM +0000, Jarek Poplawski wrote:
> I can't see anything suspicious for now, except these UDP and TCP
> warnings. Did you see similar messages with 2.6.27? Btw., could this

Yes. Such messages show up with any kernel version.

> eth0.987 be "connected" with any of this traffic? (IP# ?)

No, eth0.987 is only used for traffic output.

BTW, it seems that only actively used vlan interfaces have the problem
(even when the traffic stops). Other vlan interfaces with little traffic
can be removed with no problems.

-- 
   Alexander..


* Re: 2.6.29.1: unregister_netdevice problem
  2009-05-10  7:35               ` Alexander V. Lukyanov
@ 2009-05-10 12:46                 ` Jarek Poplawski
  2009-05-15  7:19                   ` Alexander V. Lukyanov
  0 siblings, 1 reply; 17+ messages in thread
From: Jarek Poplawski @ 2009-05-10 12:46 UTC (permalink / raw)
  To: Alexander V. Lukyanov; +Cc: linux-kernel, netdev

On Sun, May 10, 2009 at 11:35:59AM +0400, Alexander V. Lukyanov wrote:
> On Fri, May 08, 2009 at 10:46:28AM +0000, Jarek Poplawski wrote:
> > I can't see anything suspicious for now, except these UDP and TCP
> > warnings. Did you see similar messages with 2.6.27? Btw., could this
> 
> Yes. Such messages show up with any kernel version.
> 
> > eth0.987 be "connected" with any of this traffic? (IP# ?)
> 
> No, eth0.987 is only used for traffic output.
> 
> BTW, it seems that only actively used vlan interfaces have the problem
> (even when the traffic stops). Other vlan interfaces with little traffic
> can be removed with no problems.

OK, I'll try to look around this, but how about trying 2.6.28.10 in
the meantime? It could limit "a bit" the number of places/lines.

Jarek P.


* Re: 2.6.29.1: unregister_netdevice problem
  2009-05-10 12:46                 ` Jarek Poplawski
@ 2009-05-15  7:19                   ` Alexander V. Lukyanov
  2009-05-15  8:06                     ` Jarek Poplawski
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander V. Lukyanov @ 2009-05-15  7:19 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: linux-kernel, netdev

On Sun, May 10, 2009 at 02:46:41PM +0200, Jarek Poplawski wrote:
> On Sun, May 10, 2009 at 11:35:59AM +0400, Alexander V. Lukyanov wrote:
> > On Fri, May 08, 2009 at 10:46:28AM +0000, Jarek Poplawski wrote:
> > > I can't see anything suspicious for now, except these UDP and TCP
> > > warnings. Did you see similar messages with 2.6.27? Btw., could this
> > 
> > Yes. Such messages show up with any kernel version.
> > 
> > > eth0.987 be "connected" with any of this traffic? (IP# ?)
> > 
> > No, eth0.987 is only used for traffic output.
> > 
> > BTW, it seems that only actively used vlan interfaces have the problem
> > (even when the traffic stops). Other vlan interfaces with little traffic
> > can be removed with no problems.
> 
> OK, I'll try to look around this, but how about trying 2.6.28.10 in
> the meantime? It could limit "a bit" the number of places/lines.

It looks like 2.6.28.10 does not have this refcnt problem. Also, after I
reverted the net/ipv4/route.c changes from 2.6.29, the problem does not occur
either.

-- 
   Alexander..


* Re: 2.6.29.1: unregister_netdevice problem
  2009-05-15  7:19                   ` Alexander V. Lukyanov
@ 2009-05-15  8:06                     ` Jarek Poplawski
  2009-05-15  8:54                       ` Alexander V. Lukyanov
  0 siblings, 1 reply; 17+ messages in thread
From: Jarek Poplawski @ 2009-05-15  8:06 UTC (permalink / raw)
  To: Alexander V. Lukyanov; +Cc: linux-kernel, netdev

On Fri, May 15, 2009 at 11:19:46AM +0400, Alexander V. Lukyanov wrote:
...
> Looks like 2.6.28.10 does not have this refcnt problem. Also, after I have
> reversed net/ipv4/route.c changes from 2.6.29, the problem does not occur either.

Very nice work! It should be a piece of cake now, since there were not many
changes. I'll look at it, as usual ;-), but of course, if it's
possible, it would be great to try "in the meantime" with the
biggest of them (appended below) reverted.

Thanks,
Jarek P.
---

commit 1080d709fb9d8cd4392f93476ee46a9d6ea05a5b
Author: Neil Horman <nhorman@tuxdriver.com>
Date:   Mon Oct 27 12:28:25 2008 -0700

    net: implement emergency route cache rebulds when gc_elasticity is exceeded
    
    This is a patch to provide on demand route cache rebuilding.  Currently, our
    route cache is rebulid periodically regardless of need.  This introduced
    unneeded periodic latency.  This patch offers a better approach.  Using code
    provided by Eric Dumazet, we compute the standard deviation of the average hash
    bucket chain length while running rt_check_expire.  Should any given chain
    length grow to larger that average plus 4 standard deviations, we trigger an
    emergency hash table rebuild for that net namespace.  This allows for the common
    case in which chains are well behaved and do not grow unevenly to not incur any
    latency at all, while those systems (which may be being maliciously attacked),
    only rebuild when the attack is detected.  This patch take 2 other factors into
    account:
    1) chains with multiple entries that differ by attributes that do not affect the
    hash value are only counted once, so as not to unduly bias system to rebuilding
    if features like QOS are heavily used
    2) if rebuilding crosses a certain threshold (which is adjustable via the added
    sysctl in this patch), route caching is disabled entirely for that net
    namespace, since constant rebuilding is less efficient that no caching at all
    
    Tested successfully by me.
    
    Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
    Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d849326..c771278 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -27,6 +27,12 @@ min_adv_mss - INTEGER
 	The advertised MSS depends on the first hop route MTU, but will
 	never be lower than this setting.
 
+rt_cache_rebuild_count - INTEGER
+	The per net-namespace route cache emergency rebuild threshold.
+	Any net-namespace having its route cache rebuilt due to
+	a hash bucket chain being too long more than this many times
+	will have its route caching disabled
+
 IP Fragmentation:
 
 ipfrag_high_thresh - INTEGER
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ece1c92..977f482 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -49,6 +49,8 @@ struct netns_ipv4 {
 	int sysctl_icmp_ratelimit;
 	int sysctl_icmp_ratemask;
 	int sysctl_icmp_errors_use_inbound_ifaddr;
+	int sysctl_rt_cache_rebuild_count;
+	int current_rt_cache_rebuild_count;
 
 	struct timer_list rt_secret_timer;
 	atomic_t rt_genid;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 2ea6dcc..21ce7e1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,6 +129,7 @@ static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
 static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
 static int ip_rt_secret_interval __read_mostly	= 10 * 60 * HZ;
+static int rt_chain_length_max __read_mostly	= 20;
 
 static void rt_worker_func(struct work_struct *work);
 static DECLARE_DELAYED_WORK(expires_work, rt_worker_func);
@@ -145,6 +146,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
 static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
 
 
 static struct dst_ops ipv4_dst_ops = {
@@ -201,6 +203,7 @@ const __u8 ip_tos2prio[16] = {
 struct rt_hash_bucket {
 	struct rtable	*chain;
 };
+
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
 	defined(CONFIG_PROVE_LOCKING)
 /*
@@ -674,6 +677,20 @@ static inline u32 rt_score(struct rtable *rt)
 	return score;
 }
 
+static inline bool rt_caching(const struct net *net)
+{
+	return net->ipv4.current_rt_cache_rebuild_count <=
+		net->ipv4.sysctl_rt_cache_rebuild_count;
+}
+
+static inline bool compare_hash_inputs(const struct flowi *fl1,
+					const struct flowi *fl2)
+{
+	return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
+		(fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
+		(fl1->iif ^ fl2->iif)) == 0);
+}
+
 static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
 {
 	return ((__force u32)((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
@@ -753,11 +770,24 @@ static void rt_do_flush(int process_context)
 	}
 }
 
+/*
+ * While freeing expired entries, we compute average chain length
+ * and standard deviation, using fixed-point arithmetic.
+ * This to have an estimation of rt_chain_length_max
+ *  rt_chain_length_max = max(elasticity, AVG + 4*SD)
+ * We use 3 bits for frational part, and 29 (or 61) for magnitude.
+ */
+
+#define FRACT_BITS 3
+#define ONE (1UL << FRACT_BITS)
+
 static void rt_check_expire(void)
 {
 	static unsigned int rover;
 	unsigned int i = rover, goal;
 	struct rtable *rth, **rthp;
+	unsigned long length = 0, samples = 0;
+	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
 
 	mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
@@ -766,6 +796,7 @@ static void rt_check_expire(void)
 	goal = (unsigned int)mult;
 	if (goal > rt_hash_mask)
 		goal = rt_hash_mask + 1;
+	length = 0;
 	for (; goal > 0; goal--) {
 		unsigned long tmo = ip_rt_gc_timeout;
 
@@ -775,6 +806,8 @@ static void rt_check_expire(void)
 		if (need_resched())
 			cond_resched();
 
+		samples++;
+
 		if (*rthp == NULL)
 			continue;
 		spin_lock_bh(rt_hash_lock_addr(i));
@@ -789,11 +822,29 @@ static void rt_check_expire(void)
 				if (time_before_eq(jiffies, rth->u.dst.expires)) {
 					tmo >>= 1;
 					rthp = &rth->u.dst.rt_next;
+					/*
+					 * Only bump our length if the hash
+					 * inputs on entries n and n+1 are not
+					 * the same, we only count entries on
+					 * a chain with equal hash inputs once
+					 * so that entries for different QOS
+					 * levels, and other non-hash input
+					 * attributes don't unfairly skew
+					 * the length computation
+					 */
+					if ((*rthp == NULL) ||
+					    !compare_hash_inputs(&(*rthp)->fl,
+								 &rth->fl))
+						length += ONE;
 					continue;
 				}
 			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
 				tmo >>= 1;
 				rthp = &rth->u.dst.rt_next;
+				if ((*rthp == NULL) ||
+				    !compare_hash_inputs(&(*rthp)->fl,
+							 &rth->fl))
+					length += ONE;
 				continue;
 			}
 
@@ -802,6 +853,15 @@ static void rt_check_expire(void)
 			rt_free(rth);
 		}
 		spin_unlock_bh(rt_hash_lock_addr(i));
+		sum += length;
+		sum2 += length*length;
+	}
+	if (samples) {
+		unsigned long avg = sum / samples;
+		unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
+		rt_chain_length_max = max_t(unsigned long,
+					ip_rt_gc_elasticity,
+					(avg + 4*sd) >> FRACT_BITS);
 	}
 	rover = i;
 }
@@ -851,6 +911,26 @@ static void rt_secret_rebuild(unsigned long __net)
 	mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
 }
 
+static void rt_secret_rebuild_oneshot(struct net *net)
+{
+	del_timer_sync(&net->ipv4.rt_secret_timer);
+	rt_cache_invalidate(net);
+	if (ip_rt_secret_interval) {
+		net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
+		add_timer(&net->ipv4.rt_secret_timer);
+	}
+}
+
+static void rt_emergency_hash_rebuild(struct net *net)
+{
+	if (net_ratelimit()) {
+		printk(KERN_WARNING "Route hash chain too long!\n");
+		printk(KERN_WARNING "Adjust your secret_interval!\n");
+	}
+
+	rt_secret_rebuild_oneshot(net);
+}
+
 /*
    Short description of GC goals.
 
@@ -989,6 +1069,7 @@ out:	return 0;
 static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
 {
 	struct rtable	*rth, **rthp;
+	struct rtable	*rthi;
 	unsigned long	now;
 	struct rtable *cand, **candp;
 	u32 		min_score;
@@ -1002,7 +1083,13 @@ restart:
 	candp = NULL;
 	now = jiffies;
 
+	if (!rt_caching(dev_net(rt->u.dst.dev))) {
+		rt_drop(rt);
+		return 0;
+	}
+
 	rthp = &rt_hash_table[hash].chain;
+	rthi = NULL;
 
 	spin_lock_bh(rt_hash_lock_addr(hash));
 	while ((rth = *rthp) != NULL) {
@@ -1048,6 +1135,17 @@ restart:
 		chain_length++;
 
 		rthp = &rth->u.dst.rt_next;
+
+		/*
+		 * check to see if the next entry in the chain
+		 * contains the same hash input values as rt.  If it does
+		 * This is where we will insert into the list, instead of
+		 * at the head.  This groups entries that differ by aspects not
+		 * relvant to the hash function together, which we use to adjust
+		 * our chain length
+		 */
+		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
+			rthi = rth;
 	}
 
 	if (cand) {
@@ -1061,6 +1159,16 @@ restart:
 			*candp = cand->u.dst.rt_next;
 			rt_free(cand);
 		}
+	} else {
+		if (chain_length > rt_chain_length_max) {
+			struct net *net = dev_net(rt->u.dst.dev);
+			int num = ++net->ipv4.current_rt_cache_rebuild_count;
+			if (!rt_caching(dev_net(rt->u.dst.dev))) {
+				printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
+					rt->u.dst.dev->name, num);
+			}
+			rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
+		}
 	}
 
 	/* Try to bind route to arp only if it is output
@@ -1098,7 +1206,11 @@ restart:
 		}
 	}
 
-	rt->u.dst.rt_next = rt_hash_table[hash].chain;
+	if (rthi)
+		rt->u.dst.rt_next = rthi->u.dst.rt_next;
+	else
+		rt->u.dst.rt_next = rt_hash_table[hash].chain;
+
 #if RT_CACHE_DEBUG >= 2
 	if (rt->u.dst.rt_next) {
 		struct rtable *trt;
@@ -1114,7 +1226,11 @@ restart:
 	 * previous writes to rt are comitted to memory
 	 * before making rt visible to other CPUS.
 	 */
-	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+	if (rthi)
+		rcu_assign_pointer(rthi->u.dst.rt_next, rt);
+	else
+		rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 	*rp = rt;
 	return 0;
@@ -1217,6 +1333,9 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 	    || ipv4_is_zeronet(new_gw))
 		goto reject_redirect;
 
+	if (!rt_caching(net))
+		goto reject_redirect;
+
 	if (!IN_DEV_SHARED_MEDIA(in_dev)) {
 		if (!inet_addr_onlink(in_dev, new_gw, old_gw))
 			goto reject_redirect;
@@ -2130,6 +2249,10 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	struct net *net;
 
 	net = dev_net(dev);
+
+	if (!rt_caching(net))
+		goto skip_cache;
+
 	tos &= IPTOS_RT_MASK;
 	hash = rt_hash(daddr, saddr, iif, rt_genid(net));
 
@@ -2154,6 +2277,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	}
 	rcu_read_unlock();
 
+skip_cache:
 	/* Multicast recognition logic is moved from route cache to here.
 	   The problem was that too many Ethernet cards have broken/missing
 	   hardware multicast filters :-( As result the host on multicasting
@@ -2539,6 +2663,9 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
 	unsigned hash;
 	struct rtable *rth;
 
+	if (!rt_caching(net))
+		goto slow_output;
+
 	hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
 
 	rcu_read_lock_bh();
@@ -2563,6 +2690,7 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
 	}
 	rcu_read_unlock_bh();
 
+slow_output:
 	return ip_route_output_slow(net, rp, flp);
 }
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1bb10df..0cc8d31 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -795,6 +795,14 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "rt_cache_rebuild_count",
+		.data		= &init_net.ipv4.sysctl_rt_cache_rebuild_count,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
 	{ }
 };
 
@@ -827,8 +835,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 			&net->ipv4.sysctl_icmp_ratelimit;
 		table[5].data =
 			&net->ipv4.sysctl_icmp_ratemask;
+		table[6].data =
+			&net->ipv4.sysctl_rt_cache_rebuild_count;
 	}
 
+	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
+
 	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
 			net_ipv4_ctl_path, table);
 	if (net->ipv4.ipv4_hdr == NULL)
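
The threshold that the commit message and the patch comment describe,
rt_chain_length_max = max(elasticity, AVG + 4*SD), can be tried out in
user space. What follows is a minimal sketch of that formula only, not
a line-by-line copy of rt_check_expire(). FRACT_BITS and ONE are taken
from the patch; toy_int_sqrt and toy_chain_length_max are hypothetical
names, and the per-bucket chain lengths are assumed to be given as
whole numbers.

```c
#include <stddef.h>

#define FRACT_BITS 3
#define ONE (1UL << FRACT_BITS)

/* Integer square root (floor); a simple stand-in for the kernel's int_sqrt(). */
static unsigned long toy_int_sqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/*
 * chain_lens[] holds one whole-number chain length per sampled bucket.
 * Returns max(elasticity, AVG + 4*SD), where AVG and SD are computed
 * in fixed point with FRACT_BITS fractional bits.
 */
static unsigned long toy_chain_length_max(const unsigned long *chain_lens,
					  size_t samples,
					  unsigned long elasticity)
{
	unsigned long sum = 0, sum2 = 0;

	if (samples == 0)
		return elasticity;

	for (size_t i = 0; i < samples; i++) {
		unsigned long length = chain_lens[i] * ONE; /* to fixed point */

		sum += length;
		sum2 += length * length;
	}

	/* Variance as mean of squares minus square of the mean. */
	unsigned long avg = sum / samples;
	unsigned long sd = toy_int_sqrt(sum2 / samples - avg * avg);
	unsigned long limit = (avg + 4 * sd) >> FRACT_BITS;

	return limit > elasticity ? limit : elasticity;
}
```

With evenly filled buckets the standard deviation is zero and the
elasticity value dominates; a single long chain among otherwise empty
buckets raises the limit instead.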


* Re: 2.6.29.1: unregister_netdevice problem
  2009-05-15  8:06                     ` Jarek Poplawski
@ 2009-05-15  8:54                       ` Alexander V. Lukyanov
  2009-05-15 10:12                         ` Jarek Poplawski
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander V. Lukyanov @ 2009-05-15  8:54 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: linux-kernel, netdev

On Fri, May 15, 2009 at 08:06:49AM +0000, Jarek Poplawski wrote:
> On Fri, May 15, 2009 at 11:19:46AM +0400, Alexander V. Lukyanov wrote:
> ...
> > Looks like 2.6.28.10 does not have this refcnt problem. Also, after I have
> > reversed net/ipv4/route.c changes from 2.6.29, the problem does not occur either.
> 
> Very nice work! It should be a piece of cake now - there were not much
> changes. I'll look at it, as usual ;-), but, of course if it's
> possible, it would be great to try "in the meantime" with reverted the
> biggest of them append below.

I suspect the problem occurs when cand==rthi.

-- 
   Alexander.


* Re: 2.6.29.1: unregister_netdevice problem
  2009-05-15  8:54                       ` Alexander V. Lukyanov
@ 2009-05-15 10:12                         ` Jarek Poplawski
  2009-05-16  7:06                           ` Jarek Poplawski
  0 siblings, 1 reply; 17+ messages in thread
From: Jarek Poplawski @ 2009-05-15 10:12 UTC (permalink / raw)
  To: Alexander V. Lukyanov; +Cc: linux-kernel, netdev

On Fri, May 15, 2009 at 12:54:14PM +0400, Alexander V. Lukyanov wrote:
> On Fri, May 15, 2009 at 08:06:49AM +0000, Jarek Poplawski wrote:
> > On Fri, May 15, 2009 at 11:19:46AM +0400, Alexander V. Lukyanov wrote:
> > ...
> > > Looks like 2.6.28.10 does not have this refcnt problem. Also, after I have
> > > reversed net/ipv4/route.c changes from 2.6.29, the problem does not occur either.
> > 
> > Very nice work! It should be a piece of cake now - there were not much
> > changes. I'll look at it, as usual ;-), but, of course if it's
> > possible, it would be great to try "in the meantime" with reverted the
> > biggest of them append below.
> 
> I suspect the problem occurs when cand==rthi.

Looks like a good catch! (But there could be more than this.) Then, of
course, it would be interesting to first try some fix (like nulling
rthi after rt_free(cand), I guess).
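
To make the suspected cand==rthi case concrete, here is a minimal
user-space sketch, assuming the tail of rt_intern_hash() behaves as in
the patch above. The names toy_rt, toy_on_chain and toy_intern are
hypothetical, freeing is modelled with a flag, and the null_rthi_fix
switch is only one possible form of the nulling idea, not a tested
kernel fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a route cache entry on one hash chain. */
struct toy_rt {
	struct toy_rt *next;
	bool freed;	/* models rt_free() without really freeing memory */
};

/* Is 'rt' reachable by walking the chain from 'head'? */
static bool toy_on_chain(const struct toy_rt *head, const struct toy_rt *rt)
{
	for (; head; head = head->next)
		if (head == rt)
			return true;
	return false;
}

/*
 * Mimics the end of rt_intern_hash(): evict the candidate that *candp
 * points to, then insert 'rt' after 'rthi' (or at the head of the
 * chain if rthi is NULL). With null_rthi_fix set, rthi is dropped
 * when it is the entry that was just evicted.
 */
static void toy_intern(struct toy_rt **head, struct toy_rt **candp,
		       struct toy_rt *rthi, struct toy_rt *rt,
		       bool null_rthi_fix)
{
	struct toy_rt *cand = candp ? *candp : NULL;

	if (cand) {
		*candp = cand->next;	/* unlink cand from the chain */
		cand->freed = true;
		if (null_rthi_fix && rthi == cand)
			rthi = NULL;
	}

	if (rthi) {
		rt->next = rthi->next;
		rthi->next = rt;	/* rthi may already be off the chain */
	} else {
		rt->next = *head;
		*head = rt;
	}
}
```

Without the fix, the new entry hangs off an entry that is no longer on
the chain, so nothing walking the chain can find it and expire it. If
such an entry holds a reference to the device, that reference would
never be dropped, which would match the symptom reported in this
thread.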

Jarek P.


* Re: 2.6.29.1: unregister_netdevice problem
  2009-05-15 10:12                         ` Jarek Poplawski
@ 2009-05-16  7:06                           ` Jarek Poplawski
  0 siblings, 0 replies; 17+ messages in thread
From: Jarek Poplawski @ 2009-05-16  7:06 UTC (permalink / raw)
  Cc: Alexander V. Lukyanov, linux-kernel, netdev

Jarek Poplawski wrote, On 05/15/2009 12:12 PM:

> On Fri, May 15, 2009 at 12:54:14PM +0400, Alexander V. Lukyanov wrote:
>> On Fri, May 15, 2009 at 08:06:49AM +0000, Jarek Poplawski wrote:
>>> On Fri, May 15, 2009 at 11:19:46AM +0400, Alexander V. Lukyanov wrote:
>>> ...
>>>> Looks like 2.6.28.10 does not have this refcnt problem. Also, after I have
>>>> reversed net/ipv4/route.c changes from 2.6.29, the problem does not occur either.
>>> Very nice work! It should be a piece of cake now - there were not much
>>> changes. I'll look at it, as usual ;-), but, of course if it's
>>> possible, it would be great to try "in the meantime" with reverted the
>>> biggest of them append below.
>> I suspect the problem occurs when cand==rthi.
> 
> Looks like good catch! (But there could be more than this.) Then, of
> course it would be interesting to try first some fix (like nulling
> rthi after rt_free(cand), I guess).


On the other hand, since this looks like a quite obvious bug (even if fixing
it doesn't solve your problem), feel free to send a patch here that fixes it
any way you like, without waiting for the final test results (Cc-ing the
authors of the offending patch, I hope).

Jarek P.
