* BUG: IPv6 stops working after a while, needs ip ne del command to reset
@ 2010-08-13 17:55 Thomas Habets
  2010-08-13 21:34 ` David Miller
  2010-08-16 10:19 ` Eric Dumazet
  0 siblings, 2 replies; 25+ messages in thread
From: Thomas Habets @ 2010-08-13 17:55 UTC (permalink / raw)
  To: linux-kernel


(originally sent to netdev on Aug 6th)

IPv6 initially works, but when I leave it alone overnight I'm unable to ping 
even my default gw.

Static global IPv6 addresses configured on both ends. No access lists on either 
end.

Kernel version: 2.6.35 mainline (amd64) and 2.6.33.6.
Kernel config: http://pastebin.com/raw.php?i=Y6S8iKW7
Dist: Debian Lenny (5.0.5), nothing special to my knowledge.

I seem to have the same issue that Mikael Abrahamsson encountered with Ubuntu 
kernels 2.6.26.3, 2.6.26-5-generic and 2.6.27-2-generic, and mainline kernels 
2.6.25, 2.6.26 and 2.6.27:
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263260

He got IPv6 running again without rebooting using "networking stop, ifconfig 
eth0 down, networking start, kill dhclient", while I narrowed it down to just 
deleting the ipv6 neighbor (ip ne del..., see below). Rebooting also causes it 
to start working again.

It's very reproducible. I just leave it overnight and it breaks every time.

I am willing and able to try patches at any time, the box is not in production.

No iptables, no ip6tables. IP6tables support is not even compiled in.

NIC is "Broadcom Corporation NetXtreme BCM5715 Gigabit ethernet (rev a3)"
according to lspci.

Other end is a directly connected Cisco 7600 (routed port) that I have access 
to, but it's in production use. IPv4 works perfectly over this same port. Only 
lo and eth0 are UP.


Output when broken
------------------
$ uname -a
Linux XXXXX 2.6.35 #1 SMP Tue Aug 3 09:25:51 CEST 2010 x86_64
GNU/Linux

$ ip -6 a sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436
     inet6 2a00:800:1000:64::1/128 scope global
        valid_lft forever preferred_lft forever
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
     inet6 2a00:800:752:1::5c:2/112 scope global
        valid_lft forever preferred_lft forever
     inet6 fe80::224:81ff:fea3:4424/64 scope link
        valid_lft forever preferred_lft forever

(I have tried removing 2a00:800:1000:64::1/128 from lo, same issue)

$ ip -6 r sh
2a00:800:752:1::5c:0/112 dev eth0  proto kernel  metric 256  mtu 1500
advmss 14 hoplimit 4294967295 unreachable
2a00:800:1000:64::1 dev lo proto kernel  metric 256  error -101 mtu 16436 
advmss 16376 hoplimit 4294967295
fe80::/64 dev eth0  proto kernel  metric 256  mtu 1500 advmss 1440
hoplimit 4294967295
default via 2a00:800:752:1::5c:1 dev eth0  metric 1024  mtu 1500 advmss 1440 
hoplimit 4294967295

$ ping6 2a00:800:752:1::5c:1
PING 2a00:800:752:1::5c:1(2a00:800:752:1::5c:1) 56 data bytes
^C
--- 2a00:800:752:1::5c:1 ping statistics ---
22 packets transmitted, 0 received, 100% packet loss, time 21006ms


# Tcpdump on the problem machine shows mostly the pings, but also periodically 
some ND:

[...]
12:54:02.683672 00:24:81:a3:44:24 > 00:22:55:17:4b:80, ethertype IPv6
(0x86dd), length 118: 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo 
request, seq 12, length 64
12:54:02.693669 00:24:81:a3:44:24 > 00:22:55:17:4b:80, ethertype IPv6
(0x86dd), length 86: fe80::224:81ff:fea3:4424 > 2a00:800:752:1::5c:1: ICMP6, 
neighbor solicitation, who has 2a00:800:752:1::5c:1, length 32
12:54:02.693832 00:22:55:17:4b:80 > 00:24:81:a3:44:24, ethertype IPv6
(0x86dd), length 78: 2a00:800:752:1::5c:1 > fe80::224:81ff:fea3:4424: ICMP6, 
neighbor advertisement, tgt is 2a00:800:752:1::5c:1, length 24
12:54:03.683672 00:24:81:a3:44:24 > 00:22:55:17:4b:80, ethertype IPv6
(0x86dd), length 118: 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo 
request, seq 13, length 64
[...]

$ ip -6 ne
fe80::222:55ff:fe17:4b80 dev eth0 lladdr 00:22:55:17:4b:80 router STALE
2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router STALE


Fixing the adjacency
--------------------
$ ping6 2a00:800:752:1::5c:1
PING 2a00:800:752:1::5c:1(2a00:800:752:1::5c:1) 56 data bytes
^C
--- 2a00:800:752:1::5c:1 ping statistics ---
51 packets transmitted, 0 received, 100% packet loss, time 50006ms

$ sudo ip ne del 2a00:800:752:1::5c:1 dev eth0
$ ping6 2a00:800:752:1::5c:1
PING 2a00:800:752:1::5c:1(2a00:800:752:1::5c:1) 56 data bytes
64 bytes from 2a00:800:752:1::5c:1: icmp_seq=1 ttl=64 time=31.9 ms
64 bytes from 2a00:800:752:1::5c:1: icmp_seq=2 ttl=64 time=0.212 ms

$ ip -6 ne
fe80::222:55ff:fe17:4b80 dev eth0 lladdr 00:22:55:17:4b:80 router REACHABLE
2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router REACHABLE

(Note that after a few minutes it goes back to STALE, but pinging still works 
and brings back the state to REACHABLE, so it's not that it can't get out of 
STALE once there, it seems).
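
For reference, the state names in the "ip -6 ne" output come from neighbor
unreachability detection (RFC 4861). Below is a minimal sketch of the
transitions, heavily simplified. The event names are made up for this
illustration, and FAILED is the label Linux shows where the RFC simply
discards the entry.

```python
# Simplified sketch of the RFC 4861 neighbor unreachability detection
# state machine. Event names are hypothetical, chosen for readability.
TRANSITIONS = {
    ("INCOMPLETE", "advert_received"): "REACHABLE",
    ("INCOMPLETE", "retries_exhausted"): "FAILED",
    ("REACHABLE", "reachable_timer_expired"): "STALE",
    ("STALE", "packet_sent"): "DELAY",
    ("DELAY", "reachability_confirmed"): "REACHABLE",
    ("DELAY", "delay_timer_expired"): "PROBE",
    ("PROBE", "advert_received"): "REACHABLE",
    ("PROBE", "retries_exhausted"): "FAILED",
}


def next_state(state, event):
    """Return the state after `event`; unknown events leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(state, events):
    """Apply a sequence of events, starting from `state`."""
    for event in events:
        state = next_state(state, event)
    return state
```

Note that an entry on this host reaching REACHABLE only says that this host
resolved its neighbor; it says nothing about whether the neighbor can
resolve this host.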

---------
typedef struct me_s {
   char name[]      = { "Thomas Habets" };
   char email[]     = { "thomas@habets.pp.se" };
   char kernel[]    = { "Linux" };
   char *pgpKey[]   = { "http://www.habets.pp.se/pubkey.txt" };
   char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE  0945 286A E90A AD48 E854" };
   char coolcmd[]   = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-13 17:55 BUG: IPv6 stops working after a while, needs ip ne del command to reset Thomas Habets
@ 2010-08-13 21:34 ` David Miller
  2010-08-16 10:19 ` Eric Dumazet
  1 sibling, 0 replies; 25+ messages in thread
From: David Miller @ 2010-08-13 21:34 UTC (permalink / raw)
  To: thomas; +Cc: linux-kernel

From: Thomas Habets <thomas@habets.pp.se>
Date: Fri, 13 Aug 2010 19:55:40 +0200 (CEST)

> 
> (originally sent to netdev on aug 6th)

If you didn't get an answer on netdev, sending your query again to
linux-kernel isn't going to help.  Networking experts generally do not
read this list.

Nobody has simply had an opportunity to look into your problem yet,
that is all.  I personally have it saved in my inbox and plan to look
at it when I get a chance unless someone else gets to it first.


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-13 17:55 BUG: IPv6 stops working after a while, needs ip ne del command to reset Thomas Habets
  2010-08-13 21:34 ` David Miller
@ 2010-08-16 10:19 ` Eric Dumazet
  2010-08-16 10:59   ` Thomas Habets
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2010-08-16 10:19 UTC (permalink / raw)
  To: Thomas Habets; +Cc: linux-kernel, netdev

On Friday 13 August 2010 at 19:55 +0200, Thomas Habets wrote:
> (originally sent to netdev on aug 6th)
> 

CC netdev again

> IPv6 initially works, but when I leave it alone overnight I'm unable to ping 
> even my default gw.
> 
> Static global IPv6 addresses configured on both ends. No access lists on either 
> end.
> 
> Kernel version: 2.6.35 mainline (amd64) and 2.6.33.6.
> Kernel config: http://pastebin.com/raw.php?i=Y6S8iKW7
> Dist: Debian Lenny (5.0.5), nothing special to my knowledge.
> 
> I seem to have the same issue that Mikael Abrahamsson encountered with Ubuntu 
> kernels 2.6.26.3, 2.6.26-5-generic and 2.6.27-2-generic, and mainline kernels 
> 2.6.25, 2.6.26 and 2.6.27:
>    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263260
> 
> He got IPv6 running again without rebooting using "networking stop, ifconfig 
> eth0 down, networking start, kill dhclient", while I narrowed it down to just 
> deleting the ipv6 neighbor (ip ne del..., see below). Rebooting also causes it 
> to start working again.
> 
> It's very reproducible. I just leave it overnight and it breaks every time.
> 
> I am willing and able to try patches at any time, the box is not in production.
> 
> No iptables, no ip6tables. IP6tables support is not even compiled in.
> 
> NIC is "Broadcom Corporation NetXtreme BCM5715 Gigabit ethernet (rev a3)"
> according to lspci.
> 
> Other end is a directly connected Cisco 7600 (routed port) that I have access 
> to, but it's in production use. IPv4 works perfectly over this same port. Only 
> lo and eth0 are UP.
> 
> 
> Output when broken
> ------------------
> $ uname -a
> Linux XXXXX 2.6.35 #1 SMP Tue Aug 3 09:25:51 CEST 2010 x86_64
> GNU/Linux
> 
> $ ip -6 a sh
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436
>      inet6 2a00:800:1000:64::1/128 scope global
>         valid_lft forever preferred_lft forever
>      inet6 ::1/128 scope host
>         valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
>      inet6 2a00:800:752:1::5c:2/112 scope global
>         valid_lft forever preferred_lft forever
>      inet6 fe80::224:81ff:fea3:4424/64 scope link
>         valid_lft forever preferred_lft forever
> 
> (I have tried removing 2a00:800:1000:64::1/128 from lo, same issue)
> 
> $ ip -6 r sh
> 2a00:800:752:1::5c:0/112 dev eth0  proto kernel  metric 256  mtu 1500
> advmss 14 hoplimit 4294967295 unreachable

advmss 14 ? or is it a copy/paste error ?
unreachable ?

This route seems wrong.

> 2a00:800:1000:64::1 dev lo proto kernel  metric 256  error -101 mtu 16436 
> advmss 16376 hoplimit 4294967295
> fe80::/64 dev eth0  proto kernel  metric 256  mtu 1500 advmss 1440
> hoplimit 4294967295
> default via 2a00:800:752:1::5c:1 dev eth0  metric 1024  mtu 1500 advmss 1440 
> hoplimit 4294967295
> 
> $ ping6 2a00:800:752:1::5c:1
> PING 2a00:800:752:1::5c:1(2a00:800:752:1::5c:1) 56 data bytes
> ^C
> --- 2a00:800:752:1::5c:1 ping statistics ---
> 22 packets transmitted, 0 received, 100% packet loss, time 21006ms
> 
> 
> # Tcpdump on the problem machine shows mostly the pings, but also periodically 
> some ND:
> 
> [...]
> 12:54:02.683672 00:24:81:a3:44:24 > 00:22:55:17:4b:80, ethertype IPv6
> (0x86dd), length 118: 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo 
> request, seq 12, length 64

> 12:54:02.693669 00:24:81:a3:44:24 > 00:22:55:17:4b:80, ethertype IPv6
> (0x86dd), length 86: fe80::224:81ff:fea3:4424 > 2a00:800:752:1::5c:1: ICMP6, 
> neighbor solicitation, who has 2a00:800:752:1::5c:1, length 32

Solicitation comes from fe80::224:81ff:fea3:4424 instead of
2a00:800:752:1::5c:2
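
For reference, fe80::224:81ff:fea3:4424 matches the modified EUI-64
link-local address of eth0's own MAC, 00:24:81:a3:44:24. A small sketch of
that derivation follows; the function name is made up for this
illustration.

```python
import ipaddress


def link_local_from_mac(mac):
    """Build the fe80::/64 modified EUI-64 address for a 48-bit MAC."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    # fe80 prefix, 48 zero bits, then the 64-bit interface identifier.
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)
```

The same derivation applied to the router's MAC, 00:22:55:17:4b:80, gives
the fe80::222:55ff:fe17:4b80 entry seen in the neighbor table.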

> 12:54:02.693832 00:22:55:17:4b:80 > 00:24:81:a3:44:24, ethertype IPv6
> (0x86dd), length 78: 2a00:800:752:1::5c:1 > fe80::224:81ff:fea3:4424: ICMP6, 
> neighbor advertisement, tgt is 2a00:800:752:1::5c:1, length 24



> 12:54:03.683672 00:24:81:a3:44:24 > 00:22:55:17:4b:80, ethertype IPv6
> (0x86dd), length 118: 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo 
> request, seq 13, length 64
> [...]
> 
> $ ip -6 ne
> fe80::222:55ff:fe17:4b80 dev eth0 lladdr 00:22:55:17:4b:80 router STALE
> 2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router STALE
> 
> 
> Fixing the adjacency
> --------------------
> $ ping6 2a00:800:752:1::5c:1
> PING 2a00:800:752:1::5c:1(2a00:800:752:1::5c:1) 56 data bytes
> ^C
> --- 2a00:800:752:1::5c:1 ping statistics ---
> 51 packets transmitted, 0 received, 100% packet loss, time 50006ms
> 
> $ sudo ip ne del 2a00:800:752:1::5c:1 dev eth0
> $ ping6 2a00:800:752:1::5c:1
> PING 2a00:800:752:1::5c:1(2a00:800:752:1::5c:1) 56 data bytes
> 64 bytes from 2a00:800:752:1::5c:1: icmp_seq=1 ttl=64 time=31.9 ms
> 64 bytes from 2a00:800:752:1::5c:1: icmp_seq=2 ttl=64 time=0.212 ms
> 
> $ ip -6 ne
> fe80::222:55ff:fe17:4b80 dev eth0 lladdr 00:22:55:17:4b:80 router REACHABLE
> 2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router REACHABLE
> 
> (Note that after a few minutes it goes back to STALE, but pinging still works 
> and brings back the state to REACHABLE, so it's not that it can't get out of 
> STALE once there, it seems).
> 

I am wondering if you have some low-level problem, say lost frames on an
otherwise idle link, maybe a full/half duplex mismatch?





* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-16 10:19 ` Eric Dumazet
@ 2010-08-16 10:59   ` Thomas Habets
  2010-08-17  5:35     ` Thomas Habets
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Habets @ 2010-08-16 10:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev, Thomas Habets

On Mon, 16 Aug 2010, Eric Dumazet wrote:
>> $ ip -6 r sh
>> 2a00:800:752:1::5c:0/112 dev eth0  proto kernel  metric 256  mtu 1500
>> advmss 14 hoplimit 4294967295 unreachable
>
> advmss 14 ? or is it a copy/paste error ?
> unreachable ?

Copy/paste error, sorry.

advmss 1440, and no "unreachable"

Complete (correct) line:

2a00:800:752:1::5c:0/112 dev eth0 proto kernel metric 256 mtu 1500 advmss 
1440 hoplimit 4294967295

> This route seems wrong.

Did you mean the below?

>> 2a00:800:1000:64::1 dev lo proto kernel  metric 256  error -101 mtu 
>> 16436 advmss 16376 hoplimit 4294967295

Yes, that's where the "unreachable" belonged.

unreachable 2a00:800:1000:64::1 dev lo proto kernel  metric 256  error 
-101 mtu 1500 advmss 1440 hoplimit 4294967295

That's an additional address on the lo interface, seen here again:

>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436
>>      inet6 2a00:800:1000:64::1/128 scope global
>>         valid_lft forever preferred_lft forever
>>      inet6 ::1/128 scope host
>>         valid_lft forever preferred_lft forever

Am I not allowed to add addresses to lo? I've now deconfigured this 
address and rebooted the box to check if this matters.

> I am wondering if you have some lowlevel problem, say lost frames in an
> otherwise idle link, maybe a full/half duplex mismatch ?

But at first it works perfectly, and then it doesn't work at all. The link 
is ~idle both before and after, and IPv4 is unaffected. When I run "ip ne 
del ..." it *immediately* starts working again. From 100% packet loss to 
0%.

Duplex is full according to dmesg and ethtool (mii-tool says 1000BaseT-HD, 
but I suppose mii-tool is not as reliable?).

Cisco router also says "Full-duplex, 1000Mb/s", so there doesn't seem to 
be a mismatch. No errors in "show int giX/Y" either.

ethtool info right after reboot (when ipv6 is still working):
$ sudo ethtool eth0
Settings for eth0:
         Supported ports: [ TP ]
         Supported link modes:   10baseT/Half 10baseT/Full
                                 100baseT/Half 100baseT/Full
                                 1000baseT/Half 1000baseT/Full
         Supports auto-negotiation: Yes
         Advertised link modes:  10baseT/Half 10baseT/Full
                                 100baseT/Half 100baseT/Full
                                 1000baseT/Half 1000baseT/Full
         Advertised auto-negotiation: Yes
         Speed: 1000Mb/s
         Duplex: Full
         Port: Twisted Pair
         PHYAD: 1
         Transceiver: internal
         Auto-negotiation: on
         Supports Wake-on: g
         Wake-on: g
         Current message level: 0x000000ff (255)
         Link detected: yes


No errors show in "ethtool -S eth0 | grep -v ': 0$'" now that it's 
working.

I will re-check ethtool and Cisco router output for mismatches when it 
breaks again to make sure that there's no change or errors counting up.




* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-16 10:59   ` Thomas Habets
@ 2010-08-17  5:35     ` Thomas Habets
  2010-08-17  6:00       ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Habets @ 2010-08-17  5:35 UTC (permalink / raw)
  To: Thomas Habets; +Cc: Eric Dumazet, linux-kernel, netdev

On Mon, 16 Aug 2010, Thomas Habets wrote:
> Am I not allowed to add addresses to lo? I've now deconfigured this address 
> and rebooted the box to check if this matters.

It didn't help.

> I will re-check ethtool and Cisco router output for mismatches when it breaks 
> again to make sure that there's no change or errors counting up.

IPv6 is currently not working and it's still 1000 Full on both sides 
("show int GiX/Y" and "ethtool eth0").
No errors in "ethtool -S eth0" or "show int GiX/Y".

"ethtool eth0" output is the same as yesterday.

Here's the addresses and routing table as they are now, and have been 
since reboot:

$ ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
     inet6 2a00:800:752:1::5c:2/112 scope global
        valid_lft forever preferred_lft forever
     inet6 fe80::224:81ff:fea3:4424/64 scope link
        valid_lft forever preferred_lft forever


$ ip -6 r sh
2a00:800:752:1::5c:0/112 dev eth0  proto kernel  metric 256  mtu 1500
advmss 1440 hoplimit 4294967295

fe80::/64 dev eth0  proto kernel  metric 256  mtu 1500 advmss 1440
hoplimit 4294967295

default via 2a00:800:752:1::5c:1 dev eth0  metric 1024  mtu 1500 advmss
1440 hoplimit 4294967295

$ ip -6 ne sh
2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router STALE

[try ping6 again, no reply]

$ ip -6 ne sh
2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router DELAY

[try ping6 again, no reply]

$ ip -6 ne sh
2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router REACHABLE

[try ping6 again, no reply]

Configured network with /etc/network/interfaces:

auto lo
iface lo inet loopback

allow-hotplug eth0
iface eth0 inet static
 	address x.x.x.x
 	netmask 255.255.255.252
 	broadcast x.x.x.x
 	gateway x.x.x.x
 	up ip a a 2a00:800:752:1::5c:2/112 dev eth0
 	up ip r a default via 2a00:800:752:1::5c:1
 	dns-nameservers x.x.x.x
 	dns-search xxxx.net



* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17  5:35     ` Thomas Habets
@ 2010-08-17  6:00       ` Eric Dumazet
  2010-08-17 11:08         ` Thomas Habets
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2010-08-17  6:00 UTC (permalink / raw)
  To: Thomas Habets; +Cc: linux-kernel, netdev

On Tuesday 17 August 2010 at 07:35 +0200, Thomas Habets wrote:
> On Mon, 16 Aug 2010, Thomas Habets wrote:
> > Am I not allowed to add addresses to lo? I've now deconfigured this address 
> > and rebooted the box to check if this matters.
> 
> It didn't help.
> 
> > I will re-check ethtool and Cisco router output for mismatches when it breaks 
> > again to make sure that there's no change or errors counting up.
> 
> IPv6 is currently not working and it's still 1000 Full on both sides 
> ("show int GiX/Y" and "ethtool eth0").
> No errors in "ethtool -S eth0" or "show int GiX/Y".
> 
> "ethtool eth0" output is the same as yesterday.
> 
> Here's the addresses and routing table as they are now, and have been 
> since reboot:
> 
> $ ip -6 a
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436
>      inet6 ::1/128 scope host
>         valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
>      inet6 2a00:800:752:1::5c:2/112 scope global
>         valid_lft forever preferred_lft forever
>      inet6 fe80::224:81ff:fea3:4424/64 scope link
>         valid_lft forever preferred_lft forever
> 
> 
> $ ip -6 r sh
> 2a00:800:752:1::5c:0/112 dev eth0  proto kernel  metric 256  mtu 1500
> advmss 1440 hoplimit 4294967295
> 
> fe80::/64 dev eth0  proto kernel  metric 256  mtu 1500 advmss 1440
> hoplimit 4294967295
> 
> default via 2a00:800:752:1::5c:1 dev eth0  metric 1024  mtu 1500 advmss
> 1440 hoplimit 4294967295
> 
> $ ip -6 ne sh
> 2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router STALE
> 
> [try ping6 again, no reply]
> 
> $ ip -6 ne sh
> 2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router DELAY
> 
> [try ping6 again, no reply]
> 
> $ ip -6 ne sh
> 2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router REACHABLE
> 

This seems a bit different from the previous mail. Apparently discovery now
works?

Could you run tcpdump on both sides?

Maybe your host is OK, but the other side drops your ICMP packets?

> [try ping6 again, no reply]
> 
> Configured network with /etc/network/interfaces:
> 
> auto lo
> iface lo inet loopback
> 
> allow-hotplug eth0
> iface eth0 inet static
>  	address x.x.x.x
>  	netmask 255.255.255.252
>  	broadcast x.x.x.x
>  	gateway x.x.x.x
>  	up ip a a 2a00:800:752:1::5c:2/112 dev eth0
>  	up ip r a default via 2a00:800:752:1::5c:1
>  	dns-nameservers x.x.x.x
>  	dns-search xxxx.net




* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17  6:00       ` Eric Dumazet
@ 2010-08-17 11:08         ` Thomas Habets
  2010-08-17 13:15           ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Habets @ 2010-08-17 11:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Thomas Habets, linux-kernel, netdev


Aha! New development:

The Cisco router can't discover the address of the Linux box because Linux 
doesn't seem to be listening to ff02::1 (all-nodes).

-----------
cisco#ping ff02::1
Output Interface: GigabitEthernet1/2
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to FF02::1, timeout is 2 seconds:
Packet sent with a source address of 
FE80::222:55FF:FE17:4B80%GigabitEthernet1/2

Request 0 timed out
Request 1 timed out
Request 2 timed out
Request 3 timed out
Request 4 timed out
Success rate is 0 percent (0/5)
0 multicast replies and 0 errors.
------------

If I set promisc mode on the interface (tcpdump without -p or "ip link set 
promisc on eth0") it starts working (both normal ping and the above ping 
from the Cisco to ff02::1). It continues working until, I guess, the 
neighbor table on the Cisco times out (leaving it overnight seems to 
be enough idle time) or I manually do a "clear ipv6 neig".

So great news! I can reproduce it at will with no waiting time! Right 
after rebooting the Linux box I run "clear ipv6 neighbors" and Linux can 
no longer ping the router. Tested reproducing it immediately after reboot.

The Linux box itself can ping ff02::1%eth0 with no problem, and gets 
replies from the fe80:: link-local of itself and the Cisco router.

So could this be that for some reason the NIC isn't listening on the 
multicast MAC address 33:33:ff:5c:00:02 ?
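
For reference, that MAC address follows from the standard mappings: the
solicited-node group is ff02::1:ff plus the low 24 bits of the unicast
address, and the Ethernet destination is 33:33 plus the low 32 bits of the
group. A minimal sketch, with function names made up for this
illustration:

```python
import ipaddress


def solicited_node_multicast(addr):
    """ff02::1:ffXX:XXXX, built from the low 24 bits of a unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group):
    """33:33 followed by the low 32 bits of an IPv6 multicast address."""
    low32 = int(ipaddress.IPv6Address(group)) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{octet:02x}" for octet in octets)
```

For 2a00:800:752:1::5c:2 this gives the group ff02::1:ff5c:2 and the MAC
33:33:ff:5c:00:02. The all-nodes group ff02::1 maps to 33:33:00:00:00:01 in
the same way.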

Is there a way to see the list of addresses that get past the NIC? Or can 
this perhaps be filtered after the NIC, but before tcpdump -p?
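
One partial answer: "ip maddr show dev eth0" lists the link-layer multicast
addresses the kernel has asked the driver to accept, and /proc/net/dev_mcast
carries the same list. That shows what was requested, not what the hardware
filter actually lets through. A minimal sketch of reading the proc file
follows; the column layout (ifindex, name, refcount, global use, hex
address) is an assumption about the file format, and the function name is
made up for this illustration.

```python
def parse_dev_mcast(text, ifname):
    """Return the link-layer multicast addresses listed for `ifname`."""
    macs = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed columns: ifindex, name, refcount, global_use, hex address.
        if len(fields) != 5 or fields[1] != ifname:
            continue
        raw = fields[4]
        macs.append(":".join(raw[i:i + 2] for i in range(0, len(raw), 2)))
    return macs


# Typical use on a Linux host:
#   with open("/proc/net/dev_mcast") as f:
#       print(parse_dev_mcast(f.read(), "eth0"))
```

If 33:33:ff:5c:00:02 and 33:33:00:00:00:01 appear in that list but frames
to them only arrive in promiscuous mode, the filtering is happening below
the kernel's multicast list, in the driver or the NIC.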

Since this now looks like a NIC thing, here's some info about eth0:

$ dmesg | grep eth0
[...]
tg3 0000:03:04.0: eth0: Tigon3 [partno(N/A) rev 9003] (PCIX:133MHz:64-bit) 
MAC address 00:24:81:a3:44:24
tg3 0000:03:04.0: eth0: attached PHY is 5714 (10/100/1000Base-T Ethernet) 
(WireSpeed[1])
tg3 0000:03:04.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
tg3 0000:03:04.0: eth0: dma_rwctrl[76148000] dma_mask[40-bit]
[...]

$ sudo lspci -v -s 03:04.0
03:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 
Gigabit Ethernet (rev a3)
Subsystem: Hewlett-Packard Company NC326i PCIe Dual Port Gigabit Server 
Adapter
Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 47
Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
Expansion ROM at <ignored> [disabled]
Capabilities: [40] PCI-X non-bridge device
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data <?>
Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 
Enable+
Kernel driver in use: tg3
Kernel modules: tg3

$ sudo ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:24:81:a3:44:24
           inet addr:x.x.x.x  Bcast:x.x.x.x 
Mask:255.255.255.252
           inet6 addr: 2a00:800:752:1::5c:2/112 Scope:Global
           inet6 addr: fe80::224:81ff:fea3:4424/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
           RX packets:928 errors:0 dropped:0 overruns:0 frame:0
           TX packets:834 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:142281 (138.9 KiB)  TX bytes:154616 (150.9 KiB)
           Interrupt:16

I have doublechecked iptables, ip6tables and arptables, and they are 
either not compiled in the kernel or they are empty ACCEPT lists.

I have answered your questions below even if they may no longer be 
applicable.


On Tue, 17 Aug 2010, Eric Dumazet wrote:
>> $ ip -6 ne sh
>> 2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router STALE
>>
>> [try ping6 again, no reply]
>>
>> $ ip -6 ne sh
>> 2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router DELAY
>>
>> [try ping6 again, no reply]
>>
>> $ ip -6 ne sh
>> 2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router REACHABLE
>>
> This seems a bit different than previous mail. Apparently discovery now
> works ?

I didn't post the "ip -6 ne sh" immediately after ping attempt last time. 
I'm not sure this changed since last time.

But the tcpdump output from last time seems to indicate that ND did work 
then, at least in one direction, even if solicitation came from link-local 
address and not the global address. The solicitation was answered, after 
all (as seen in the tcpdump in the original mail).

> Could you have a tcpdump on both sides ?

Not easily. The other end is a Cisco and a bit inconvenient to get to. I'm 
going there tomorrow night, so I can hook up a cable and do a monitor 
port then if needed.



* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 11:08         ` Thomas Habets
@ 2010-08-17 13:15           ` Eric Dumazet
  2010-08-17 14:09             ` Thomas Habets
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2010-08-17 13:15 UTC (permalink / raw)
  To: Thomas Habets; +Cc: linux-kernel, netdev

On Tuesday 17 August 2010 at 13:08 +0200, Thomas Habets wrote:
> Aha! New development:
> 
> The Cisco router can't discover the address of the Linux box because Linux 
> doesn't seem to be listening to ff02::1 (all-nodes).
> 
> -----------
> cisco#ping ff02::1
> Output Interface: GigabitEthernet1/2
> Type escape sequence to abort.
> Sending 5, 100-byte ICMP Echos to FF02::1, timeout is 2 seconds:
> Packet sent with a source address of 
> FE80::222:55FF:FE17:4B80%GigabitEthernet1/2
> 
> Request 0 timed out
> Request 1 timed out
> Request 2 timed out
> Request 3 timed out
> Request 4 timed out
> Success rate is 0 percent (0/5)
> 0 multicast replies and 0 errors.
> ------------
> 
> If I set promisc mode on the interface (tcpdump without -p or "ip link set 
> promisc on eth0") it starts working (both normal ping and the above ping 
> from the Cisco to ff02::1). It continues working until, I guess, the 
> neighbor table on the Cisco times out (leaving it overnight seems to 
> be enough idle time) or I manually do a "clear ipv6 neig".
> 
> So great news! I can reproduce it at will with no waiting time! Right 
> after rebooting the Linux box I run "clear ipv6 neighbors" and Linux can 
> no longer ping the router. Tested reproducing it immediately after reboot.
> 
> The Linux box itself can ping ff02::1%eth0 with no problem, and gets 
> replies from the fe80:: link-local of itself and the Cisco router.
> 
> So could this be that for some reason the NIC isn't listening on the 
> multicast MAC address 33:33:ff:5c:00:02 ?
> 

That would be very surprising, but who knows...

Can you try : "ifconfig eth0 allmulti"

Maybe the tg3 driver has a problem building the multicast table for this 5715


> Is there a way to see the list of addresses that get past the NIC? Or can 
> this perhaps be filtered after the NIC, but before tcpdump -p?
> 
> Since this now looks like a NIC thing, here's some info about eth0:
> 
> $ dmesg | grep eth0
> [...]
> tg3 0000:03:04.0: eth0: Tigon3 [partno(N/A) rev 9003] (PCIX:133MHz:64-bit) 
> MAC address 00:24:81:a3:44:24
> tg3 0000:03:04.0: eth0: attached PHY is 5714 (10/100/1000Base-T Ethernet) 
> (WireSpeed[1])
> tg3 0000:03:04.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
> tg3 0000:03:04.0: eth0: dma_rwctrl[76148000] dma_mask[40-bit]
> [...]
> 
> $ sudo lspci -v -s 03:04.0
> 03:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 
> Gigabit Ethernet (rev a3)
> Subsystem: Hewlett-Packard Company NC326i PCIe Dual Port Gigabit Server 
> Adapter
> Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 47
> Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> Expansion ROM at <ignored> [disabled]
> Capabilities: [40] PCI-X non-bridge device
> Capabilities: [48] Power Management version 2
> Capabilities: [50] Vital Product Data <?>
> Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 
> Enable+
> Kernel driver in use: tg3
> Kernel modules: tg3
> 
> $ sudo ifconfig eth0
> eth0      Link encap:Ethernet  HWaddr 00:24:81:a3:44:24
>            inet addr:x.x.x.x  Bcast:x.x.x.x 
> Mask:255.255.255.252
>            inet6 addr: 2a00:800:752:1::5c:2/112 Scope:Global
>            inet6 addr: fe80::224:81ff:fea3:4424/64 Scope:Link
>            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>            RX packets:928 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:834 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:1000
>            RX bytes:142281 (138.9 KiB)  TX bytes:154616 (150.9 KiB)
>            Interrupt:16
> 
> I have doublechecked iptables, ip6tables and arptables, and they are 
> either not compiled in the kernel or they are empty ACCEPT lists.


If you leave a "tcpdump" running with the -p option, do you receive the packet
sent to ethernet dest 33:33:ff:5c:00:02 ?

If you can see it with tcpdump, then NIC gave the frame to us.





* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 13:15           ` Eric Dumazet
@ 2010-08-17 14:09             ` Thomas Habets
  2010-08-17 14:34               ` Eric Dumazet
  2010-08-17 16:14               ` Thomas Habets
  0 siblings, 2 replies; 25+ messages in thread
From: Thomas Habets @ 2010-08-17 14:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Thomas Habets, linux-kernel, netdev

On Tue, 17 Aug 2010, Eric Dumazet wrote:
> Can you try : "ifconfig eth0 allmulti"

That didn't help. "ifconfig eth0" and "ip l" shows that allmulti is now 
set, but no other difference. Can't ping router, and router gets no answer 
when pinging ff02::1. No message in dmesg saying allmulti isn't supported 
or anything like that either.

> If you leave a "tcpdump" running with the -p option, do you receive the packet
> sent to ethernet dest 33:33:ff:5c:00:02 ?

No. Commented tcpdump output below.

> If you can see it with tcpdump, then NIC gave the frame to us.

Seems to be invisible unless I or tcpdump set promisc mode. But when 
promisc mode is set I can immediately see the 33:33:ff:5c:00:02 packet 
(ND solicitation) and I see that Linux is answering it.
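
For readers wondering where 33:33:ff:5c:00:02 comes from: it is the Ethernet
multicast address of the solicited-node group for 2a00:800:752:1::5c:2. Below
is a minimal sketch of that mapping, assuming the rules from RFC 4291
(solicited-node address ff02::1:ffXX:XXXX) and RFC 2464 (33:33 plus the low
32 bits of the IPv6 multicast address). The function names are hypothetical
and are not part of any kernel or library API.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>  /* AF_INET6 */
#include <netinet/in.h>  /* struct in6_addr */
#include <arpa/inet.h>   /* inet_pton */

/*
 * Hypothetical helper: compute the Ethernet multicast address a host must
 * accept in order to hear neighbor solicitations for one of its unicast
 * IPv6 addresses.
 *
 * The solicited-node group ff02::1:ffXX:XXXX carries the low 24 bits of the
 * unicast address; its Ethernet mapping is 33:33 followed by the low 32 bits
 * of the group address, i.e. ff plus those same 24 bits.
 *
 * Returns 0 on success, -1 if the text is not a valid IPv6 address.
 */
int solicited_node_mac(const char *unicast, uint8_t mac[6])
{
	struct in6_addr addr;

	if (inet_pton(AF_INET6, unicast, &addr) != 1)
		return -1;

	mac[0] = 0x33;
	mac[1] = 0x33;
	mac[2] = 0xff;
	mac[3] = addr.s6_addr[13];
	mac[4] = addr.s6_addr[14];
	mac[5] = addr.s6_addr[15];
	return 0;
}

/* Format a MAC as "xx:xx:xx:xx:xx:xx"; buf must hold at least 18 bytes. */
void format_mac(const uint8_t mac[6], char buf[18])
{
	snprintf(buf, 18, "%02x:%02x:%02x:%02x:%02x:%02x",
		 (unsigned)mac[0], (unsigned)mac[1], (unsigned)mac[2],
		 (unsigned)mac[3], (unsigned)mac[4], (unsigned)mac[5]);
}
```

Calling solicited_node_mac() with the global address from this thread gives
the 33:33:ff:5c:00:02 frame discussed here, and the link-local address
fe80::224:81ff:fea3:4424 maps to 33:33:ff:a3:44:24.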

Here's a tcpdump from the Linux host. It's slightly trimmed to fit in
an email, but the full dump is at http://www.habets.pp.se/tmp/ipv6.pcap

$ sudo tcpdump -pnli eth0 -s0 -w ipv6.pcap ip6
[...]

$ tcpdump -nlr ipv6.pcap
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply

[ here I run "clear ipv6 neighbors" on the Cisco router ]

2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
[ ... more repeated echo requests, no reply ... ]

[ here I run "ip l set promisc on eth0" ]

2a00:800:752:1::5c:1 > ff02::1:ff5c:2: ICMP6, neighbor solicitation, who 
has 2a00:800:752:1::5c:2
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6,neighbor advertisement,
tgt is 2a00:800:752:1::5c:2
2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply

[ here I run "clear ipv6 neighbors" again ]
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request

---------
typedef struct me_s {
   char name[]      = { "Thomas Habets" };
   char email[]     = { "thomas@habets.pp.se" };
   char kernel[]    = { "Linux" };
   char *pgpKey[]   = { "http://www.habets.pp.se/pubkey.txt" };
   char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE  0945 286A E90A AD48 E854" };
   char coolcmd[]   = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 14:09             ` Thomas Habets
@ 2010-08-17 14:34               ` Eric Dumazet
  2010-08-17 15:58                 ` Thomas Habets
  2010-08-17 16:14               ` Thomas Habets
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2010-08-17 14:34 UTC (permalink / raw)
  To: Thomas Habets; +Cc: linux-kernel, netdev, Matt Carlson, Michael Chan

Le mardi 17 août 2010 à 16:09 +0200, Thomas Habets a écrit :
> On Tue, 17 Aug 2010, Eric Dumazet wrote:
> > Can you try : "ifconfig eth0 allmulti"
> 
> That didn't help. "ifconfig eth0" and "ip l" show that allmulti is now 
> set, but no other difference. Can't ping router, and router gets no answer 
> when pinging ff02::1. No message in dmesg saying allmulti isn't supported 
> or anything like that either.
> 
> > If you leave a "tcpdump" running with the -p option, do you receive the packet
> > sent to Ethernet destination 33:33:ff:5c:00:02?
> 
> No. Commented tcpdump output below.
> 
> > If you can see it with tcpdump, then the NIC gave the frame to us.
> 
> Seems to be invisible unless I or tcpdump set promisc mode. But when 
> promisc mode is set I can immediately see the 33:33:ff:5c:00:02 packet 
> (ND solicitation) and I see that Linux is answering it.
> 
> Here's a tcpdump from the Linux host. It's slightly trimmed to fit in
> an email, but the full dump is at http://www.habets.pp.se/tmp/ipv6.pcap
> 
> $ sudo tcpdump -pnli eth0 -s0 -w ipv6.pcap ip6
> [...]
> 
> $ tcpdump -nlr ipv6.pcap
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
> 
> [ here I run "clear ipv6 neighbors" on the Cisco router ]
> 
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> [ ... more repeated echo requests, no reply ... ]
> 
> [ here I run "ip l set promisc on eth0" ]
> 
> 2a00:800:752:1::5c:1 > ff02::1:ff5c:2: ICMP6, neighbor solicitation, who 
> has 2a00:800:752:1::5c:2
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6,neighbor advertisement,
> tgt is 2a00:800:752:1::5c:2
> 2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply
> 
> [ here I run "clear ipv6 neighbors" again ]
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request

I suspect it's time to ask the Broadcom guys for some help :)

I have the same adapter here (Hewlett-Packard Company NC326m PCIe Dual Port
Adapter) and could not reproduce the problem.

Try the following patch to check that tg3 receives the correct multicast list
(it's OK for me, as seen in the dmesg output)

[17162.120238]  add mc_addr(ha->addr=33:33:00:00:00:01)
[17162.120270]  add mc_addr(ha->addr=01:00:5e:00:00:01)
[17162.120298]  add mc_addr(ha->addr=33:33:ff:87:96:ce)
[17162.120326]  add mc_addr(ha->addr=33:33:ff:5c:00:02)
[17162.120355] filters=80000001 00000000 00400000 40000000


But if the problem remains even with "ifconfig eth0 allmulti", I suspect a
NIC firmware problem. (allmulti sets all 128 bits of the filters to 1.)

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index bc3af78..34510f5 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -9317,12 +9317,14 @@ static void __tg3_set_rx_mode(struct net_device *dev)
 		u32 crc;
 
 		netdev_for_each_mc_addr(ha, dev) {
+			pr_err("add mc_addr(ha->addr=%pM)\n", ha->addr);
 			crc = calc_crc(ha->addr, ETH_ALEN);
 			bit = ~crc & 0x7f;
 			regidx = (bit & 0x60) >> 5;
 			bit &= 0x1f;
 			mc_filter[regidx] |= (1 << bit);
 		}
+		pr_err("filters=%08X %08x %08x %08x\n", mc_filter[0], mc_filter[1], mc_filter[2], mc_filter[3]);
 
 		tw32(MAC_HASH_REG_0, mc_filter[0]);
 		tw32(MAC_HASH_REG_1, mc_filter[1]);
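
To follow the "filters=" values in this thread, it helps to see how one
multicast MAC address ends up as a single bit in the 128-bit hash filter. The
following is a user-space sketch of the loop patched above, not the driver
itself. It assumes that the driver's calc_crc() is a plain bitwise CRC-32 with
the reflected polynomial 0xedb88320; mc_filter_add() is a hypothetical name
introduced only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32 (initial value 0xffffffff, reflected polynomial
 * 0xedb88320, final inversion). Assumption: this matches what the
 * driver's calc_crc() computes.
 */
uint32_t calc_crc(const uint8_t *buf, size_t len)
{
	uint32_t reg = 0xffffffff;

	for (size_t j = 0; j < len; j++) {
		reg ^= buf[j];
		for (int k = 0; k < 8; k++) {
			uint32_t lsb = reg & 0x01;

			reg >>= 1;
			if (lsb)
				reg ^= 0xedb88320;
		}
	}
	return ~reg;
}

/*
 * Hypothetical helper: set the hash bit for one 6-byte MAC address in a
 * filter made of four 32-bit words, using the same steps as the loop in
 * the patch.
 */
void mc_filter_add(uint32_t mc_filter[4], const uint8_t mac[6])
{
	uint32_t crc = calc_crc(mac, 6);
	uint32_t bit = ~crc & 0x7f;          /* 7-bit index into 128 filter bits */
	uint32_t regidx = (bit & 0x60) >> 5; /* top two bits select the word */

	bit &= 0x1f;                         /* low five bits select the bit */
	mc_filter[regidx] |= (uint32_t)1 << bit;
}
```

Starting from an all-zero filter and adding 33:33:00:00:00:01,
01:00:5e:00:00:01, 33:33:ff:5c:00:02 and 33:33:ff:a3:44:24, this sketch
produces 80020001 00000000 00000000 40000000. Because the hash is only 7 bits
wide, unrelated addresses can share a bit, so a set bit does not prove that a
particular address was subscribed.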




* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 14:34               ` Eric Dumazet
@ 2010-08-17 15:58                 ` Thomas Habets
  2010-08-17 17:11                   ` Matt Carlson
  2010-08-17 17:13                   ` Eric Dumazet
  0 siblings, 2 replies; 25+ messages in thread
From: Thomas Habets @ 2010-08-17 15:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Habets, linux-kernel, netdev, Matt Carlson, Michael Chan

On Tue, 17 Aug 2010, Eric Dumazet wrote:
> Try the following patch to check that tg3 receives the correct multicast list
> (it's OK for me, as seen in the dmesg output)
>
> [17162.120238]  add mc_addr(ha->addr=33:33:00:00:00:01)
> [17162.120270]  add mc_addr(ha->addr=01:00:5e:00:00:01)
> [17162.120298]  add mc_addr(ha->addr=33:33:ff:87:96:ce)
> [17162.120326]  add mc_addr(ha->addr=33:33:ff:5c:00:02)
> [17162.120355] filters=80000001 00000000 00400000 40000000

Right after boot:

$ dmesg | egrep 'eth0|^add mc|^filters='
tg3 0000:03:04.0: eth0: Tigon3 [partno(N/A) rev 9003] (PCIX:133MHz:64-bit) 
MAC address 00:24:81:a3:44:24
tg3 0000:03:04.0: eth0: attached PHY is 5714 (10/100/1000Base-T Ethernet) 
(WireSpeed[1])
tg3 0000:03:04.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
tg3 0000:03:04.0: eth0: dma_rwctrl[76148000] dma_mask[40-bit]
add mc_addr(ha->addr=33:33:00:00:00:01)
filters=80000000 00000000 00000000 00000000
add mc_addr(ha->addr=33:33:00:00:00:01)
filters=80000000 00000000 00000000 00000000
add mc_addr(ha->addr=33:33:00:00:00:01)
filters=80000000 00000000 00000000 00000000
add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
filters=80000000 00000000 00000000 40000000
ADDRCONF(NETDEV_UP): eth0: link is not ready
add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
filters=80000000 00000000 00000000 40000000
add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
filters=80000000 00000000 00000000 40000000
add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
add mc_addr(ha->addr=33:33:ff:5c:00:02)
filters=80000001 00000000 00000000 40000000
tg3 0000:03:04.0: eth0: Link is up at 1000 Mbps, full duplex
tg3 0000:03:04.0: eth0: Flow control is off for TX and off for RX
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
add mc_addr(ha->addr=33:33:ff:5c:00:02)
add mc_addr(ha->addr=33:33:ff:a3:44:24)
filters=80020001 00000000 00000000 40000000
eth0: no IPv6 routers present

[ ifconfig eth0 allmulti
   (ip l and ifconfig say ALLMULTI is on)
]

add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
add mc_addr(ha->addr=33:33:ff:5c:00:02)
add mc_addr(ha->addr=33:33:ff:a3:44:24)
filters=80020001 00000000 00000000 40000000

[
   $ sudo ifconfig eth0 -allmulti
   Warning: Interface eth0 still in ALLMULTI mode.
   (ip l and ifconfig say ALLMULTI is now off)
]

add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
add mc_addr(ha->addr=33:33:ff:5c:00:02)
add mc_addr(ha->addr=33:33:ff:a3:44:24)
filters=80020001 00000000 00000000 40000000

[ ifconfig eth0 allmulti
   (same effect)
]

add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
add mc_addr(ha->addr=33:33:ff:5c:00:02)
add mc_addr(ha->addr=33:33:ff:a3:44:24)
filters=80020001 00000000 00000000 40000000

[
   $ sudo ifconfig eth0 -allmulti
   Warning: Interface eth0 still in ALLMULTI mode.
   (same effect)
]

add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
add mc_addr(ha->addr=33:33:ff:5c:00:02)
add mc_addr(ha->addr=33:33:ff:a3:44:24)
filters=80020001 00000000 00000000 40000000


> But if the problem remains even with "ifconfig eth0 allmulti", I suspect a
> NIC firmware problem. (allmulti sets all 128 bits of the filters to 1.)

If you expected more bits set in "filters" with allmulti than without it, 
that doesn't seem to be the case.

Applied your patch to v2.6.35.


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 14:09             ` Thomas Habets
  2010-08-17 14:34               ` Eric Dumazet
@ 2010-08-17 16:14               ` Thomas Habets
  1 sibling, 0 replies; 25+ messages in thread
From: Thomas Habets @ 2010-08-17 16:14 UTC (permalink / raw)
  To: Thomas Habets; +Cc: Eric Dumazet, linux-kernel, netdev

On Tue, 17 Aug 2010, Thomas Habets wrote:
> [ here I run "ip l set promisc on eth0" ]
[...]
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:1 > 2a00:800:752:1::5c:2: ICMP6, echo reply

I should add that I turned promisc mode back off at about here. Otherwise 
the ND would work properly right after the "clear ipv6 neighbors" command.

> [ here I run "clear ipv6 neighbors" again ]
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request
> 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, echo request


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 15:58                 ` Thomas Habets
@ 2010-08-17 17:11                   ` Matt Carlson
  2010-08-17 17:29                     ` Thomas Habets
  2010-08-17 17:13                   ` Eric Dumazet
  1 sibling, 1 reply; 25+ messages in thread
From: Matt Carlson @ 2010-08-17 17:11 UTC (permalink / raw)
  To: Thomas Habets
  Cc: Eric Dumazet, linux-kernel, netdev, Matthew Carlson, Michael Chan

On Tue, Aug 17, 2010 at 08:58:26AM -0700, Thomas Habets wrote:
> On Tue, 17 Aug 2010, Eric Dumazet wrote:
> > Try the following patch to check that tg3 receives the correct multicast list
> > (it's OK for me, as seen in the dmesg output)
> >
> > [17162.120238]  add mc_addr(ha->addr=33:33:00:00:00:01)
> > [17162.120270]  add mc_addr(ha->addr=01:00:5e:00:00:01)
> > [17162.120298]  add mc_addr(ha->addr=33:33:ff:87:96:ce)
> > [17162.120326]  add mc_addr(ha->addr=33:33:ff:5c:00:02)
> > [17162.120355] filters=80000001 00000000 00400000 40000000
> 
> Right after boot:
> 
> $ dmesg | egrep 'eth0|^add mc|^filters='
> tg3 0000:03:04.0: eth0: Tigon3 [partno(N/A) rev 9003] (PCIX:133MHz:64-bit) 
> MAC address 00:24:81:a3:44:24
> tg3 0000:03:04.0: eth0: attached PHY is 5714 (10/100/1000Base-T Ethernet) 
> (WireSpeed[1])
> tg3 0000:03:04.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
> tg3 0000:03:04.0: eth0: dma_rwctrl[76148000] dma_mask[40-bit]
> add mc_addr(ha->addr=33:33:00:00:00:01)
> filters=80000000 00000000 00000000 00000000
> add mc_addr(ha->addr=33:33:00:00:00:01)
> filters=80000000 00000000 00000000 00000000
> add mc_addr(ha->addr=33:33:00:00:00:01)
> filters=80000000 00000000 00000000 00000000
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> filters=80000000 00000000 00000000 40000000
> ADDRCONF(NETDEV_UP): eth0: link is not ready
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> filters=80000000 00000000 00000000 40000000
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> filters=80000000 00000000 00000000 40000000
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> add mc_addr(ha->addr=33:33:ff:5c:00:02)
> filters=80000001 00000000 00000000 40000000
> tg3 0000:03:04.0: eth0: Link is up at 1000 Mbps, full duplex
> tg3 0000:03:04.0: eth0: Flow control is off for TX and off for RX
> ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> add mc_addr(ha->addr=33:33:ff:5c:00:02)
> add mc_addr(ha->addr=33:33:ff:a3:44:24)
> filters=80020001 00000000 00000000 40000000
> eth0: no IPv6 routers present
> 
> [ ifconfig eth0 allmulti
>    (ip l and ifconfig say ALLMULTI is on)
> ]
> 
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> add mc_addr(ha->addr=33:33:ff:5c:00:02)
> add mc_addr(ha->addr=33:33:ff:a3:44:24)
> filters=80020001 00000000 00000000 40000000
> 
> [
>    $ sudo ifconfig eth0 -allmulti
>    Warning: Interface eth0 still in ALLMULTI mode.
>    (ip l and ifconfig say ALLMULTI is now off)
> ]
> 
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> add mc_addr(ha->addr=33:33:ff:5c:00:02)
> add mc_addr(ha->addr=33:33:ff:a3:44:24)
> filters=80020001 00000000 00000000 40000000
> 
> [ ifconfig eth0 allmulti
>    (same effect)
> ]
> 
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> add mc_addr(ha->addr=33:33:ff:5c:00:02)
> add mc_addr(ha->addr=33:33:ff:a3:44:24)
> filters=80020001 00000000 00000000 40000000
> 
> [
>    $ sudo ifconfig eth0 -allmulti
>    Warning: Interface eth0 still in ALLMULTI mode.
>    (same effect)
> ]
> 
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> add mc_addr(ha->addr=33:33:ff:5c:00:02)
> add mc_addr(ha->addr=33:33:ff:a3:44:24)
> filters=80020001 00000000 00000000 40000000
> 
> 
> > But if the problem remains even with "ifconfig eth0 allmulti", I suspect a
> > NIC firmware problem. (allmulti sets all 128 bits of the filters to 1.)

I suspect Eric is right.

Thomas, can you give me the output of 'ethtool -i eth0'?

> If you expected more bits set in "filters" with allmulti than without it, 
> that doesn't seem to be the case.

"allmulti" has the effect of enabling all 128 bits of the multicast hash
filters.  It doesn't explicitly enable them all though.

> Applied your patch to v2.6.35.
> 

* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 15:58                 ` Thomas Habets
  2010-08-17 17:11                   ` Matt Carlson
@ 2010-08-17 17:13                   ` Eric Dumazet
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2010-08-17 17:13 UTC (permalink / raw)
  To: Thomas Habets; +Cc: linux-kernel, netdev, Matt Carlson, Michael Chan

Le mardi 17 août 2010 à 17:58 +0200, Thomas Habets a écrit :

> If you expected more bits set in "filters" with allmulti than without it, 
> that doesn't seem to be the case.


Nope, the patch displays the mc list and filter bits only if the interface is
neither promiscuous nor allmulti (normal Ethernet mode)

If promiscuous -> a special PROMISC bit is selected on NIC (no display)
If allmulti -> all 128 bits are set (but not displayed in my patch)

I wanted to make sure the correct list of mc addrs is handled on your
machine. It seems to be the case, so there might be a hardware problem
with the multicast rx on this particular NIC





* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 17:11                   ` Matt Carlson
@ 2010-08-17 17:29                     ` Thomas Habets
  2010-08-17 18:31                       ` Matt Carlson
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Habets @ 2010-08-17 17:29 UTC (permalink / raw)
  To: Matt Carlson
  Cc: Thomas Habets, Eric Dumazet, linux-kernel, netdev, Michael Chan

On Tue, 17 Aug 2010, Matt Carlson wrote:
>>> But if the problem remains even with "ifconfig eth0 allmulti", I suspect a
>>> NIC firmware problem. (allmulti sets all 128 bits of the filters to 1.)
> Thomas, can you give me the output of 'ethtool -i eth0'?

$ sudo ethtool -i eth0
driver: tg3
version: 3.110
firmware-version: 5715-v3.28, UMP 1.15
bus-info: 0000:03:04.0


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 17:29                     ` Thomas Habets
@ 2010-08-17 18:31                       ` Matt Carlson
  2010-08-17 18:52                         ` Thomas Habets
  0 siblings, 1 reply; 25+ messages in thread
From: Matt Carlson @ 2010-08-17 18:31 UTC (permalink / raw)
  To: Thomas Habets
  Cc: Matthew Carlson, Eric Dumazet, linux-kernel, netdev, Michael Chan

On Tue, Aug 17, 2010 at 10:29:54AM -0700, Thomas Habets wrote:
> On Tue, 17 Aug 2010, Matt Carlson wrote:
> >>> But if the problem remains even with "ifconfig eth0 allmulti", I suspect a
> >>> NIC firmware problem. (allmulti sets all 128 bits of the filters to 1.)
> > Thomas, can you give me the output of 'ethtool -i eth0'?
> 
> $ sudo ethtool -i eth0
> driver: tg3
> version: 3.110
> firmware-version: 5715-v3.28, UMP 1.15
> bus-info: 0000:03:04.0

Thanks.  I put the question out to the firmware developer.  While we
wait, can you keep Eric's patch in place and give me the results along
with the output of 'ethtool -d eth0 | grep 0x047' after the problem
happens?

Eric's patch shows the hash registers at the time they are programmed.
I'm interested to see if the values change (by firmware) after the
failure.



* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 18:31                       ` Matt Carlson
@ 2010-08-17 18:52                         ` Thomas Habets
  2010-08-18  1:23                           ` Matt Carlson
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Habets @ 2010-08-17 18:52 UTC (permalink / raw)
  To: Matt Carlson
  Cc: Thomas Habets, Eric Dumazet, linux-kernel, netdev, Michael Chan

On Tue, 17 Aug 2010, Matt Carlson wrote:
> Thanks.  I put the question out to the firmware developer.  While we
> wait, can you keep Eric's patch in place and give me the results along
> with the output of 'ethtool -d eth0 | grep 0x047' after the problem
> happens?

Sure.

I think the problem occurs shortly after booting, or is triggered by 
Linux getting a neighbor table entry for the router. The reason it took a 
while for everything to actually stop working is that the router was 
caching and presumably updating its neighbor cache when it saw traffic.

That is, maybe it only works if the router sets up its neighbor table 
first, and not otherwise.

The problem is there now. Last output in the kernel log about this is:

$ dmesg | egrep 'eth0|^add mc|^filters='
[...]
add mc_addr(ha->addr=33:33:00:00:00:01)
add mc_addr(ha->addr=01:00:5e:00:00:01)
add mc_addr(ha->addr=33:33:ff:5c:00:02)
add mc_addr(ha->addr=33:33:ff:a3:44:24)
filters=80020001 00000000 00000000 40000000

$ sudo ethtool -d eth0 | grep 0x047
0x0470	0x80020001
0x0474	0x00000000
0x0478	0x00000000
0x047c	0x40000000

> Eric's patch shows the hash registers at the time they are programmed.
> I'm interested to see if the values change (by firmware) after the
> failure.

They look the same.

But a strange thing is that if I delete the IPv6 neighbor on the Linux 
box (ip ne del 2a00:800:752:1::5c:1 dev eth0) it suddenly answers an ND 
solicitation. I tried it just now and it "wakes it up".

Nothing was written to the kernel log when I ran this command, and the 
ethtool -d output is the same afterwards as it was before. So unless 
there's another code path that changes the registers when I do "ip ne 
del" it may still be something else.


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-17 18:52                         ` Thomas Habets
@ 2010-08-18  1:23                           ` Matt Carlson
  2010-08-18  7:02                             ` Thomas Habets
  0 siblings, 1 reply; 25+ messages in thread
From: Matt Carlson @ 2010-08-18  1:23 UTC (permalink / raw)
  To: Thomas Habets
  Cc: Matthew Carlson, Eric Dumazet, linux-kernel, netdev, Michael Chan

On Tue, Aug 17, 2010 at 11:52:27AM -0700, Thomas Habets wrote:
> On Tue, 17 Aug 2010, Matt Carlson wrote:
> > Thanks.  I put the question out to the firmware developer.  While we
> > wait, can you keep Eric's patch in place and give me the results along
> > with the output of 'ethtool -d eth0 | grep 0x047' after the problem
> > happens?
> 
> Sure.
> 
> I think the problem occurs shortly after booting, or is triggered by 
> Linux getting a neighbor table entry for the router. The reason it took a 
> while for everything to actually stop working is that the router was 
> caching and presumably updating its neighbor cache when it saw traffic.
> 
> That is, maybe it only works if the router sets up its neighbor table 
> first, and not otherwise.
> 
> The problem is there now. Last output in the kernel log about this is:
> 
> $ dmesg | egrep 'eth0|^add mc|^filters='
> [...]
> add mc_addr(ha->addr=33:33:00:00:00:01)
> add mc_addr(ha->addr=01:00:5e:00:00:01)
> add mc_addr(ha->addr=33:33:ff:5c:00:02)
> add mc_addr(ha->addr=33:33:ff:a3:44:24)
> filters=80020001 00000000 00000000 40000000
> 
> $ sudo ethtool -d eth0 | grep 0x047
> 0x0470	0x80020001
> 0x0474	0x00000000
> 0x0478	0x00000000
> 0x047c	0x40000000
> 
> > Eric's patch shows the hash registers at the time they are programmed.
> > I'm interested to see if the values change (by firmware) after the
> > failure.
> 
> They look the same.
> 
> But a strange thing is that if I delete the IPv6 neighbor on the Linux 
> box (ip ne del 2a00:800:752:1::5c:1 dev eth0) it suddenly answers an ND 
> solicitation. I tried it just now and it "wakes it up".
> 
> Nothing was written to the kernel log when I ran this command, and the 
> ethtool -d output is the same afterwards as it was before. So unless 
> there's another code path that changes the registers when I do "ip ne 
> del" it may still be something else.

Do you have access to any diagnostic software that might have come with
your machine?



* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-18  1:23                           ` Matt Carlson
@ 2010-08-18  7:02                             ` Thomas Habets
  2010-09-01  9:21                               ` Thomas Habets
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Habets @ 2010-08-18  7:02 UTC (permalink / raw)
  To: Matt Carlson
  Cc: Thomas Habets, Eric Dumazet, linux-kernel, netdev, Michael Chan

On Tue, 17 Aug 2010, Matt Carlson wrote:
> Do you have access to any diagnostic software that might have come with
> your machine?

I don't know what diagnostic software that would be, nor do other 
people here. So "no", I guess.


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-08-18  7:02                             ` Thomas Habets
@ 2010-09-01  9:21                               ` Thomas Habets
  2010-09-01 13:19                                 ` Eric Dumazet
  2010-09-01 14:40                                 ` Brian Haley
  0 siblings, 2 replies; 25+ messages in thread
From: Thomas Habets @ 2010-09-01  9:21 UTC (permalink / raw)
  To: Thomas Habets
  Cc: Matt Carlson, Eric Dumazet, linux-kernel, netdev, Michael Chan


I've continued this a bit off-list but thought I would summarize for the 
archives.


Summary
-------
It looks like a firmware issue on the network card. When ILO is enabled it 
shares the first network card with the OS. When it does this multicast 
is broken. When multicast (on a L2 level) is broken IPv6 neighbor 
discovery breaks. Only eth0 breaks, eth1 is unaffected.


System
------
HP Proliant DL320 G5p
Xeon 3GHz
1GB RAM
Arch: amd64
NIC: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
Debian Lenny (5.0.5)
Kernels: 2.6.35 mainline, 2.6.33.6
Config: http://pastebin.com/raw.php?i=Y6S8iKW7


Problem
-------
Buggy box will not answer IPv6 ND or ping to ff02::1. May work at some 
point in the boot process, but once box is fully booted it does not.

If I run "clear ipv6 neighbors" on the neighboring Cisco router (or the 
entry times out), that router cannot re-acquire the neighborship with the 
buggy box. Instant IPv6 breakage until I do one of:
* Turn on promisc mode long enough for IPv6 ND to do its thing
* ip ne del <address of neighbor> on the buggy host.


Workarounds
-----------
Any one of these will hide the problem:
* Set promisc mode on interface (ip link set promisc on eth0) forever
* Disable ILO
* Use eth1 instead of eth0.


Troubleshooting
---------------
Got patch for kernel from Eric Dumazet (eric.dumazet@gmail.com) to output 
what MAC addresses are being subscribed to, and some registers from the 
card. Output is earlier in this thread, along with "ethtool -i eth0" and 
some other data.

Managed to get the diagnostic tool[1] booting from a stick (no CD drive in 
the server), but did not set up memory (himem.sys etc.). Running b57udiag 
therefore failed due to insufficient memory at test "Group D. Driver 
Associated tests". Card is assumed to be OK anyway.

Matt Carlson (mcarlson@broadcom.com) suspected firmware bug and asked me 
to try disabling ASF and/or IPMI using the diagnostic tool, but running 
"setasf -d" and "setipmi -d" inside "b57udiag -cmd" did not seem to stick 
across reboot. It stuck properly before reboot (confirmed with setasf -q). 
Also tried "b57udiag -u 0". Tried both C-A-D reboot and powercycling (by 
power cord).

At boot Linux still said ASF[1] for eth0 and ASF[0] for eth1:
tg3 0000:03:04.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
tg3 0000:03:04.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
(this output never changed throughout the process)
ethtool -d eth1 | grep 0x047 did not change either.

Then I disabled ILO and PXE in ILO bios and BIOS respectively. That fixed 
it. eth0 now works with multicast.

I don't use ILO on this server so in this case that fixes it for me, but 
the bug is still there.

At this point Matt thinks I should file a bug report with HP. I will 
attempt to do that.

I have more detailed logs of what I did and when, and what the effect was.


Related
-------
May be the same issue as this:
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263260
Which means it's the same with Ubuntu kernels 2.6.26.3, 2.6.26-5-generic 
and 2.6.27-2-generic, and mainline kernels 2.6.25, 2.6.26 and 2.6.27.


[1] http://www.broadcom.com/support/ethernet_nic/netxtreme_server.php


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-09-01  9:21                               ` Thomas Habets
@ 2010-09-01 13:19                                 ` Eric Dumazet
  2010-09-01 14:40                                 ` Brian Haley
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2010-09-01 13:19 UTC (permalink / raw)
  To: Thomas Habets; +Cc: Matt Carlson, linux-kernel, netdev, Michael Chan

On Wednesday, 01 September 2010 at 11:21 +0200, Thomas Habets wrote:
> I've continued this a bit off-list but thought I would summarize for the 
> archives.
> 
> 
> Summary
> -------
> It looks like a firmware issue on the network card. When ILO is enabled it 
> shares the first network card with the OS. When it does this multicast 
> is broken. When multicast (on a L2 level) is broken IPv6 neighbor 
> discovery breaks. Only eth0 breaks, eth1 is unaffected.
> 
> 
> System
> ------
> HP Proliant DL320 G5p
> Xeon 3GHz
> 1GB RAM
> Arch: amd64
> NIC: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
> Debian Lenny (5.0.5)
> Kernels: 2.6.35 mainline, 2.6.33.6
> Config: http://pastebin.com/raw.php?i=Y6S8iKW7
> 
> 
> Problem
> -------
> Buggy box will not answer IPv6 ND or ping to ff02::1. May work at some 
> point in the boot process, but once box is fully booted it does not.
> 
> If I on the neighboring Cisco router run "clear ipv6 neighbors" (or it 
> times out) that router cannot re-acquire the neighborship with the buggy 
> box. Instant IPv6 breakage until I do one of:
> * Turn on promisc mode long enough for IPv6 ND to do its thing
> * ip ne del <address of neighbor> on the buggy host.
> 
> 
> Workarounds
> -----------
> Either one of these will hide the problem:
> * Set promisc mode on interface (ip link set promisc on eth0) forever
> * Disable ILO
> * Use eth1 instead of eth0.
> 
> 
> Troubleshooting
> ---------------
> Got patch for kernel from Eric Dumazet (eric.dumazet@gmail.com) to output 
> what MAC addresses are being subscribed to, and some registers from the 
> card. Output is earlier in this thread, along with "ethtool -i eth0" and 
> some other data.
> 
> Managed to get diagnostic tool[1] booting from stick (no CD drive in 
> server), but did not set up memory (himem.sys etc..). Running b57udiag 
> it therefore failed due to insufficient memory at test "Group D. Driver 
> Associated tests". Card is assumed to be OK anyway.
> 
> Matt Carlson (mcarlson@broadcom.com) suspected firmware bug and asked me 
> to try disabling ASF and/or IPMI using the diagnostic tool, but running 
> "setasf -d" and "setipmi -d" inside "b57udiag -cmd" did not seem to stick 
> across reboot. It stuck properly before reboot (confirmed with setasf -q). 
> Also tried "b57udiag -u 0". Tried both C-A-D reboot and powercycling (by 
> power cord).
> 
> At boot Linux still said ASF[1] for eth0 and ASF[0] for eth1:
> tg3 0000:03:04.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
> tg3 0000:03:04.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> (this output never changed throughout the process)
> ethtool -d eth1 | grep 0x047 did not change either.
> 
> Then I disabled ILO and PXE in ILO bios and BIOS respectively. That fixed 
> it. eth0 now works with multicast.
> 
> I don't use ILO on this server so in this case that fixes it for me, but 
> the bug is still there.
> 
> At this point Matt thinks I should file a bug report with HP. I will 
> attempt to do that.
> 
> I have more detailed logs of what I did and when, and what the effect was.
> 
> 
> Related
> -------
> May be the same issue as this:
>    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263260
> Which means it's the same with Ubuntu kernels 2.6.26.3, 2.6.26-5-generic 
> and 2.6.27-2-generic, and mainline kernels 2.6.25, 2.6.26 and 2.6.27.
> 
> 
> [1] http://www.broadcom.com/support/ethernet_nic/netxtreme_server.php
> 


Thanks a lot, Thomas, for this very detailed report!




* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-09-01  9:21                               ` Thomas Habets
  2010-09-01 13:19                                 ` Eric Dumazet
@ 2010-09-01 14:40                                 ` Brian Haley
  2010-09-14 19:56                                   ` Thomas Habets
  1 sibling, 1 reply; 25+ messages in thread
From: Brian Haley @ 2010-09-01 14:40 UTC (permalink / raw)
  To: Thomas Habets
  Cc: Matt Carlson, Eric Dumazet, linux-kernel, netdev, Michael Chan

Hi Thomas,

On 09/01/2010 05:21 AM, Thomas Habets wrote:
> 
> I've continued this a bit off-list but thought I would summarize for the
> archives.
> 
> 
> Summary
> -------
> It looks like a firmware issue on the network card. When ILO is enabled
> it shares the first network card with the OS. When it does this
> multicast is broken. When multicast (on a L2 level) is broken IPv6
> neighbor discovery breaks. Only eth0 breaks, eth1 is unaffected.

So are you running with this set to "Shared Network Port" mode?  I'm
guessing you are.

> System
> ------
> HP Proliant DL320 G5p
> Xeon 3GHz
> 1GB RAM
> Arch: amd64
> NIC: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)

There was another report on netdev back in 11/2008 on this exact hardware,
with the same problem.

> Problem
> -------
> Buggy box will not answer IPv6 ND or ping to ff02::1. May work at some
> point in the boot process, but once box is fully booted it does not.

I dug up my notes on the problem, and from what I can tell, the receive
multicast filters on the NIC were getting removed, causing both incoming
IPv6 and IPv4 multicast packets to get dropped.  I'm not sure if there
was ever a fix developed, or if we ever came to a conclusion on where the
bug was - iLO, tg3, or some other area.
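
To make the dependency on the multicast filter concrete, here is a small
Python sketch (standard library only) that computes, for an IPv6 unicast
address, the solicited-node multicast group that neighbor solicitations are
sent to and the Ethernet multicast MAC the NIC has to accept for that group.
The two addresses are the ones configured on eth0 in this thread. The helper
names are made up for illustration. This only shows the standard address
mapping, it says nothing about how tg3 or the firmware programs the filter.

```python
import ipaddress


def solicited_node(addr: str) -> ipaddress.IPv6Address:
    """Solicited-node group: ff02::1:ff00:0/104 plus the low 24 bits of addr."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group) -> str:
    """Ethernet MAC for an IPv6 multicast group: 33:33 plus its low 32 bits."""
    low32 = int(ipaddress.IPv6Address(group)) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{o:02x}" for o in octets)


for addr in ("2a00:800:752:1::5c:2", "fe80::224:81ff:fea3:4424"):
    group = solicited_node(addr)
    print(addr, "->", group, "->", multicast_mac(group))

# The all-nodes group used by "ping6 ff02::1" maps the same way.
print("ff02::1 ->", multicast_mac("ff02::1"))
```

If frames to those 33:33:... destinations are dropped by the receive filter,
the host never sees the router's neighbor solicitations, which would match
the symptoms described.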

-Brian


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-09-01 14:40                                 ` Brian Haley
@ 2010-09-14 19:56                                   ` Thomas Habets
  2010-09-15 17:37                                     ` Brian Haley
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Habets @ 2010-09-14 19:56 UTC (permalink / raw)
  To: Brian Haley
  Cc: Matt Carlson, Eric Dumazet, linux-kernel, netdev, Michael Chan


Sorry for the late reply. I've been swamped.

On Wed, 1 Sep 2010, Brian Haley wrote:
> So are you running with this set to "Shared Network Port" mode?  I'm
> guessing you are.

Yes, there's no dedicated ILO port.

> There was another report on netdev back in 11/2008 on this exact hardware,
> with the same problem.

I can't seem to find it. Do you happen to have the subject line or 
something?

> I dug-up my notes on the problem, and from what I can tell, the receive
> multicast filters on the NIC were getting removed, causing both incoming
> IPv6 and IPv4 multicast packets to get dropped.

Sounds about right. From what I understand the relevant registers were 
still the same for me when it wasn't working though (if that indeed is 
how the filter is implemented).

---------
typedef struct me_s {
   char name[]      = { "Thomas Habets" };
   char email[]     = { "thomas@habets.pp.se" };
   char kernel[]    = { "Linux" };
   char *pgpKey[]   = { "http://www.habets.pp.se/pubkey.txt" };
   char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE  0945 286A E90A AD48 E854" };
   char coolcmd[]   = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;


* Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset
  2010-09-14 19:56                                   ` Thomas Habets
@ 2010-09-15 17:37                                     ` Brian Haley
  0 siblings, 0 replies; 25+ messages in thread
From: Brian Haley @ 2010-09-15 17:37 UTC (permalink / raw)
  To: Thomas Habets
  Cc: Matt Carlson, Eric Dumazet, linux-kernel, netdev, Michael Chan

On 09/14/2010 03:56 PM, Thomas Habets wrote:
> 
> Sorry for the late reply. I've been swamped.
> 
> On Wed, 1 Sep 2010, Brian Haley wrote:
>> So are you running with this set to "Shared Network Port" mode?  I'm
>> guessing you are.
> 
> Yes, there's no dedicated ILO port.
> 
>> There was another report on netdev back in 11/2008 on this exact
>> hardware,
>> with the same problem.
> 
> I can't seem to find it. Do you happen to have the subject line or
> something?

It was actually a month earlier in 2008 (I mistyped); here's the link:

http://marc.info/?l=linux-netdev&m=122280545121251&w=2

>> I dug-up my notes on the problem, and from what I can tell, the receive
>> multicast filters on the NIC were getting removed, causing both incoming
>> IPv6 and IPv4 multicast packets to get dropped.
> 
> Sounds about right. From what I understand the relevant registers were
> still the same for me when it wasn't working though (if that indeed is
> how the filter is implemented).

One of the outcomes of that investigation was to update the firmware
and/or iLO, but I'm not sure if either fixed the problem.

-Brian


* BUG: IPv6 stops working after a while, needs ip ne del command to reset
@ 2010-08-06  8:25 Thomas Habets
  0 siblings, 0 replies; 25+ messages in thread
From: Thomas Habets @ 2010-08-06  8:25 UTC (permalink / raw)
  To: netdev


IPv6 initially works, but when I leave it alone overnight I'm unable to 
ping even my default gw.

Static global IPv6 addresses configured on both ends. No access lists on 
either end.

Kernel version: 2.6.35 mainline (amd64) and 2.6.33.6.
Kernel config: http://pastebin.com/raw.php?i=Y6S8iKW7
Dist: Debian Lenny (5.0.5), nothing special to my knowledge.

I seem to have the same issue that Mikael Abrahamsson encountered with 
Ubuntu kernels 2.6.26.3, 2.6.26-5-generic and 2.6.27-2-generic, and 
mainline kernels 2.6.25, 2.6.26 and 2.6.27:
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263260

He got IPv6 running again without rebooting using "networking stop, 
ifconfig eth0 down, networking start, kill dhclient", while I narrowed it 
down to just deleting the ipv6 neighbor (ip ne del..., see below). 
Rebooting also causes it to start working again.

It's very reproducible. I just leave it overnight and it breaks every 
time.

I am willing and able to try patches at any time, the box is not in 
production.

No iptables, no ip6tables. IP6tables support is not even compiled in.

NIC is "Broadcom Corporation NetXtreme BCM5715 Gigabit ethernet (rev a3)"
according to lspci.

Other end is a directly connected Cisco 7600 (routed port) that I have 
access to, but it's in production use. IPv4 works perfectly over this same 
port. Only lo and eth0 are UP.


Output when broken
------------------
$ uname -a
Linux XXXXX 2.6.35 #1 SMP Tue Aug 3 09:25:51 CEST 2010 x86_64
GNU/Linux

$ ip -6 a sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436
     inet6 2a00:800:1000:64::1/128 scope global
        valid_lft forever preferred_lft forever
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
     inet6 2a00:800:752:1::5c:2/112 scope global
        valid_lft forever preferred_lft forever
     inet6 fe80::224:81ff:fea3:4424/64 scope link
        valid_lft forever preferred_lft forever

(I have tried removing 2a00:800:1000:64::1/128 from lo, same issue)

$ ip -6 r sh
2a00:800:752:1::5c:0/112 dev eth0  proto kernel  metric 256  mtu 1500 advmss 14 hoplimit 4294967295
unreachable 2a00:800:1000:64::1 dev lo proto kernel  metric 256  error -101 mtu 16436 advmss 16376 hoplimit 4294967295
fe80::/64 dev eth0  proto kernel  metric 256  mtu 1500 advmss 1440 hoplimit 4294967295
default via 2a00:800:752:1::5c:1 dev eth0  metric 1024  mtu 1500 advmss 1440 hoplimit 4294967295

$ ping6 2a00:800:752:1::5c:1
PING 2a00:800:752:1::5c:1(2a00:800:752:1::5c:1) 56 data bytes
^C
--- 2a00:800:752:1::5c:1 ping statistics ---
22 packets transmitted, 0 received, 100% packet loss, time 21006ms


# Tcpdump on the problem machine shows mostly the pings, but also 
periodically some ND:

[...]
12:54:02.683672 00:24:81:a3:44:24 > 00:22:55:17:4b:80, ethertype IPv6
(0x86dd), length 118: 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, 
echo request, seq 12, length 64
12:54:02.693669 00:24:81:a3:44:24 > 00:22:55:17:4b:80, ethertype IPv6
(0x86dd), length 86: fe80::224:81ff:fea3:4424 > 2a00:800:752:1::5c:1: 
ICMP6, neighbor solicitation, who has 2a00:800:752:1::5c:1, length 32
12:54:02.693832 00:22:55:17:4b:80 > 00:24:81:a3:44:24, ethertype IPv6
(0x86dd), length 78: 2a00:800:752:1::5c:1 > fe80::224:81ff:fea3:4424: 
ICMP6, neighbor advertisement, tgt is 2a00:800:752:1::5c:1, length 24
12:54:03.683672 00:24:81:a3:44:24 > 00:22:55:17:4b:80, ethertype IPv6
(0x86dd), length 118: 2a00:800:752:1::5c:2 > 2a00:800:752:1::5c:1: ICMP6, 
echo request, seq 13, length 64
[...]

$ ip -6 ne
fe80::222:55ff:fe17:4b80 dev eth0 lladdr 00:22:55:17:4b:80 router STALE
2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router STALE
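
As a sanity check on the output above, the link-local addresses on both ends
are consistent with the usual modified EUI-64 derivation from the MAC
address. A minimal Python sketch (standard library only; the helper name
eui64_link_local is made up for illustration, and it assumes the interface
identifier is derived from the MAC rather than configured by hand):

```python
import ipaddress


def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
    """Build fe80::/64 + modified EUI-64 interface ID from a 48-bit MAC."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the OUI and the device-specific half.
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    prefix = bytes.fromhex("fe80") + bytes(6)  # fe80:0000:0000:0000
    return ipaddress.IPv6Address(prefix + iid)


print(eui64_link_local("00:24:81:a3:44:24"))  # MAC of the problem host (eth0)
print(eui64_link_local("00:22:55:17:4b:80"))  # MAC of the Cisco router
```

Both results should match the fe80:: addresses seen in "ip -6 a sh" and
"ip -6 ne", so the lladdr entries in the neighbor table look correct even
while the entries sit in STALE.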


Fixing the adjacency
--------------------
$ ping6 2a00:800:752:1::5c:1
PING 2a00:800:752:1::5c:1(2a00:800:752:1::5c:1) 56 data bytes
^C
--- 2a00:800:752:1::5c:1 ping statistics ---
51 packets transmitted, 0 received, 100% packet loss, time 50006ms

$ sudo ip ne del 2a00:800:752:1::5c:1 dev eth0
$ ping6 2a00:800:752:1::5c:1
PING 2a00:800:752:1::5c:1(2a00:800:752:1::5c:1) 56 data bytes
64 bytes from 2a00:800:752:1::5c:1: icmp_seq=1 ttl=64 time=31.9 ms
64 bytes from 2a00:800:752:1::5c:1: icmp_seq=2 ttl=64 time=0.212 ms

$ ip -6 ne
fe80::222:55ff:fe17:4b80 dev eth0 lladdr 00:22:55:17:4b:80 router REACHABLE
2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router REACHABLE

(Note that after a few minutes it goes back to STALE, but pinging still 
works and brings back the state to REACHABLE, so it's not that it can't 
get out of STALE once there, it seems).
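
To watch the neighbor states over a longer period, something like the
following Python sketch could be used. It parses text in the format printed
by "ip -6 ne" above. It works on captured text only and does not run ip
itself; the function name parse_neighbors and the dictionary keys are
assumptions for illustration.

```python
def parse_neighbors(text: str) -> list:
    """Parse "ip -6 ne" style lines into dicts; the state is the last field."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or "dev" not in fields:
            continue  # skip blank or unexpected lines
        # Entries without a resolved link-layer address have no lladdr field.
        lladdr = None
        if "lladdr" in fields:
            lladdr = fields[fields.index("lladdr") + 1]
        entries.append({
            "address": fields[0],
            "dev": fields[fields.index("dev") + 1],
            "lladdr": lladdr,
            "router": "router" in fields,
            "state": fields[-1],
        })
    return entries


broken = """\
fe80::222:55ff:fe17:4b80 dev eth0 lladdr 00:22:55:17:4b:80 router STALE
2a00:800:752:1::5c:1 dev eth0 lladdr 00:22:55:17:4b:80 router STALE
"""

for entry in parse_neighbors(broken):
    print(entry["address"], entry["state"])
```

Logging this once a minute alongside a ping result would show whether the
breakage lines up with a particular state transition.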

---------
typedef struct me_s {
   char name[]      = { "Thomas Habets" };
   char email[]     = { "thomas@habets.pp.se" };
   char kernel[]    = { "Linux" };
   char *pgpKey[]   = { "http://www.habets.pp.se/pubkey.txt" };
   char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE  0945 286A E90A AD48 E854" };
   char coolcmd[]   = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;


