* bnx2 cards intermittantly going offline
@ 2011-01-18 10:54 Mills, Tony
  2011-01-18 17:55 ` Michael Chan
  2011-11-15 17:41 ` Ken
  0 siblings, 2 replies; 8+ messages in thread
From: Mills, Tony @ 2011-01-18 10:54 UTC
  To: netdev

Hi, 

I was running Debian lenny 64-bit with a 2.6.24 kernel, which seemed to have a rather old version of the bnx2 driver. I have been getting reports of connectivity issues; they seem to happen randomly across many different servers in different data centres.

Further investigation showed that the interfaces become completely unresponsive for periods of time: other machines are unable to ping the affected host, the affected server is unable to ping out, and our time-critical TCP application kicks its connections off.

The network cards are Broadcom NetXtreme II BCM5708 Gigabit Ethernet cards in Dell 2950s plugged into Cisco 3750Es.

Reading various posts suggested that I might be experiencing a problem that has already been solved, so I attempted to build the drivers from the Broadcom website against the 2.6.24 kernel, without success; eventually compiling against a 2.6.32 kernel worked well.

I have installed this on 4 machines in different data centres and followed some of the other posts I have found in an attempt to fix the issues; however, none of the things I have tried appear to be effective:

1. I was seeing rx_fw_discards, so I raised the RX ring buffer to 1020 and to 4080. This stopped the rx_fw_discards but not the unresponsiveness.
2. I enabled flow control on one of the machines. It still shows the unresponsive behaviour, and the port on the switch shows 0 pause frames received.
3. I upgraded the kernel and driver to the latest available.

Last night I set up a machine to monitor overnight, and at 3:52 this morning it became unresponsive.

The switch was set to "flowcontrol desired", and the machine had the following settings:

ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             4080
RX Mini:        0
RX Jumbo:       16320
TX:             255
Current hardware settings:
RX:             4080
RX Mini:        0
RX Jumbo:       0
TX:             255

ethtool -a eth0
Pause parameters for eth0:
Autonegotiate:  on
RX:             on
TX:             on

The output from ethtool -S eth0

NIC statistics:
     rx_bytes: 65832403312
     rx_error_bytes: 0
     tx_bytes: 141615699363
     tx_error_bytes: 0
     rx_ucast_packets: 565468011
     rx_mcast_packets: 3
     rx_bcast_packets: 193008
     tx_ucast_packets: 768277404
     tx_mcast_packets: 8
     tx_bcast_packets: 657
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 0
     rx_align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     tx_deferred: 0
     tx_excess_collisions: 0
     tx_late_collisions: 0
     tx_total_collisions: 0
     rx_fragments: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_oversize_packets: 0
     rx_64_byte_packets: 398958533
     rx_65_to_127_byte_packets: 125222178
     rx_128_to_255_byte_packets: 16962519
     rx_256_to_511_byte_packets: 6100929
     rx_512_to_1023_byte_packets: 2314593
     rx_1024_to_1522_byte_packets: 16102270
     rx_1523_to_9022_byte_packets: 0
     tx_64_byte_packets: 331974057
     tx_65_to_127_byte_packets: 239480821
     tx_128_to_255_byte_packets: 78102231
     tx_256_to_511_byte_packets: 33163946
     tx_512_to_1023_byte_packets: 57321357
     tx_1024_to_1522_byte_packets: 28235657
     tx_1523_to_9022_byte_packets: 0
     rx_xon_frames: 0
     rx_xoff_frames: 0
     tx_xon_frames: 0
     tx_xoff_frames: 0
     rx_mac_ctrl_frames: 0
     rx_filtered_packets: 43417
     rx_ftq_discards: 0
     rx_discards: 0
     rx_fw_discards: 0

The switch port showed no pause frames:

show interfaces gigabitEthernet X/X/X flowcontrol 
Port       Send FlowControl  Receive FlowControl  RxPause TxPause
           admin    oper     admin    oper                       
---------  -------- -------- -------- --------    ------- -------
GiX/X/X   Unsupp.  Unsupp.  desired  on          0       0   

(The switch is unable to send flow control packets but can receive them.)

Would someone be able to point me at anything else that may help identify or fix the issue?

Best Regards

Tony Mills
-- 
IMPORTANT NOTICE

The sender does not guarantee that this message, including any attachment, is secure
or virus free. Also, it is confidential and may be privileged or otherwise protected
from disclosure. If you are not the intended recipient, do not disclose or copy it
or its contents. Please telephone or email the sender and delete the message
entirely from your system.
Jagex Limited is a company registered in England & Wales with company number
03982706 and a registered office at St John's Innovation Centre, Cowley Road, 
Cambridge, CB4 0WS, UK.

* Re: bnx2 cards intermittantly going offline
  2011-01-18 10:54 bnx2 cards intermittantly going offline Mills, Tony
@ 2011-01-18 17:55 ` Michael Chan
  2011-01-26 12:44   ` Mills, Tony
  2012-09-13 13:51   ` Marc A. Donges
  2011-11-15 17:41 ` Ken
  1 sibling, 2 replies; 8+ messages in thread
From: Michael Chan @ 2011-01-18 17:55 UTC
  To: Mills, Tony; +Cc: netdev


On Tue, 2011-01-18 at 02:54 -0800, Mills, Tony wrote:
> Last night i setup a machine to monitor overnight and at 3:52 this
> morning it became unresponsive. 
> 

When it becomes unresponsive, please send some packets to the NIC (such
as ping) and monitor statistics with ethtool -S.  See if the packets are
being received or discarded.  Also, run tcpdump on the machine to see if
the packets are properly received by the stack.  Thanks.



* RE: bnx2 cards intermittantly going offline
  2011-01-18 17:55 ` Michael Chan
@ 2011-01-26 12:44   ` Mills, Tony
  2012-09-13 13:51   ` Marc A. Donges
  1 sibling, 0 replies; 8+ messages in thread
From: Mills, Tony @ 2011-01-26 12:44 UTC
  To: Michael Chan; +Cc: netdev

Hi, thanks for your response.

I have done some further investigation and found that a massive number of interrupts were occurring on our Broadcom cards. I have set the affinity for IRQs 36 and 48 to the first two vCPUs on the box and set the Java processes to run on the others. I have also replaced the Debian bnx2 driver with the latest from the Broadcom website and set the RX ring buffer to 4080. This has stopped a multiplexed server running at 1.6 cycles per second from missing cycles due to interrupts on the interface and allows much better processing time. With the ring buffer at 4080, the rx_fw_discards I was seeing periodically have stopped (raising it to 1020 or 2040 did not sort the issue, but the maximum setting from the Broadcom driver does).

I am now monitoring the system to see if the card ever becomes unresponsive. But I do have a question.

If I set up smp_affinity with a mask of CPUs 0 and 1 (the first two on the box) for the APIC-fasteoi IRQs of the Ethernet devices, the kernel does not appear to balance: it continues to use the same CPU for the interrupts even though there is processing power available on the other one. Is this a known issue, or am I doing something wrong?

I should add that this is on a Dell R610 with a 12-core Intel Xeon X5680, which shows up as 24 vCPUs.
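
For readers following along: smp_affinity takes a hexadecimal bitmask with one bit per CPU. The following is a minimal Python sketch of how such a mask is derived; the helper name cpu_list_to_mask is invented for illustration and is not part of any kernel tooling, and the sketch assumes fewer than 32 CPUs per mask word.

```python
# Sketch only: derive the hex bitmask written to /proc/irq/<N>/smp_affinity.
# cpu_list_to_mask is a hypothetical helper name, not an existing tool.

def cpu_list_to_mask(cpus):
    """Return the hex bitmask (without 0x prefix) for the given CPU indices."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu  # bit N set means CPU N may service the IRQ
    return format(mask, "x")


if __name__ == "__main__":
    # CPUs 0 and 1, as used for the NIC IRQs in this message.
    print(cpu_list_to_mask([0, 1]))
    # The remaining CPUs 2..23 of a 24-vCPU box.
    print(cpu_list_to_mask(range(2, 24)))
```

For CPUs 0 and 1 this gives the mask 3, which would be the value written for IRQs 36 and 48 in the setup described above.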

Best Regards

Tony Mills


* Re: bnx2 cards intermittantly going offline
  2011-01-18 10:54 bnx2 cards intermittantly going offline Mills, Tony
  2011-01-18 17:55 ` Michael Chan
@ 2011-11-15 17:41 ` Ken
  1 sibling, 0 replies; 8+ messages in thread
From: Ken @ 2011-11-15 17:41 UTC
  To: netdev

+1 with identical L2 components and symptoms.

* Re: bnx2 cards intermittantly going offline
  2011-01-18 17:55 ` Michael Chan
  2011-01-26 12:44   ` Mills, Tony
@ 2012-09-13 13:51   ` Marc A. Donges
  2012-09-13 15:45     ` Sven Ulland
  1 sibling, 1 reply; 8+ messages in thread
From: Marc A. Donges @ 2012-09-13 13:51 UTC
  To: netdev; +Cc: Michael Chan

[This is a reply to a somewhat older thread]

"Michael Chan" wrote:
> On Tue, 2011-01-18 at 02:54 -0800, Mills, Tony wrote:
>> Last night i setup a machine to monitor overnight and at 3:52 this
>> morning it became unresponsive. 
>> 
> 
> When it becomes unresponsive, please send some packets to the NIC (such
> as ping) and monitor statistics with ethtool -S.  See if the packets are
> being received or discarded.  Also, run tcpdump on the machine to see if
> the packets are properly received by the stack.  Thanks.

Hi Michael, hi netdev,

I appear to be having the same problem as Tony (or at least a problem matching
his description).

The machine uses the BCM5709 chipset:

03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
03:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
04:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

It is running Debian stable (with the Debian stable firmware-bnx2 package).

After 55 days of operation the machine (A) suddenly was no longer reachable via
network. Strangely, a second machine (B) that should take over the IP addresses
(keepalived) did not take over. Only after shutting the switchport to which A
is attached did B take over.

Logging in to the machine via serial, I noticed that it did not receive any
packets via the network interface (after unshutting the switchport), only
traffic sent by the host A was visible in tcpdump, no traffic that was sent to
it (there should have been at least ARP traffic). In order to verify this, I
dumped traffic on another host in the broadcast domain and indeed, the traffic
sent out by A is seen on the network, it just doesn't receive any that is sent
to it.

This explains the lack of failover of keepalived, because A still considers
itself master and is able to announce that to the network, while it cannot see
the packets from its partner B (that wants to take over because of its,
meanwhile, higher priority).

No neighbors see the machine in their ARP tables any more.

I think the number of packets that are sent to the host is reflected in the
interface variable rx_ftq_discards: It increases by about 10 per second while
idle, and by about 80 per second when I send floodpings to the machine. Here
you see a dump of the interface statistics spaced ten seconds apart, while
floodpinging the host:

A:~# ethtool -S eth0; sleep 10; echo ---; ethtool -S eth0
NIC statistics:
     rx_bytes: 35498373071360
     rx_error_bytes: 0
     tx_bytes: 35475382869262
     tx_error_bytes: 0
     rx_ucast_packets: 45479514105
     rx_mcast_packets: 9800399
     rx_bcast_packets: 4901866
     tx_ucast_packets: 45364190447
     tx_mcast_packets: 7285029
     tx_bcast_packets: 3111
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 0
     rx_align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     tx_deferred: 0
     tx_excess_collisions: 0
     tx_late_collisions: 0
     tx_total_collisions: 0
     rx_fragments: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_oversize_packets: 0
     rx_64_byte_packets: 3465587589
     rx_65_to_127_byte_packets: 422897833
     rx_128_to_255_byte_packets: 3996306350
     rx_256_to_511_byte_packets: 1500221686
     rx_512_to_1023_byte_packets: 1351649898
     rx_1024_to_1522_byte_packets: 397814646
     rx_1523_to_9022_byte_packets: 0
     tx_64_byte_packets: 3451623430
     tx_65_to_127_byte_packets: 366024709
     tx_128_to_255_byte_packets: 3954496418
     tx_256_to_511_byte_packets: 1499757422
     tx_512_to_1023_byte_packets: 1351506958
     tx_1024_to_1522_byte_packets: 388331444
     tx_1523_to_9022_byte_packets: 0
     rx_xon_frames: 0
     rx_xoff_frames: 0
     tx_xon_frames: 81
     tx_xoff_frames: 81
     rx_mac_ctrl_frames: 0
     rx_filtered_packets: 26701433
     rx_ftq_discards: 1796839
     rx_discards: 369
     rx_fw_discards: 0
---
NIC statistics:
     rx_bytes: 35498373162770
     rx_error_bytes: 0
     tx_bytes: 35475382869262
     tx_error_bytes: 0
     rx_ucast_packets: 45479514920
     rx_mcast_packets: 9800483
     rx_bcast_packets: 4901876
     tx_ucast_packets: 45364190447
     tx_mcast_packets: 7285029
     tx_bcast_packets: 3111
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 0
     rx_align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     tx_deferred: 0
     tx_excess_collisions: 0
     tx_late_collisions: 0
     tx_total_collisions: 0
     rx_fragments: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_oversize_packets: 0
     rx_64_byte_packets: 3465587625
     rx_65_to_127_byte_packets: 422898706
     rx_128_to_255_byte_packets: 3996306350
     rx_256_to_511_byte_packets: 1500221686
     rx_512_to_1023_byte_packets: 1351649898
     rx_1024_to_1522_byte_packets: 397814646
     rx_1523_to_9022_byte_packets: 0
     tx_64_byte_packets: 3451623430
     tx_65_to_127_byte_packets: 366024709
     tx_128_to_255_byte_packets: 3954496418
     tx_256_to_511_byte_packets: 1499757422
     tx_512_to_1023_byte_packets: 1351506958
     tx_1024_to_1522_byte_packets: 388331444
     tx_1523_to_9022_byte_packets: 0
     rx_xon_frames: 0
     rx_xoff_frames: 0
     tx_xon_frames: 81
     tx_xoff_frames: 81
     rx_mac_ctrl_frames: 0
     rx_filtered_packets: 26701433
     rx_ftq_discards: 1797748
     rx_discards: 369
     rx_fw_discards: 0
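
To compare two such snapshots without reading every line, the counters can be diffed programmatically. The following is a minimal Python sketch and not part of the original report; the names parse_ethtool_stats and changed_counters are invented for illustration, and the sample values are a few counters copied from the two dumps above.

```python
# Sketch only: helper names are illustrative, not part of ethtool.

def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and anything non-numeric.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def changed_counters(before, after, seconds):
    """Return {name: (delta, per_second)} for counters present in both
    snapshots whose value changed between them."""
    changes = {}
    for name, new_value in after.items():
        if name not in before:
            continue
        delta = new_value - before[name]
        if delta != 0:
            changes[name] = (delta, delta / seconds)
    return changes


# A few counters copied from the first and second dump.
FIRST = """NIC statistics:
     rx_ucast_packets: 45479514105
     tx_ucast_packets: 45364190447
     rx_ftq_discards: 1796839
     rx_discards: 369
"""
SECOND = """NIC statistics:
     rx_ucast_packets: 45479514920
     tx_ucast_packets: 45364190447
     rx_ftq_discards: 1797748
     rx_discards: 369
"""

if __name__ == "__main__":
    changes = changed_counters(
        parse_ethtool_stats(FIRST), parse_ethtool_stats(SECOND), seconds=10
    )
    for name, (delta, rate) in sorted(changes.items()):
        print(f"{name}: {delta:+d} ({rate:.1f}/s)")
```

On these samples only the rx counters move: rx_ftq_discards rises by 909 over the ten seconds, while tx_ucast_packets and rx_discards stay flat.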

The number of interrupts for the NIC is no longer increasing on host A. It is increasing on the otherwise identical and now active host B.

A:~# cat /proc/interrupts | fgrep eth0; sleep 10; echo ---; cat /proc/interrupts | fgrep eth0
  74:    7353715          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-0
  75:  150160682          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-1
  76:  261739096          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-2
  77: 3118389637          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-3
  78: 3538415303          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-4
  79: 3437432016          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-5
  80: 4130864322          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-6
  81: 3844677189          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-7
---
  74:    7353715          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-0
  75:  150160682          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-1
  76:  261739096          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-2
  77: 3118389637          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-3
  78: 3538415303          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-4
  79: 3437432016          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-5
  80: 4130864322          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-6
  81: 3844677189          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-7

B:~# cat /proc/interrupts | fgrep eth0; sleep 10; echo ---; cat /proc/interrupts | fgrep eth0
  74:    8496700          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-0
  75: 2605649299          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-1
  76: 2278350057          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-2
  77: 2119009356          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-3
  78: 2004958460          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-4
  79: 2005171437          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-5
  80: 2318332903          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-6
  81: 2087470150          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-7
---
  74:    8496713          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-0
  75: 2605688265          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-1
  76: 2278397958          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-2
  77: 2119043500          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-3
  78: 2005000430          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-4
  79: 2005205617          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-5
  80: 2318373260          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-6
  81: 2087518969          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-7
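
The same before/after comparison can be automated for /proc/interrupts. This is a rough Python sketch under stated assumptions: the function names irq_totals and stalled_queues are made up for illustration, the device name is assumed to be the last field on each line, and the sample lines are the first two queues of hosts A and B from the listings above.

```python
# Sketch only: sums the per-CPU columns of /proc/interrupts for one device
# and reports queues whose interrupt count did not move between snapshots.

def irq_totals(text, device):
    """Return {queue_name: total_interrupts} for lines whose last field
    starts with `device` (e.g. "eth0" matches "eth0-0")."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue
        if not fields[-1].startswith(device):
            continue
        count = 0
        for field in fields[1:]:
            if not field.isdigit():
                break  # reached the interrupt controller / type column
            count += int(field)
        totals[fields[-1]] = count
    return totals


def stalled_queues(before, after):
    """Return the sorted queue names whose count is unchanged."""
    return sorted(name for name in before if after.get(name) == before[name])


A_BEFORE = """
  74:    7353715          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-0
  75:  150160682          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-1
"""
A_AFTER = A_BEFORE  # host A: counters identical ten seconds later

B_BEFORE = """
  74:    8496700          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-0
  75: 2605649299          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-1
"""
B_AFTER = """
  74:    8496713          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-0
  75: 2605688265          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth0-1
"""

if __name__ == "__main__":
    print("A stalled:", stalled_queues(irq_totals(A_BEFORE, "eth0"),
                                       irq_totals(A_AFTER, "eth0")))
    print("B stalled:", stalled_queues(irq_totals(B_BEFORE, "eth0"),
                                       irq_totals(B_AFTER, "eth0")))
```

With these samples both queues of host A are reported as stalled and none on host B, matching the listings.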

There are no (significant) interface errors on the switchport of machine A (Cisco 6500):
  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 3354643
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 73000 bits/sec, 90 packets/sec
     139005756894 packets input, 106028470724434 bytes, 0 no buffer
     Received 41673355 broadcasts (41644823 multicasts)
     0 runts, 0 giants, 0 throttles 
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     139565849434 packets output, 106109148647056 bytes, 0 underruns
     0 output errors, 0 collisions, 3 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

For reference, switchport of machine B:
  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 561319
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 168420000 bits/sec, 27846 packets/sec
  5 minute output rate 168547000 bits/sec, 27951 packets/sec
     12477681177 packets input, 9891434829664 bytes, 0 no buffer
     Received 4452361 broadcasts (4434737 multicasts)
     0 runts, 0 giants, 0 throttles 
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     12725512555 packets output, 9944380037353 bytes, 0 underruns
     0 output errors, 0 collisions, 2 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

This error occurred about five hours ago; the interface has not recovered.

We have five pairs of basically identical machines performing the same task
(each pair for one site). The error has not occurred with any other one, but
this site is the busiest:

eth0      Link encap:Ethernet  HWaddr 3c:d9:2b:ef:f6:3c  
          inet addr:172.16.100.23  Bcast:172.16.100.63  Mask:255.255.255.192
          inet6 addr: fe80::3ed9:2bff:feef:f63c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:45494315484 errors:1896322 dropped:1896322 overruns:0 frame:1896322
          TX packets:45371478602 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:35498383041926 (32.2 TiB)  TX bytes:35475382870222 (32.2 TiB)
          Interrupt:30 Memory:f4000000-f4012800 

The host performs NAT, input and output interface being eth0, therefore the RX and TX counters are similar.

I would appreciate any suggestions for diagnosing this further.

Kind regards
Marc

* Re: bnx2 cards intermittantly going offline
  2012-09-13 13:51   ` Marc A. Donges
@ 2012-09-13 15:45     ` Sven Ulland
  2012-09-13 20:30       ` Michael Chan
  2012-09-16  3:47       ` Ben Hutchings
  0 siblings, 2 replies; 8+ messages in thread
From: Sven Ulland @ 2012-09-13 15:45 UTC
  To: netdev; +Cc: Marc A. Donges, Michael Chan

On 09/13/2012 03:51 PM, Marc A. Donges wrote:
> After 55 days of operation the machine (A) suddenly was no longer
> reachable via network. Strangely, a second machine (B) that should
> take over the IP addresses (keepalived) did not take over. Only
> after shutting the switchport to which A is attached did B take
> over.

Hi. We've had the same symptom with our BCM5709S [14e4:163a] on
Debian. Like you, we were on stable's 2.6.32-41squeeze2. Google led us
to many similar issues [1,2,3]. They concluded with the fix being in
mainline commit c441b8d2 [4]: "bnx2: Fix lost MSI-X problem on 5709
NICs".

Broadcom: Can you publish a tool that decodes ethtool -d dumps to make
debugging easier, or do you deem it no longer necessary with the
register dump commits in 555069da?

Now, Debian's 2.6.32-41squeeze2 is based on longterm release 2.6.32.54
[5]. That version includes commit 0b7817ed [6], which is a backport of
the already mentioned mainline commit c441b8d2.

So we tried digging further and applying some seemingly relevant
commits [7,8] to our 2.6.32, but without any change in behaviour. Our
temporary fix was to run 'ethtool -t ethX' to reset the device every
time it locked up.

This dragged on with various builds, until we ended up on mainline
2.6.38 where we no longer saw any symptoms. I don't know in which
kernel version it was fixed, but we ended up on that one, sort of by
chance. Unfortunately, it had severe issues with kswapd memory
compaction causing CPU soft lockups [9], so we went straight to
squeeze-backports' 3.2.23-1~bpo60+2. We've been happy since then.

> We have five pairs of basically identical machines performing the
> same task (each pair for one site). The error has not occurred with
> any other one, but this site is the busiest:

We also saw the issue only at a site with generally higher load
compared to other sites.

I'd love to know exactly which commit fixed the issue, but it's fairly
tricky to reproduce the issue, and the bisect count is fairly high (it
need not be a specific fix for bnx2).

sven


[1]: bnx2 driver crashes under random circumstances
https://bugzilla.redhat.com/show_bug.cgi?id=520888

[2]: Access denied. Come on, Red Hat!
https://bugzilla.redhat.com/show_bug.cgi?id=511368

[3]: NIC doesn't register packets [rhel-5.5.z]
https://bugzilla.redhat.com/show_bug.cgi?id=587799

[4]: bnx2: Fix lost MSI-X problem on 5709 NICs.
http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=object;h=c441b8d2cb2194b05550a558d6d95d8944e56a84

[5]: Debian Changelog linux-2.6 (2.6.32-45)
http://packages.debian.org/changelogs/pool/main/l/linux-2.6/linux-2.6_2.6.32-45/changelog#version2.6.32-41

[6]: bnx2: Fix lost MSI-X problem on 5709 NICs.
http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commit;h=0b7817edda5e44e5fa769645bd1220f5e7b0beb5

[7]: bnx2: reset_task is crashing the kernel. Fixing it.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4529819c45161e4a119134f56ef504e69420bc98

[8]: bnx2: fixing a timout error due not refreshing TX timers correctly
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e6bf95ffa8d6f8f4b7ee33ea01490d95b0bbeb6e

[9]: [PATCH] remove compaction from kswapd
http://thread.gmane.org/gmane.linux.kernel.mm/58962
https://lkml.org/lkml/2011/3/25/664

* Re: bnx2 cards intermittantly going offline
  2012-09-13 15:45     ` Sven Ulland
@ 2012-09-13 20:30       ` Michael Chan
  2012-09-16  3:47       ` Ben Hutchings
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Chan @ 2012-09-13 20:30 UTC
  To: Sven Ulland; +Cc: netdev, Marc A. Donges


On Thu, 2012-09-13 at 17:45 +0200, Sven Ulland wrote:
> On 09/13/2012 03:51 PM, Marc A. Donges wrote:
> > After 55 days of operation the machine (A) suddenly was no longer
> > reachable via network. Strangely, a second machine (B) that should
> > take over the IP addresses (keepalived) did not take over. Only
> > after shutting the switchport to which A is attached did B take
> > over.

The rx_ftq_discards problem is a firmware problem.  FTQ discards mean
that the firmware is no longer running and the packets are dropped at
the FTQ.  This is likely fixed in:

commit 22fa159d37efbfe781bbb99279efe83f58b87d29
Author: Michael Chan <mchan@broadcom.com>
Date:   Mon Oct 11 16:12:00 2010 -0700

    bnx2: Update firmware to 6.0.x.


> 
> Hi. We've had the same symptom with our BCM5709S [14e4:163a] on
> Debian. Like you, we were on stable's 2.6.32-41squeeze2. Google led us
> to many similar issues [1,2,3]. They concluded with the fix being in
> mainline commit c441b8d2 [4]: "bnx2: Fix lost MSI-X problem on 5709
> NICs".

This is a different problem and will not result in FTQ discards.

> 
> Broadcom: Can you publish a tool that decodes ethtool -d dumps to make
> debugging easier, or do you deem it no longer necessary with the
> register dump commits in 555069da?

The register dump during tx timeout is now quite comprehensive.

> 
> Now, Debian's 2.6.32-41squeeze2 is based on longterm release 2.6.32.54
> [5]. That version includes commit 0b7817ed [6], which is a backport of
> the already mentioned mainline commit c441b8d2.
> 
> So we tried digging further and applying some seemingly relevant
> commits [7,8] to our 2.6.32, but without any change in behaviour. Our
> temporary fix was to run 'ethtool -t ethX' to reset the device every
> time it locked up.
> 
> This dragged on with various builds, until we ended up on mainline
> 2.6.38 where we no longer saw any symptoms. I don't know in which
> kernel version it was fixed, but we ended up on that one, sort of by
> chance. Unfortunately, it had severe issues with kswapd memory
> compaction causing CPU soft lockups [9], so we went straight to
> squeeze-backports' 3.2.23-1~bpo60+2. We've been happy since then.
> 
> > We have five pairs of basically identical machines performing the
> > same task (each pair for one site). The error has not occurred with
> > any other one, but this site is the busiest:
> 
> We also saw the issue only at a site with generally higher load
> compared to other sites.
> 
> I'd love to know exactly which commit fixed the issue, but it's fairly
> tricky to reproduce the issue, and the bisect count is fairly high (it
> need not be a specific fix for bnx2).

If you see the same FTQ discards, please try that firmware commit
mentioned above.  Thanks.
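
As a rough illustration of the check described in this message, here is a small Python sketch. It is not Broadcom tooling: the function names and the wording of the returned hints are invented, and it encodes only the rule of thumb stated above (rising rx_ftq_discards points at the firmware; the lost MSI-X problem does not produce FTQ discards). The sample values are the two rx_ftq_discards readings from Marc's report.

```python
# Sketch only: encodes the FTQ-discard rule of thumb from this message.
# Function names and hint strings are hypothetical.

def ftq_discards_rising(before, after):
    """True if rx_ftq_discards grew between two `ethtool -S` snapshots,
    given as dicts mapping counter names to integers."""
    return after["rx_ftq_discards"] > before["rx_ftq_discards"]


def triage_hint(before, after):
    """Return a short hint on which class of problem to look at first."""
    if ftq_discards_rising(before, after):
        return "FTQ discards rising: firmware suspected, try the 6.0.x firmware update"
    return "no new FTQ discards: likely a different problem"


if __name__ == "__main__":
    # rx_ftq_discards values from the two dumps in Marc's report.
    before = {"rx_ftq_discards": 1796839}
    after = {"rx_ftq_discards": 1797748}
    print(triage_hint(before, after))
```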


* Re: bnx2 cards intermittantly going offline
  2012-09-13 15:45     ` Sven Ulland
  2012-09-13 20:30       ` Michael Chan
@ 2012-09-16  3:47       ` Ben Hutchings
  1 sibling, 0 replies; 8+ messages in thread
From: Ben Hutchings @ 2012-09-16  3:47 UTC
  To: Sven Ulland; +Cc: netdev, Marc A. Donges, Michael Chan

On Thu, 2012-09-13 at 17:45 +0200, Sven Ulland wrote:
> On 09/13/2012 03:51 PM, Marc A. Donges wrote:
> > After 55 days of operation the machine (A) suddenly was no longer
> > reachable via network. Strangely, a second machine (B) that should
> > take over the IP addresses (keepalived) did not take over. Only
> > after shutting the switchport to which A is attached did B take
> > over.
> 
> Hi. We've had the same symptom with our BCM5709S [14e4:163a] on
> Debian. Like you, we were on stable's 2.6.32-41squeeze2. Google led us
> to many similar issues [1,2,3]. They concluded with the fix being in
> mainline commit c441b8d2 [4]: "bnx2: Fix lost MSI-X problem on 5709
> NICs".
>
> Broadcom: Can you publish a tool that decodes ethtool -d dumps to make
> debugging easier, or do you deem it no longer necessary with the
> register dump commits in 555069da?

This tool should be ethtool itself (it includes dump decoders for many
drivers).

> Now, Debian's 2.6.32-41squeeze2 is based on longterm release 2.6.32.54
> [5]. That version includes commit 0b7817ed [6], which is a backport of
> the already mentioned mainline commit c441b8d2.
>
> So we tried digging further and applying some seemingly relevant
> commits [7,8] to our 2.6.32, but without any change in behaviour. Our
> temporary fix was to run 'ethtool -t ethX' to reset the device every
> time it locked up.
> 
> This dragged on with various builds, until we ended up on mainline
> 2.6.38 where we no longer saw any symptoms. I don't know in which
> kernel version it was fixed, but we ended up on that one, sort of by
> chance. Unfortunately, it had severe issues with kswapd memory
> compaction causing CPU soft lockups [9], so we went straight to
> squeeze-backports' 3.2.23-1~bpo60+2. We've been happy since then.
>
> > We have five pairs of basically identical machines performing the
> > same task (each pair for one site). The error has not occurred with
> > any other one, but this site is the busiest:
> 
> We also saw the issue only at a site with generally higher load
> compared to other sites.
> 
> I'd love to know exactly which commit fixed the issue, but it's fairly
> tricky to reproduce the issue, and the bisect count is fairly high (it
> need not be a specific fix for bnx2).

I don't see any changes to the driver itself that look relevant.
Perhaps this was a firmware bug?

Ben.

-- 
Ben Hutchings
Experience is what causes a person to make new mistakes instead of old ones.

