netdev.vger.kernel.org archive mirror
* BCM5721 transmit queue 0 timed out
From: Cosmin GIRADU @ 2013-07-18  8:47 UTC
  To: Linux Net Dev

Hi,

I need some help with the following situation:

We keep getting random lockups on our BCM5721 cards (most of them are
LOMs, on multiple machines running various kernel versions between 3.4
and 3.10.1) when traffic is high (above 300 Mbit/s). The hardware is
dual-port "Tigon3 [partno(BCM95721) rev 4201] (PCI Express)" with a 5750
chip inside.
The lockups look like this:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x25a/0x270()
NETDEV WATCHDOG: eth2 (tg3): transmit queue 0 timed out
Modules linked in: ip_gre ip_tunnel gre loop processor thermal_sys i2c_i801 lpc_ich coretemp button mfd_core
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.1.htb.104 #1
Hardware name: IBM IBM System x3250 -[43654BG]-/M31ip, BIOS IBM BIOS Version 1.33-[G9E133AUS-1.33]- 08/28/2007
 ffffffff81781f16 ffff88003fd03d98 ffffffff8152f6eb ffff88003fd03dd8
 ffffffff8103659b ffff88003fd03dd8 ffff88003d3f0000 ffff88003e103d00
 0000000000000005 0000000000000001 ffff88003e0a9428 ffff88003fd03e38
Call Trace:
 <IRQ>  [<ffffffff8152f6eb>] dump_stack+0x19/0x1e
 [<ffffffff8103659b>] warn_slowpath_common+0x6b/0xa0
 [<ffffffff81036671>] warn_slowpath_fmt+0x41/0x50
 [<ffffffff81471d6a>] dev_watchdog+0x25a/0x270
 [<ffffffff81471b10>] ? __netdev_watchdog_up+0x80/0x80
 [<ffffffff8104312c>] call_timer_fn+0x2c/0x90
 [<ffffffff81043369>] run_timer_softirq+0x1d9/0x1f0
 [<ffffffff8103d351>] __do_softirq+0xd1/0x1a0
 [<ffffffff8103d4c5>] irq_exit+0x65/0x80
 [<ffffffff81024399>] smp_apic_timer_interrupt+0x69/0xa0
 [<ffffffff81533b0a>] apic_timer_interrupt+0x6a/0x70
 <EOI>  [<ffffffff8100a126>] ? default_idle+0x6/0x10
 [<ffffffff8100a2f6>] arch_cpu_idle+0x16/0x20
 [<ffffffff8106afd5>] cpu_startup_entry+0xa5/0x200
 [<ffffffff818c57ce>] start_secondary+0x267/0x269
---[ end trace d3a202af040f84f0 ]---
tg3 0000:01:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3 0000:01:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3 0000:01:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3 0000:01:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2

As far as I can tell, the "tg3_stop_block timed out" message is logged
when the card is being reset after the transmit watchdog timer expires,
and is quite harmless (I hope I'm reading it right). However, the hangs
do become more frequent as the amount of traffic rises, and that does
interfere with operation.
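
When comparing captures from several machines, the tg3_stop_block lines
can be extracted mechanically. The following is a minimal Python sketch
and not part of the original report: the function name and the regular
expression are illustrative assumptions derived only from the message
format quoted above, and treating enable_bit as hexadecimal is likewise
an assumption.

```python
import re
from collections import Counter

# Pattern inferred from the quoted log lines (an assumption, not a format
# guaranteed by the driver), e.g.:
#   tg3 0000:01:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
STOP_BLOCK_RE = re.compile(
    r"tg3 (?P<pci>[0-9a-f:.]+): tg3_stop_block timed out, "
    r"ofs=(?P<ofs>[0-9a-f]+) enable_bit=(?P<bit>[0-9a-f]+)"
)


def parse_stop_block_timeouts(log_text):
    """Return a list of (pci_address, offset, enable_bit) tuples."""
    events = []
    for line in log_text.splitlines():
        match = STOP_BLOCK_RE.search(line)
        if match:
            events.append((
                match.group("pci"),
                int(match.group("ofs"), 16),   # offsets such as c00 are hex
                int(match.group("bit"), 16),   # assumed hex as well
            ))
    return events


SAMPLE = """\
tg3 0000:01:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3 0000:01:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3 0000:01:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3 0000:01:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
"""

# Count how often each (device, register offset) pair failed to stop.
counts = Counter((pci, ofs) for pci, ofs, _ in parse_stop_block_timeouts(SAMPLE))
for (pci, ofs), n in sorted(counts.items()):
    print(f"{pci} ofs=0x{ofs:x}: {n} timeouts")
```

Grouping by device and offset makes it easier to see whether the same
blocks fail to stop on every affected machine.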

As a workaround, disabling scatter-gather on the offending cards stops
the problem from reappearing; however, I'd like to get to the bottom of
this once and for all.
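
The message does not say how scatter-gather was disabled. One common way
is ethtool's offload switch (`ethtool -K <iface> sg off`). The sketch
below is an assumption about how that could be scripted, not the
poster's actual procedure; the helper names are hypothetical, and the
interface name eth2 is taken from the watchdog message above.

```python
import subprocess


def sg_command(iface, enable):
    """Build the ethtool argv that toggles scatter-gather on iface."""
    return ["ethtool", "-K", iface, "sg", "on" if enable else "off"]


def set_scatter_gather(iface, enable, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    cmd = sg_command(iface, enable)
    if not dry_run:
        # Requires root privileges and the ethtool binary on PATH.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: only show what would be executed for the affected port.
print(" ".join(set_scatter_gather("eth2", enable=False)))
```

Note that turning off scatter-gather usually also turns off offloads
that depend on it, such as TSO, so the workaround may cost some CPU at
these packet rates.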

-- 

Cosmin GIRADU
OSS Engineer
RCS & RDS - Unified Services
Phone:  +40-31-400-6323
Mobile: +40-77-020-0858
http://www.rcs-rds.ro

..........................................................................
Privileged/Confidential Information may be contained in this message. If
you are not the addressee indicated in this message (or responsible for
delivery of the message to such person), you may not copy or deliver
this message to anyone. In such a case, you should destroy this message
and kindly notify the sender by reply e-mail.




* Re: BCM5721 transmit queue 0 timed out
From: Nithin Nayak Sujir @ 2013-07-18 16:36 UTC
  To: Cosmin GIRADU; +Cc: Linux Net Dev


On 7/18/2013 1:47 AM, Cosmin GIRADU wrote:
> Hi,
>
> I need some help with the following situation:
>
> We keep getting random lockups on our BCM5721 cards (most of them are
> LOMs, multiple machines, running multiple kernel versions between 3.4
> and 3.10.1), when the traffic is high (above 300Mbit/s). The hardware is
> dual port "Tigon3 [partno(BCM95721) rev 4201] (PCI Express)" with 5750
> chip inside.

Cosmin,
Can you send the full register dump from the kernel log?

Also can you give more details about the system and the traffic? Is it 
reproducible with something like netperf?

Nithin.



* Re: BCM5721 transmit queue 0 timed out
From: Cosmin GIRADU @ 2013-07-19  7:24 UTC
  To: Nithin Nayak Sujir; +Cc: Linux Net Dev

On 18/07/13 19:36, Nithin Nayak Sujir wrote:
>
> On 7/18/2013 1:47 AM, Cosmin GIRADU wrote:
>> Hi,
>>
>> I need some help with the following situation:
>>
>> We keep getting random lockups on our BCM5721 cards (most of them are
>> LOMs, multiple machines, running multiple kernel versions between 3.4
>> and 3.10.1), when the traffic is high (above 300Mbit/s). The hardware is
>> dual port "Tigon3 [partno(BCM95721) rev 4201] (PCI Express)" with 5750
>> chip inside.
>
> Cosmin,
> Can you send the full register dump from the kernel log?
This is what I have in the kernel log:

tg3 0000:01:00.0 eth0: Tigon3 [partno(BCM95721) rev 4201] (PCI Express) MAC address 00:1a:64:6d:e7:57
tg3 0000:01:00.0 eth0: attached PHY is 5750 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
tg3 0000:01:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
tg3 0000:01:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
tg3 0000:03:00.0 eth1: Tigon3 [partno(BCM95721) rev 4201] (PCI Express) MAC address 00:1a:64:6d:e7:58
tg3 0000:03:00.0 eth1: attached PHY is 5750 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
tg3 0000:03:00.0 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
tg3 0000:03:00.0 eth1: dma_rwctrl[76180000] dma_mask[64-bit]
tg3 0000:01:00.0: irq 42 for MSI/MSI-X
tg3 0000:03:00.0: irq 43 for MSI/MSI-X

This is the kernel log from the same machine. After init, udev renames
the NICs as follows: eth0 -> eth2, eth1 -> eth3.

If you meant something else, please instruct me on how to obtain it.

>
> Also can you give more details about the system and the traffic? Is it
> reproducible with something like netperf?
We use the system as a QoS router with ~6k classes per interface and
40 kpps to 70 kpps of symmetric traffic; the only way I can describe
the flows is "completely random".

>
> Nithin.
>
Thank you!


* Re: BCM5721 transmit queue 0 timed out
From: Cosmin GIRADU @ 2013-07-19  7:48 UTC
  To: Nithin Nayak Sujir; +Cc: Linux Net Dev

On 19/07/13 10:24, Cosmin GIRADU wrote:
> On 18/07/13 19:36, Nithin Nayak Sujir wrote:
>> Also can you give more details about the system and the traffic? Is it
>> reproducible with something like netperf?
> We use the system as a qos router with ~6k classes per interface, 40kpps
> to 70kpps symmetric,
> and the only way I can describe the flows is "completely random".
I forgot to mention that only the first port is used for forwarding
packets; the second one is used for OOB management. The system looks
like this:

                           _ eth3(tg3) - management
                          /
                          |
internet - eth2(BCM5721) -+- eth0(e1000e) - MAN
                          |
                          +- eth1(e1000e) - MAN

           70kpps in
           70kpps out

