* sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
@ 2010-08-12  0:48 Maciej Żenczykowski
  2010-08-12  1:59 ` Stephen Hemminger
  0 siblings, 1 reply; 10+ messages in thread
From: Maciej Żenczykowski @ 2010-08-12  0:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Linux NetDev

[See https://bugzilla.redhat.com/show_bug.cgi?id=592398 ]

Latest tested kernel (from koji for Fedora 13):

2.6.34.3-35.rc1.fc13.x86_64

Occasionally, but possibly more and more often with recent
kernels (I think .33 and .34 are worse than .32), the sky2 driver locks
up.

During this time the NIC functions like a DSL line with a 95% drop
rate, i.e. sometimes something does get through, but mostly it's dead.
"ip link set eth0 down && ip link set eth0 up" is enough to fix it.

Here's the initial occurrence of this problem on the above kernel.

Aug 11 16:21:19 nike kernel: sky2 0000:0c:00.0: eth0: rx length error:
status 0x5d60100 length 2982
Aug 11 16:21:27 nike kernel: eth0: hw csum failure.
Aug 11 16:21:27 nike kernel: Pid: 0, comm: swapper Not tainted
2.6.34.3-35.rc1.fc13.x86_64 #1
Aug 11 16:21:27 nike kernel: Call Trace:
Aug 11 16:21:27 nike kernel: <IRQ>  [<ffffffff813a5c5b>]
netdev_rx_csum_fault+0x3b/0x3f
Aug 11 16:21:27 nike kernel: [<ffffffff8139f909>]
__skb_checksum_complete_head+0x51/0x65
Aug 11 16:21:27 nike kernel: [<ffffffff8139f92e>]
__skb_checksum_complete+0x11/0x13
Aug 11 16:21:27 nike kernel: [<ffffffff8140c339>] nf_ip_checksum+0xdd/0xe3
Aug 11 16:21:27 nike kernel: [<ffffffff813cc791>] udp_error+0x130/0x18a
Aug 11 16:21:27 nike kernel: [<ffffffff81037b51>] ? enqueue_task+0x5f/0x6a
Aug 11 16:21:27 nike kernel: [<ffffffff81037c67>] ? activate_task+0x2f/0x37
Aug 11 16:21:27 nike kernel: [<ffffffff813c7d69>] nf_conntrack_in+0x180/0x90e
Aug 11 16:21:27 nike kernel: [<ffffffff8103ea37>] ? enqueue_task_fair+0x44/0x87
Aug 11 16:21:27 nike kernel: [<ffffffff81037b51>] ? enqueue_task+0x5f/0x6a
Aug 11 16:21:27 nike kernel: [<ffffffff8140c995>] ipv4_conntrack_in+0x21/0x23
Aug 11 16:21:27 nike kernel: [<ffffffff813c4c56>] nf_iterate+0x46/0x89
Aug 11 16:21:27 nike kernel: [<ffffffff813d4790>] ? ip_rcv_finish+0x0/0x362
Aug 11 16:21:27 nike kernel: [<ffffffff813c4d03>] nf_hook_slow+0x6a/0xcb
Aug 11 16:21:27 nike kernel: [<ffffffff813d4790>] ? ip_rcv_finish+0x0/0x362
Aug 11 16:21:27 nike kernel: [<ffffffff813d4790>] ? ip_rcv_finish+0x0/0x362
Aug 11 16:21:27 nike kernel: [<ffffffff813d4e51>] NF_HOOK.clone.1+0x46/0x58
Aug 11 16:21:27 nike kernel: [<ffffffff8106e106>] ? getnstimeofday+0x63/0xb9
Aug 11 16:21:27 nike kernel: [<ffffffff813d510b>] ip_rcv+0x256/0x283
Aug 11 16:21:27 nike kernel: [<ffffffff813a53de>] netif_receive_skb+0x493/0x4b9
Aug 11 16:21:27 nike kernel: [<ffffffff813a5baa>] napi_skb_finish+0x29/0x40
Aug 11 16:21:27 nike kernel: [<ffffffff813a5bf0>] napi_gro_receive+0x2f/0x34
Aug 11 16:21:27 nike kernel: [<ffffffffa0160381>] sky2_poll+0x9c5/0xc58 [sky2]
Aug 11 16:21:27 nike kernel: [<ffffffff813a568f>] net_rx_action+0xaf/0x1ca
Aug 11 16:21:27 nike kernel: [<ffffffff81053244>] __do_softirq+0xe5/0x1a6
Aug 11 16:21:27 nike kernel: [<ffffffff8109e119>] ? handle_IRQ_event+0x60/0x121
Aug 11 16:21:27 nike kernel: [<ffffffff8100ab5c>] call_softirq+0x1c/0x30
Aug 11 16:21:27 nike kernel: [<ffffffff8100c342>] do_softirq+0x46/0x83
Aug 11 16:21:27 nike kernel: [<ffffffff810530b5>] irq_exit+0x3b/0x7d
Aug 11 16:21:27 nike kernel: [<ffffffff81452434>] do_IRQ+0xac/0xc3
Aug 11 16:21:27 nike kernel: [<ffffffff8144cb93>] ret_from_intr+0x0/0x11
Aug 11 16:21:27 nike kernel: <EOI>  [<ffffffff8127ef7b>] ?
acpi_idle_enter_bm+0x288/0x2bc
Aug 11 16:21:27 nike kernel: [<ffffffff8127ef74>] ?
acpi_idle_enter_bm+0x281/0x2bc
Aug 11 16:21:27 nike kernel: [<ffffffff81379458>] cpuidle_idle_call+0x99/0xf1
Aug 11 16:21:27 nike kernel: [<ffffffff81008c22>] cpu_idle+0xaa/0xe4
Aug 11 16:21:27 nike kernel: [<ffffffff8144553e>] start_secondary+0x253/0x294
Aug 11 16:21:34 nike kernel: eth0: hw csum failure.
Aug 11 16:21:34 nike kernel: Pid: 0, comm: swapper Not tainted
2.6.34.3-35.rc1.fc13.x86_64 #1
Aug 11 16:21:34 nike kernel: Call Trace:
Aug 11 16:21:34 nike kernel: <IRQ>  [<ffffffff813a5c5b>]
netdev_rx_csum_fault+0x3b/0x3f
Aug 11 16:21:34 nike kernel: [<ffffffff8139f909>]
__skb_checksum_complete_head+0x51/0x65
Aug 11 16:21:34 nike kernel: [<ffffffff8139f92e>] __skb_checksum_complete+0x11/0
...
etc, 700 messages over the course of the next hour (until I came back
and ip link down/up fixed it).

# cat /var/log/messages | egrep 'rx len'
Aug 11 16:21:19 nike kernel: sky2 0000:0c:00.0: eth0: rx length error:
status 0x5d60100 length 2982

(also seen on an older kernel [ 2.6.33.5-112.fc13.x86_64 ]:
  Jul 17 12:43:10 nike kernel: sky2 eth0: rx length error: status
0x5ea0100 length 3018
  Jul 28 02:34:46 nike kernel: sky2 eth0: rx length error: status
0x5ea0100 length 1642
  Jul 30 09:49:16 nike kernel: sky2 eth0: rx length error: status
0x5ea0100 length 3018
  Jul 31 00:20:26 nike kernel: sky2 eth0: rx length error: status
0x5ea0100 length 3018
and kernels before that, including 2.6.32.12-115.fc12.x86_64, but I
think I might have seen the problem even further back than 2.6.32).

# cat /var/log/messages | egrep 'eth0: hw csum failure\.$' | wc -l
694

The call stacks differ; here are the most common symbols with the number
of times they occur
(although this probably isn't particularly useful):

# cat /var/log/messages | egrep ffffffff | sed -rn 's@^^Aug .. ..:..:.. nike kernel: @@p' | sort | uniq -c | egrep -v '^     [ 1-9][0-9] '
    602 <EOI>  [<ffffffff8127ef7b>] ? acpi_idle_enter_bm+0x288/0x2bc
    630 [<ffffffff81008c22>] cpu_idle+0xaa/0xe4
    694 [<ffffffff8100ab5c>] call_softirq+0x1c/0x30
    693 [<ffffffff8100c342>] do_softirq+0x46/0x83
    273 [<ffffffff81010261>] ? sched_clock+0x9/0xd
    105 [<ffffffff8101038f>] ? native_sched_clock+0x2d/0x5f
    254 [<ffffffff810205a8>] ? lapic_next_event+0x1d/0x21
    190 [<ffffffff81037b51>] ? enqueue_task+0x5f/0x6a
    285 [<ffffffff81037c67>] ? activate_task+0x2f/0x37
    144 [<ffffffff8103ea37>] ? enqueue_task_fair+0x44/0x87
    693 [<ffffffff810530b5>] irq_exit+0x3b/0x7d
    694 [<ffffffff81053244>] __do_softirq+0xe5/0x1a6
    103 [<ffffffff8106b281>] ? sched_clock_local+0x1c/0x82
    693 [<ffffffff8106e106>] ? getnstimeofday+0x63/0xb9
    202 [<ffffffff8107148d>] ? clockevents_program_event+0x7a/0x83
    255 [<ffffffff810725e5>] ? tick_dev_program_event+0x3c/0xfc
    703 [<ffffffff8109e119>] ? handle_IRQ_event+0x60/0x121
    348 [<ffffffff810fe9af>] ? virt_to_head_page+0xe/0x2f
    528 [<ffffffff81216662>] ? __bitmap_weight+0x40/0x8f
    602 [<ffffffff8127ef74>] ? acpi_idle_enter_bm+0x281/0x2bc
    629 [<ffffffff81379458>] cpuidle_idle_call+0x99/0xf1
    115 [<ffffffff8139cffd>] ? __kfree_skb+0x7d/0x81
    694 [<ffffffff8139f909>] __skb_checksum_complete_head+0x51/0x65
    694 [<ffffffff8139f92e>] __skb_checksum_complete+0x11/0x13
    694 [<ffffffff813a53de>] netif_receive_skb+0x493/0x4b9
    694 [<ffffffff813a568f>] net_rx_action+0xaf/0x1ca
    694 [<ffffffff813a5baa>] napi_skb_finish+0x29/0x40
    694 [<ffffffff813a5bf0>] napi_gro_receive+0x2f/0x34
    695 [<ffffffff813c4c56>] nf_iterate+0x46/0x89
    695 [<ffffffff813c4d03>] nf_hook_slow+0x6a/0xcb
    145 [<ffffffff813c4d20>] ? nf_hook_slow+0x87/0xcb
    694 [<ffffffff813c7d69>] nf_conntrack_in+0x180/0x90e
    690 [<ffffffff813cc791>] udp_error+0x130/0x18a
   2083 [<ffffffff813d4790>] ? ip_rcv_finish+0x0/0x362
    163 [<ffffffff813d4c58>] ? ip_local_deliver_finish+0x0/0x1b3
    694 [<ffffffff813d4e51>] NF_HOOK.clone.1+0x46/0x58
    694 [<ffffffff813d510b>] ip_rcv+0x256/0x283
    694 [<ffffffff8140c339>] nf_ip_checksum+0xdd/0xe3
    694 [<ffffffff8140c995>] ipv4_conntrack_in+0x21/0x23
    338 [<ffffffff81434d5a>] rest_init+0x7e/0x80
    295 [<ffffffff8144553e>] start_secondary+0x253/0x294
    151 [<ffffffff8144c8a6>] ? _raw_spin_unlock_bh+0x15/0x17
    687 [<ffffffff8144cb93>] ret_from_intr+0x0/0x11
    687 [<ffffffff81452434>] do_IRQ+0xac/0xc3
    338 [<ffffffff81bae2c8>] x86_64_start_reservations+0xb3/0xb7
    338 [<ffffffff81bae3c4>] x86_64_start_kernel+0xf8/0x107
    338 [<ffffffff81baee6f>] start_kernel+0x413/0x41e
    694 [<ffffffffa0160381>] sky2_poll+0x9c5/0xc58 [sky2]
    150 [<ffffffffa05850ea>] ? nf_nat_cleanup_conntrack+0x69/0x6d [nf_nat]
    694 <IRQ>  [<ffffffff813a5c5b>] netdev_rx_csum_fault+0x3b/0x3f
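
The symbol tally above comes from a sed/sort/uniq pipeline. For anyone who prefers to post-process the log in a script, here is a rough Python sketch of the same idea. It assumes the syslog prefix format and the hostname "nike" seen in the excerpts, accepts any month rather than only "Aug", and uses a function name (count_frames) that is purely illustrative. It also assumes each log entry sits on one physical line, which is true of /var/log/messages but not of the mail-wrapped excerpts in this message.

```python
import re
from collections import Counter

# Syslog prefix as it appears in the excerpts, e.g.
# "Aug 11 16:21:27 nike kernel: ". The hostname "nike" is taken from the
# report; adjust it for another machine.
PREFIX = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d nike kernel: ")


def count_frames(lines, min_count=100):
    """Tally stack-frame lines and keep those seen at least min_count times.

    Mirrors the shell pipeline: select lines containing "ffffffff", strip
    the syslog prefix, count identical remainders, drop counts below 100.
    """
    counts = Counter()
    for line in lines:
        if "ffffffff" not in line:
            continue
        frame, stripped = PREFIX.subn("", line.rstrip("\n"))
        if not stripped:
            # Like "sed -n ... p", skip lines where the prefix did not match.
            continue
        counts[frame] += 1
    return {frame: n for frame, n in counts.items() if n >= min_count}
```

Typical use would be count_frames(open("/var/log/messages")), then printing the result sorted by count.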


* Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
  2010-08-12  0:48 sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully Maciej Żenczykowski
@ 2010-08-12  1:59 ` Stephen Hemminger
  2010-08-12  5:36   ` Maciej Żenczykowski
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Hemminger @ 2010-08-12  1:59 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: Stephen Hemminger, Linux NetDev

On Wed, 11 Aug 2010 17:48:59 -0700
Maciej Żenczykowski <zenczykowski@gmail.com> wrote:

> [See https://bugzilla.redhat.com/show_bug.cgi?id=592398 ]
> 
> Latest tested kernel (from koji for Fedora 13):
> 
> 2.6.34.3-35.rc1.fc13.x86_64
> 
> Occasionally, but possibly more and more often with recent
> kernels (I think .33 and .34 are worse than .32), the sky2 driver locks
> up.
> 
> During this time the NIC functions like a DSL line with a 95% drop
> rate, i.e. sometimes something does get through, but mostly it's dead.
> "ip link set eth0 down && ip link set eth0 up" is enough to fix it.
> 
> Here's the initial occurrence of this problem on the above kernel.
> 
> Aug 11 16:21:19 nike kernel: sky2 0000:0c:00.0: eth0: rx length error:
> status 0x5d60100 length 2982
> [...]

What do dmesg and lspci show? This looks like a timing issue that
is unique to your machine/bus hardware combination. Could
you try turning off hardware rx checksum (with ethtool)?



* Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
  2010-08-12  1:59 ` Stephen Hemminger
@ 2010-08-12  5:36   ` Maciej Żenczykowski
  2010-08-12 16:00     ` Stephen Hemminger
  0 siblings, 1 reply; 10+ messages in thread
From: Maciej Żenczykowski @ 2010-08-12  5:36 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Stephen Hemminger, Linux NetDev

Here's lspci (it's an otherwise stock MacBook Pro 4,1 with a
non-standard Atheros mini-pci wireless NIC replacing the standard
Broadcom one).

$ lspci
00:00.0 Host bridge: Intel Corporation Mobile PM965/GM965/GL960 Memory
Controller Hub (rev 03)
00:01.0 PCI bridge: Intel Corporation Mobile PM965/GM965/GL960 PCI
Express Root Port (rev 03)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB
UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB
UHCI Controller #5 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2
EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio
Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express
Port 1 (rev 03)
00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express
Port 3 (rev 03)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express
Port 5 (rev 03)
00:1c.5 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express
Port 6 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB
UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB
UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB
UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2
EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev f3)
00:1f.0 ISA bridge: Intel Corporation 82801HEM (ICH8M) LPC Interface
Controller (rev 03)
00:1f.1 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E)
IDE Controller (rev 03)
00:1f.2 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E)
SATA IDE Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 03)
01:00.0 VGA compatible controller: nVidia Corporation G84 [GeForce
8600M GT] (rev a1)
0b:00.0 Network controller: Atheros Communications Inc. AR928X
Wireless Network Adapter (PCI-Express) (rev 01)
0c:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8058
PCI-E Gigabit Ethernet Controller (rev 13)
0d:03.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b
Link Layer Controller (rev 02)

(more verbose lspci included in bugzilla entry)

At least one other person has seen this on a non-Mac desktop machine
(see the bugzilla entry).
What would you like from dmesg?
Is the following enough?

Aug  9 12:09:11 nike kernel: sky2: driver version 1.27
Aug  9 12:09:11 nike kernel: sky2 0000:0c:00.0: PCI INT A -> GSI 17
(level, low) -> IRQ 17
Aug  9 12:09:11 nike kernel: sky2 0000:0c:00.0: Yukon-2 EC Ultra chip revision 3
Aug  9 12:09:11 nike kernel: sky2 0000:0c:00.0: eth0: addr 00:1f:5b:xx:xx:xx
...
Aug  9 12:09:22 nike kernel: sky2 0000:0c:00.0: eth0: enabling interface
Aug  9 12:09:22 nike kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
...
Aug  9 12:09:25 nike kernel: sky2 0000:0c:00.0: eth0: Link is up at
1000 Mbps, full duplex, flow control rx
Aug  9 12:09:25 nike kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
...
Aug 11 16:21:19 nike kernel: sky2 0000:0c:00.0: eth0: rx length error:
status 0x5d60100 length 2982
Aug 11 16:21:27 nike kernel: eth0: hw csum failure.
...

I'd just like to point out that this has happened something like 5
times in the past 30 days on a machine which is on 24/7 with wired
ethernet plugged in nearly 100% of the time.


* Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
  2010-08-12  5:36   ` Maciej Żenczykowski
@ 2010-08-12 16:00     ` Stephen Hemminger
  2010-08-12 16:16       ` Stephen Hemminger
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Hemminger @ 2010-08-12 16:00 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: Stephen Hemminger, Linux NetDev

On Wed, 11 Aug 2010 22:36:57 -0700
Maciej Żenczykowski <zenczykowski@gmail.com> wrote:

> Here's lspci (it's an otherwise stock MacBook Pro 4,1 with a
> non-standard wireless atheros mini-pci nic, replacing the std
> broadcom.)
> 
> [...]
> 
> I'd just like to point out that this has happened something like 5
> times in the past 30 days on a machine which is on 24/7 with wired
> ethernet plugged in nearly 100% of the time.

Probably the only thing the driver can do in these cases is automatically
turn off checksumming if it suspects the chip is having problems.

Is there a known good older kernel version?



* Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
  2010-08-12 16:00     ` Stephen Hemminger
@ 2010-08-12 16:16       ` Stephen Hemminger
  2010-08-12 16:58         ` Maciej Żenczykowski
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Hemminger @ 2010-08-12 16:16 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: Stephen Hemminger, Linux NetDev


> > Aug 11 16:21:19 nike kernel: sky2 0000:0c:00.0: eth0: rx length error:
> > status 0x5d60100 length 2982
> > Aug 11 16:21:27 nike kernel: eth0: hw csum failure.

Are you trying to run with Jumbo >1500 MTU?


* Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
  2010-08-12 16:16       ` Stephen Hemminger
@ 2010-08-12 16:58         ` Maciej Żenczykowski
  2010-08-12 19:18           ` Stephen Hemminger
  0 siblings, 1 reply; 10+ messages in thread
From: Maciej Żenczykowski @ 2010-08-12 16:58 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Stephen Hemminger, Linux NetDev

I'm not sure if there is a known good kernel.  It seems to be getting
worse over time (as I upgrade kernels), but maybe the hardware is
aging and the situation is becoming more likely.  When it first
started happening it was like once every 2-3 months or even rarer.
Now it has happened again since the last time I posted to this
thread...

Aug 12 08:29:08 nike kernel: sky2 0000:0c:00.0: eth0: rx length error:
status 0x5e50100 length 3013

> Are you trying to run with Jumbo >1500 MTU?

No, normal 1500 MTU network, with ipv4 and ipv6 native traffic.  Not a
huge amount of traffic either.
And indeed the problem seems to happen just as easily (if not more easily)
when the machine (and thus the network) is close(r) to idle (i.e.
overnight, etc.), although that might just be a matter of more time
passing.

Are you sure there is nothing the driver could do on seeing such an error?
It seems like since "ip link set eth0 down && ip link set eth0 up"
fixes it, what it should do is some sort of partial reset...

I will try to verify whether 'ethtool -K eth0 rx off && ethtool -K eth0 rx
on' is enough to fix the problem (when it happens once again).
Afterwards I'll turn off rx csum (ethtool -K eth0 rx off) and will see
if it happens again.


* Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
  2010-08-12 16:58         ` Maciej Żenczykowski
@ 2010-08-12 19:18           ` Stephen Hemminger
  2010-08-12 20:31             ` Maciej Żenczykowski
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Hemminger @ 2010-08-12 19:18 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: Stephen Hemminger, Linux NetDev

On Thu, 12 Aug 2010 09:58:01 -0700
Maciej Żenczykowski <zenczykowski@gmail.com> wrote:

> I'm not sure if there is a known good kernel.  It seems to be getting
> worse over time (as I upgrade kernels), but maybe the hardware is
> aging and the situation is becoming more likely.  When it first
> started happening it was like once every 2-3 months or even rarer.
> Now it has happened again since the last time I posted to this
> thread...
> 
> Aug 12 08:29:08 nike kernel: sky2 0000:0c:00.0: eth0: rx length error:
> status 0x5e50100 length 3013
> 
> > Are you trying to run with Jumbo >1500 MTU?
> 
> No, normal 1500 MTU network, with ipv4 and ipv6 native traffic.  Not a
> huge amount of traffic either.
> And indeed the problem seems to happen just as easily (if not easier)
> when the machine (and thus the network) is close(r) to idle (ie.
> overnight, etc) - although that might just be a matter of more time
> passing.
> 
> Are you sure there is nothing the driver could do on seeing such an error?
> It seems like since "ip link set eth0 down && ip link set eth0 up"
> fixes it, what it should do is some sort of partial reset...
> 
> I will try to verify whether 'ethtool -K eth0 rx off && ethtool -K eth0 rx
> on' is enough to fix the problem (when it happens once again).
> Afterwards I'll turn off rx csum (ethtool -K eth0 rx off) and will see
> if it happens again.


The status values indicate that the GMAC (frame parser) got a reasonably
sized frame but the DMA merged frames together. This indicates a timing
problem. There are some bits that even the NDA programmer's manual doesn't
help with. The Linux driver expects the BIOS or EEPROM to set them correctly,
because different problems call for different settings.

There is firmware in the EEPROM that configures internal state. On one motherboard
the vendor provided an update. There is no good way to update this from Linux;
you need to go to the system vendor and install the firmware with their native OS
(i.e. Windows or MacOS).
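
The mismatch between the status word and the reported length can be checked by hand from the logged values. The sketch below is Python for illustration, not driver code. It assumes the layout I recall from the GMR_FS_* definitions in sky2.h (frame length in bits 30..16, "receive OK" in bit 8); verify this against the kernel tree you are running. The function name decode_rx_status is hypothetical.

```python
# Assumed bit layout of the receive status word, based on a reading of the
# GMR_FS_* enum in the driver's sky2.h. Treat these constants as assumptions.
GMR_FS_LEN_SHIFT = 16      # frame length starts at bit 16
GMR_FS_LEN_MASK = 0x7FFF   # 15-bit frame length field
GMR_FS_RX_OK = 1 << 8      # MAC considered the frame good


def decode_rx_status(status, dma_length):
    """Compare the MAC's frame length with the length reported with it."""
    mac_length = (status >> GMR_FS_LEN_SHIFT) & GMR_FS_LEN_MASK
    return {
        "mac_length": mac_length,
        "rx_ok": bool(status & GMR_FS_RX_OK),
        "dma_length": dma_length,
        # Bytes beyond what the MAC saw; nonzero means the lengths disagree.
        "excess": dma_length - mac_length,
    }
```

Under these assumptions, "status 0x5d60100 length 2982" decodes to a 1494-byte frame that the MAC marked as good, while the reported length is 1488 bytes longer, which fits the description of frames being merged.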


* Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
  2010-08-12 19:18           ` Stephen Hemminger
@ 2010-08-12 20:31             ` Maciej Żenczykowski
  2010-08-17 19:37               ` Stephen Hemminger
  0 siblings, 1 reply; 10+ messages in thread
From: Maciej Żenczykowski @ 2010-08-12 20:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Stephen Hemminger, Linux NetDev

> The status values indicate that the GMAC (frame parser) got a reasonably
> sized frame but the DMA merged frames together. This indicates a timing
> problem. There are some bits that even the NDA programmer's manual doesn't
> help with. The Linux driver expects the BIOS or EEPROM to set them correctly,
> because different problems call for different settings.
>
> There is firmware in the EEPROM that configures internal state. On one motherboard
> the vendor provided an update. There is no good way to update this from Linux;
> you need to go to the system vendor and install the firmware with their native OS
> (i.e. Windows or MacOS).

Perfectly reasonable response.  If there was a firmware update fix,
I'd apply it...
That would presumably prevent this from ever happening in the first place.

But why doesn't the network driver reset the nic when it detects this
'rx length' error?

I'm not asking for the error to not happen (besides it happens very rarely)...

I'm asking why this error, once it happens, permanently hoses the network driver.
Once this happens the network card is not usable - traffic does not
flow through it.
You need to "ip link set down && ... up" to fix it.  Isn't this
something the driver could and should do all by itself?


* Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
  2010-08-12 20:31             ` Maciej Żenczykowski
@ 2010-08-17 19:37               ` Stephen Hemminger
  2010-08-17 20:05                 ` Maciej Żenczykowski
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Hemminger @ 2010-08-17 19:37 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: Stephen Hemminger, Linux NetDev

On Thu, 12 Aug 2010 13:31:13 -0700
Maciej Żenczykowski <zenczykowski@gmail.com> wrote:

> > The status values indicate that the GMAC (frame parser) got a reasonably
> > sized frame but the DMA merged frames together. This indicates a timing
> > problem. There are some bits that even the NDA programmer's manual doesn't
> > help with. The Linux driver expects the BIOS or EEPROM to set them correctly,
> > because different problems need different settings.
> >
> > There is firmware in the EEPROM that configures internal state. On one motherboard
> > the vendor provided an update. There is no good way to update this from Linux;
> > you need to go to the system vendor and install the firmware with their native OS
> > (i.e. Windows or Mac OS).
> 
> Perfectly reasonable response.  If there was a firmware update fix,
> I'd apply it...
> That would presumably prevent this from ever happening in the first place.
> 
> But why doesn't the network driver reset the nic when it detects this
> 'rx length' error?
> 
> I'm not asking for the error to not happen (besides it happens very rarely)...
> 
> I'm asking why this error, once it happens, permanently hoses the network driver.
> Once this happens the network card is not usable - traffic does not
> flow through it.
> You need to "ip link set down && ... up" to fix it.  Isn't this
> something the driver could and should do all by itself?

Also, the driver could schedule a reset (that is what the watchdog does),
but it looks like the receive DMA is walking past the end of the packet
and that is really dangerous since it could clobber random memory.

You might want to increase the size of rx DMA buffer and dump the
contents of the receive buffer to see if there is a memory corruption
risk. If the End Of Frame DMA hardware is not working, there is a real
danger if the driver silently continues.
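
One way to act on the suggestion above is to pre-fill an oversized receive buffer with a known pattern and then see how far the pattern was overwritten. The following is only a userspace sketch of that idea, not driver code: RX_POISON and all function names are hypothetical, and a real test would have to hook the driver's own buffer allocation and receive paths.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define RX_POISON 0xa5 /* hypothetical fill byte, chosen arbitrarily */

/* Fill the whole (oversized) buffer with the pattern before it is
 * handed to the device. */
static void rx_buf_poison(unsigned char *buf, size_t buf_size)
{
    memset(buf, RX_POISON, buf_size);
}

/* Return the offset one past the last byte that no longer holds the
 * pattern. This underestimates the extent if the written data happens
 * to end in RX_POISON bytes. */
static size_t rx_buf_written_extent(const unsigned char *buf, size_t buf_size)
{
    size_t end = buf_size;

    while (end > 0 && buf[end - 1] == RX_POISON)
        end--;
    return end;
}

/* Hex-dump the first len bytes, 16 per line, for manual inspection. */
static void rx_buf_dump(FILE *out, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        fprintf(out, "%02x%c", buf[i],
                (i % 16 == 15 || i + 1 == len) ? '\n' : ' ');
}
```

If the extent returned for a buffer is larger than the frame length the GMAC reported, the DMA wrote past the end of the frame, and the dump shows what landed there.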


* Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully
  2010-08-17 19:37               ` Stephen Hemminger
@ 2010-08-17 20:05                 ` Maciej Żenczykowski
  0 siblings, 0 replies; 10+ messages in thread
From: Maciej Żenczykowski @ 2010-08-17 20:05 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Stephen Hemminger, Linux NetDev

> Also, the driver could schedule a reset (that is what the watchdog does),
> but it looks like the receive DMA is walking past the end of the packet
> and that is really dangerous since it could clobber random memory.

I'm not sure to what watchdog you are referring.
I'm not aware of this machine having a hw (or sw) watchdog.

> You might want to increase the size of rx DMA buffer and dump the
> contents of the receive buffer to see if there is a memory corruption

Is this something that is a userspace tweak (how? pointers please?),
or do you mean to modify the kernel source code and recompile?

> risk. If the End Of Frame DMA hardware is not working, there is a real
> danger if the driver silently continues.

Agreed, although that does seem to be what the driver is currently
doing... silently continuing.

---

BTW, I've run into the issue once more and I've verified that turning
off all acceleration options doesn't fix the problem
(ethtool -K eth0 rx off tx off sg off tso off gso off
[ufo/gro/lro already were off]), nor does trying to get the network
card to renegotiate the link speed (ethtool -s eth0 speed 10; sleep 5;
ethtool -s eth0 speed 1000; sleep 5; ethtool -r eth0).

I am now testing to see if the problem ever occurs if all the
acceleration options are turned off.
Of late the occurrence rate seems to be pretty steady at about twice per week.

Maciej

