* sky2 panic in 2.6.32.1 under load @ 2009-12-21 23:52 Berck E. Nash 2009-12-22 0:09 ` Michael Breuer ` (2 more replies) 0 siblings, 3 replies; 145+ messages in thread From: Berck E. Nash @ 2009-12-21 23:52 UTC (permalink / raw) To: linux-kernel Since 2.6.32, I've been getting kernel panics under heavy network load (bittorrent usage). [ 8673.345873] sky2 eth0: receiver hang detected [ 8673.350368] sky2 eth0: disabling interface [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 [ 8673.373838] Oops: 0002 [#1] SMP [ 8673.373838] last sysfs file: /sys/devices/platform/coretemp.1/temp1_label [ 8673.383900] CPU 0 [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 usbcore tveeprom [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 #3 P5W DH Deluxe [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: 0000000000000000 [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: 0000000000000000 [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: 0000000000000000 [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: ffff88009bf3ab48 [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: 00000000955fa87e [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) knlGS:0000000000000000 [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: 00000000000026f0 [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo ffff8800bf8b4000, task ffff8800bf8a8650) [ 8673.539212] Stack: [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 000000428122a659 [ 8673.548658] <0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 ffffffff813eaf30 [ 8673.556929] <0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 ffffffff81231de8 [ 8673.565432] Call Trace: [ 8673.565432] <IRQ> [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 [ 8673.609833] <EOI> [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 4b 20 <4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] [ 8673.674902] RSP <ffff880001603df8> [ 8673.674902] CR2: 0000000000000010 [ 8673.682956] hda-intel: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj. [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D 2.6.32.1 #3 [ 8673.683021] Call Trace: [ 8673.683023] <IRQ> [<ffffffff8129839d>] panic+0x75/0x11c [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 [sky2] [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 [sky2] [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 [ 8673.683079] <EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 Note that sometimes I get the receiver hang and things continue without the crash: [ 1219.346571] sky2 eth0: receiver hang detected [ 1219.351145] sky2 eth0: disabling interface [ 1219.357389] sky2 eth0: enabling interface [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both Please CC as I'm not subscribed. Thanks, Berck Nash ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-21 23:52 sky2 panic in 2.6.32.1 under load Berck E. Nash @ 2009-12-22 0:09 ` Michael Breuer 2009-12-22 18:50 ` Michael Breuer 2009-12-22 0:52 ` sky2 panic in 2.6.32.1 under load Daniel Hazelton 2009-12-24 6:58 ` Andrew Morton 2 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-22 0:09 UTC (permalink / raw) To: Berck E. Nash; +Cc: linux-kernel Just an FYI - I got one of these today as well - also 2.6.32 (last git before 2.6.33 rc1). I also had two similar events previously - one 2.6.31, and one 2.6.32 rc7. I was doing some research today and came across a fair sampling of error reports all suggesting that new firmware is a solution. Gigabyte (not my MB) has firmware posted, but I haven't yet attempted to apply it (nor am I sure how to test it after doing so). For what it's worth, my three recent events were heavy smb usage - windows system backing up. Today's event did not generate an oops as my watchdog kicked in and rebooted the system after the kernel reported a sky2 error. The last messages before watchdog: Dec 20 05:33:58 mail kernel: DRHD: handling fault status reg 2 Dec 20 05:33:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr ffe1a7ffe000 Dec 20 05:33:58 mail kernel: DMAR:[fault reason 06] PTE Read access is not set Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: error interrupt status=0x80000000 Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) During the previous event I had vt-d disabled - it was enabled this time. On 12/21/2009 6:52 PM, Berck E. Nash wrote: > Since 2.6.32, I've been getting kernel panics under heavy network load > (bittorrent usage). > > [ 8673.345873] sky2 eth0: receiver hang detected > [ 8673.350368] sky2 eth0: disabling interface > [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at > 0000000000000010 > [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 > [ 8673.373838] Oops: 0002 [#1] SMP > [ 8673.373838] last sysfs file: /sys/devices/platform/coretemp.1/temp1_label > [ 8673.383900] CPU 0 > [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl > auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer > hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 > snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec > snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss > snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common > snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd > v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 > usbcore tveeprom > [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 > #3 P5W DH Deluxe > [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] > sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 > [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: > 0000000000000000 > [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: > 0000000000000000 > [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: > 0000000000000000 > [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: > ffff88009bf3ab48 > [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: > 00000000955fa87e > [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) > knlGS:0000000000000000 > [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: > 00000000000026f0 > [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: > 0000000000000400 > [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo ffff8800bf8b4000, > task ffff8800bf8a8650) > [ 8673.539212] Stack: > [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 > 000000428122a659 > [ 8673.548658]<0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 > ffffffff813eaf30 > [ 8673.556929]<0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 > ffffffff81231de8 > [ 8673.565432] Call Trace: > [ 8673.565432]<IRQ> > [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 > [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 > [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 > [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 > [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e > [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b > [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 > [ 8673.609833]<EOI> > [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b > [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 > [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 > [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 > [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 > [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 > [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 > [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 > c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 > 4b 20<4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b > [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.674902] RSP<ffff880001603df8> > [ 8673.674902] CR2: 0000000000000010 > [ 8673.682956] hda-intel: IRQ timing workaround is activated for card > #0. Suggest a bigger bdl_pos_adj. > [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- > [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt > [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D 2.6.32.1 #3 > [ 8673.683021] Call Trace: > [ 8673.683023]<IRQ> [<ffffffff8129839d>] panic+0x75/0x11c > [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e > [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd > [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 > [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 > [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 > [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 > [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 > [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 [sky2] > [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 > [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 > [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 > [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 > [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e > [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b > [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 > [ 8673.683079]<EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b > [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 > [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 > [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 > [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 > [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 > [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 > > Note that sometimes I get the receiver hang and things continue without > the crash: > > [ 1219.346571] sky2 eth0: receiver hang detected > [ 1219.351145] sky2 eth0: disabling interface > [ 1219.357389] sky2 eth0: enabling interface > [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow > control both > > Please CC as I'm not subscribed. > > Thanks, > Berck Nash > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-22 0:09 ` Michael Breuer @ 2009-12-22 18:50 ` Michael Breuer 2009-12-23 22:54 ` sky2 panic in 2.6.32.1 under load (new oops) Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-22 18:50 UTC (permalink / raw) To: Berck E. Nash; +Cc: linux-kernel I updated my firmware (Marvell/Yukon)) from .12 to .14. Going to leave my kernel version unchanged for the next week or so and see if this recurs. On 12/21/2009 7:09 PM, Michael Breuer wrote: > Just an FYI - I got one of these today as well - also 2.6.32 (last git > before 2.6.33 rc1). I also had two similar events previously - one > 2.6.31, and one 2.6.32 rc7. > > I was doing some research today and came across a fair sampling of > error reports all suggesting that new firmware is a solution. Gigabyte > (not my MB) has firmware posted, but I haven't yet attempted to apply > it (nor am I sure how to test it after doing so). > > For what it's worth, my three recent events were heavy smb usage - > windows system backing up. Today's event did not generate an oops as > my watchdog kicked in and rebooted the system after the kernel > reported a sky2 error. > > The last messages before watchdog: > Dec 20 05:33:58 mail kernel: DRHD: handling fault status reg 2 > Dec 20 05:33:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] > fault addr ffe1a7ffe000 > Dec 20 05:33:58 mail kernel: DMAR:[fault reason 06] PTE Read access is > not set > Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x80000000 > Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: PCI hardware error > (0x2010) > > During the previous event I had vt-d disabled - it was enabled this time. > > > On 12/21/2009 6:52 PM, Berck E. Nash wrote: >> Since 2.6.32, I've been getting kernel panics under heavy network load >> (bittorrent usage). >> >> [ 8673.345873] sky2 eth0: receiver hang detected >> [ 8673.350368] sky2 eth0: disabling interface >> [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at >> 0000000000000010 >> [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 >> [sky2] >> [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 >> [ 8673.373838] Oops: 0002 [#1] SMP >> [ 8673.373838] last sysfs file: >> /sys/devices/platform/coretemp.1/temp1_label >> [ 8673.383900] CPU 0 >> [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl >> auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer >> hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 >> snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec >> snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss >> snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common >> snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd >> v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 >> usbcore tveeprom >> [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 >> #3 P5W DH Deluxe >> [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] >> sky2_xmit_frame+0x321/0x5d8 [sky2] >> [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 >> [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: >> 0000000000000000 >> [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: >> 0000000000000000 >> [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: >> 0000000000000000 >> [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: >> ffff88009bf3ab48 >> [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: >> 00000000955fa87e >> [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) >> knlGS:0000000000000000 >> [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b >> [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: >> 00000000000026f0 >> [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >> 0000000000000000 >> [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: >> 0000000000000400 >> [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo ffff8800bf8b4000, >> task ffff8800bf8a8650) >> [ 8673.539212] Stack: >> [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 >> 000000428122a659 >> [ 8673.548658]<0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 >> ffffffff813eaf30 >> [ 8673.556929]<0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 >> ffffffff81231de8 >> [ 8673.565432] Call Trace: >> [ 8673.565432]<IRQ> >> [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >> [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >> [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >> [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >> [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >> [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >> [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >> [ 8673.609833]<EOI> >> [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b >> [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >> [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >> [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 >> [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >> [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >> [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >> [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 >> c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 >> 4b 20<4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b >> [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 >> [sky2] >> [ 8673.674902] RSP<ffff880001603df8> >> [ 8673.674902] CR2: 0000000000000010 >> [ 8673.682956] hda-intel: IRQ timing workaround is activated for card >> #0. Suggest a bigger bdl_pos_adj. >> [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- >> [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt >> [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D >> 2.6.32.1 #3 >> [ 8673.683021] Call Trace: >> [ 8673.683023]<IRQ> [<ffffffff8129839d>] panic+0x75/0x11c >> [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e >> [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd >> [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 >> [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 >> [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 >> [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 >> [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 >> [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 >> [sky2] >> [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 >> [sky2] >> [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >> [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >> [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >> [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >> [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >> [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >> [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >> [ 8673.683079]<EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b >> [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >> [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >> [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 >> [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >> [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >> [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >> >> Note that sometimes I get the receiver hang and things continue without >> the crash: >> >> [ 1219.346571] sky2 eth0: receiver hang detected >> [ 1219.351145] sky2 eth0: disabling interface >> [ 1219.357389] sky2 eth0: enabling interface >> [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow >> control both >> >> Please CC as I'm not subscribed. >> >> Thanks, >> Berck Nash >> -- >> To unsubscribe from this list ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-22 18:50 ` Michael Breuer @ 2009-12-23 22:54 ` Michael Breuer 2009-12-24 7:01 ` Andrew Morton 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-23 22:54 UTC (permalink / raw) To: Berck E. Nash; +Cc: linux-kernel Ok - not the firmware. Ran another Windows backup and sky2 went down. Nothing in dmesg.old - have oops in syslog. System became unresponsive and watchdog kicked in after a minute. Also note that I have a similar oops with VT-D disabled (posted here on 12/5). I'm attaching the oops from that below this oops for comparison. That also happened under similar load. On the assumption that I can recreate this (although it takes a while) please let me know how I can help. What's in my log (starting with an smbd error about 2 min before the oops (note: the dchpd is not the system doing the backup). Dec 23 17:31:05 mail smbd[20712]: [2009/12/23 17:31:05, 0] smbd/nttrans.c:2076(call_nt_transact_ioctl) Dec 23 17:31:05 mail smbd[20712]: call_nt_transact_ioctl(0x900eb): Currently not implemented. Dec 23 17:31:34 mail clamd[3449]: SelfCheck: Database status OK. Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view External: query (cache) './NS/IN' denied Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view External: query (cache) './NS/IN' denied Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view External: query (cache) './NS/IN' denied Dec 23 17:32:54 mail dhcpd: DHCPINFORM from 10.0.0.3 via eth0 Dec 23 17:32:54 mail dhcpd: DHCPACK to 10.0.0.3 (00:21:5d:cb:a1:74) via eth0 Dec 23 17:32:58 mail kernel: DRHD: handling fault status reg 2 Dec 23 17:32:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr fff8423fe000 Dec 23 17:32:58 mail kernel: DMAR:[fault reason 06] PTE Read access is not set Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: error interrupt status=0x80000000 Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) Dec 23 17:33:06 mail kernel: ------------[ cut here ]------------ Dec 23 17:33:06 mail kernel: WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0xf3/0x161() Dec 23 17:33:06 mail kernel: Hardware name: System Product Name Dec 23 17:33:06 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out Dec 23 17:33:06 mail kernel: Modules linked in: ip6table_filter ip6table_mangle ip6_tables iptable_raw iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc hwmon_vid coretemp acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_hda_intel snd_hda_codec snd_rawmidi snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer gspca_spca505 snd gspca_main soundcore snd_page_alloc videodev firewire_ohci sky2 v4l1_compat v4l2_compat_ioctl32 firewire_core iTCO_wdt wmi asus_atk0110 iTCO_vendor_support i2c_i801 hwmon pcspkr crc_itu_t raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: ip6_tables] Dec 23 17:33:06 mail kernel: Pid: 0, comm: swapper Tainted: G W 2.6.32-09281-g1c42cda #6 Dec 23 17:33:06 mail kernel: Call Trace: Dec 23 17:33:06 mail kernel: <IRQ> [<ffffffff8104afbd>] warn_slowpath_common+0x7c/0x94 Dec 23 17:33:06 mail kernel: [<ffffffff8104b02c>] warn_slowpath_fmt+0x41/0x43 Dec 23 17:33:06 mail kernel: [<ffffffff813d096f>] ? netif_tx_lock+0x44/0x6c Dec 23 17:33:06 mail kernel: [<ffffffff813d0ad7>] dev_watchdog+0xf3/0x161 Dec 23 17:33:06 mail kernel: [<ffffffff8105610c>] ? cascade+0x6a/0x84 Dec 23 17:33:06 mail kernel: [<ffffffff81058558>] run_timer_softirq+0x1c8/0x270 Dec 23 17:33:06 mail kernel: [<ffffffff81051595>] __do_softirq+0xf8/0x1cd Dec 23 17:33:06 mail kernel: [<ffffffff810720af>] ? tick_program_event+0x2a/0x2c Dec 23 17:33:06 mail kernel: [<ffffffff8100ab1c>] call_softirq+0x1c/0x30 Dec 23 17:33:06 mail kernel: [<ffffffff8100c2b3>] do_softirq+0x4b/0xa3 Dec 23 17:33:06 mail kernel: [<ffffffff81051181>] irq_exit+0x4a/0x8c Dec 23 17:33:06 mail kernel: [<ffffffff8145ed99>] smp_apic_timer_interrupt+0x86/0x94 Dec 23 17:33:06 mail kernel: [<ffffffff8100a5d3>] apic_timer_interrupt+0x13/0x20 Dec 23 17:33:06 mail kernel: <EOI> [<ffffffff812ae5fa>] ? acpi_idle_enter_bm+0x256/0x28a Dec 23 17:33:06 mail kernel: [<ffffffff812ae5f3>] ? acpi_idle_enter_bm+0x24f/0x28a Dec 23 17:33:06 mail kernel: [<ffffffff81393308>] cpuidle_idle_call+0x9e/0xfa Dec 23 17:33:06 mail kernel: [<ffffffff81008c05>] cpu_idle+0xb4/0xf6 Dec 23 17:33:06 mail kernel: [<ffffffff814532b2>] start_secondary+0x201/0x242 Dec 23 17:33:06 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- Dec 23 17:33:06 mail kernel: sky2 eth0: tx timeout Dec 23 17:33:06 mail kernel: sky2 eth0: disabling interface Dec 23 17:33:06 mail kernel: sky2 eth0: enabling interface Dec 23 17:33:06 mail NetworkManager: <info> (eth0): carrier now OFF (device state 1) Dec 23 17:33:11 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both Dec 23 17:33:11 mail NetworkManager: <info> (eth0): carrier now ON (device state 1) Dec 23 17:33:26 mail kernel: sky2 eth0: tx timeout Dec 23 17:33:26 mail kernel: sky2 eth0: disabling interface Dec 23 17:33:26 mail kernel: sky2 eth0: enabling interface Dec 23 17:33:26 mail NetworkManager: <info> (eth0): carrier now OFF (device state 1) Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view External: query (cache) './NS/IN' denied Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view External: query (cache) './NS/IN' denied Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view External: query (cache) './NS/IN' denied Dec 23 17:33:29 mail named[4284]: client 114.80.99.2#44695: view External: query (cache) './NS/IN' denied Dec 23 17:33:29 mail named[4284]: client 114.80.99.2#44695: view External: query (cache) './NS/IN' denied Dec 23 17:33:29 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both Dec 23 17:33:29 mail NetworkManager: <info> (eth0): carrier now ON (device state 1) Dec 23 17:33:30 mail named[4284]: client 114.80.99.2#44695: view External: query (cache) './NS/IN' denied Dec 23 17:33:38 mail named[4284]: error (connection refused) resolving '121.93.27.216.blackholes.mail-abuse.org/A/IN': 168.61.5.195#53 Dec 23 17:33:44 mail kernel: sky2 eth0: tx timeout Dec 23 17:33:44 mail kernel: sky2 eth0: disabling interface Dec 23 17:33:44 mail kernel: sky2 eth0: enabling interface Dec 23 17:33:44 mail NetworkManager: <info> (eth0): carrier now OFF (device state 1) Dec 23 17:33:49 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both Dec 23 17:33:49 mail NetworkManager: <info> (eth0): carrier now ON (device state 1) Dec 23 17:33:52 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0 Dec 23 17:33:52 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via eth0 Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] lib/util_sock.c:539(read_fd_with_timeout) Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] lib/util_sock.c:1491(get_peer_addr_internal) Dec 23 17:34:01 mail smbd[24783]: getpeername failed. Error was Transport endpoint is not connected Dec 23 17:34:01 mail smbd[24783]: read_fd_with_timeout: client 0.0.0.0 read error = Connection reset by peer. Dec 23 17:34:04 mail abrt: Kerneloops: Reported 1 kernel oopses to Abrt Dec 23 17:34:04 mail abrtd: Directory 'kerneloops-1261607644-1' creation detected Dec 23 17:34:04 mail abrtd: Getting local universal unique identification Dec 23 17:34:04 mail abrtd: New crash, saving Dec 23 17:34:04 mail abrtd: Activation of plugin 'RunApp' was not successful: Plugin 'RunApp' is not registered Dec 23 17:34:04 mail kernel: sky2 eth0: tx timeout Dec 23 17:34:04 mail kernel: sky2 eth0: disabling interface Dec 23 17:34:04 mail kernel: sky2 eth0: enabling interface Dec 23 17:34:04 mail NetworkManager: <info> (eth0): carrier now OFF (device state 1) Dec 23 17:34:09 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both Dec 23 17:34:09 mail NetworkManager: <info> (eth0): carrier now ON (device state 1) Dec 23 17:34:11 mail logger: ping of potter failed Dec 23 17:34:19 mail kernel: sky2 eth0: tx timeout Dec 23 17:34:19 mail kernel: sky2 eth0: disabling interface Dec 23 17:34:19 mail kernel: sky2 eth0: enabling interface Dec 23 17:34:19 mail NetworkManager: <info> (eth0): carrier now OFF (device state 1) Dec 23 17:34:22 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both Dec 23 17:34:22 mail NetworkManager: <info> (eth0): carrier now ON (device state 1) Dec 23 17:37:40 mail kernel: imklog 4.4.2, log source = /proc/kmsg started. Oops from 12/5 (no vt-d; no intel_iommu, firmware .12) Dec 5 11:27:41 mail kernel: ------------[ cut here ]------------ Dec 5 11:27:41 mail kernel: WARNING: at lib/dma-debug.c:898 check_sync+0xbd/0x426() Dec 5 11:27:41 mail kernel: Hardware name: System Product Name Dec 5 11:27:41 mail kernel: sky2 0000:06:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x000000031a57e022] [size=66 bytes] Dec 5 11:27:41 mail kernel: Modules linked in: bridge stp appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc hwmon_vid coretemp acpi_cpufreq sit tunnel4 ipt_LOG ipt_MASQUERADE iptable_nat nf_nat iptable_mangle iptable_raw nf_conntrack_netbios_ns nf_conntrack_ftp nf_conntrack_ipv6 xt_multiport ip6table_filter xt_DSCP xt_dscp xt_MARK ip6table_mangle ip6_tables ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec snd_hda_intel snd_hda_codec ac97_bus gspca_spca505 snd_hwdep snd_seq snd_seq_device firewire_ohci gspca_main snd_pcm videodev firewire_core sky2 snd_timer v4l1_compat wmi snd v4l2_compat_ioctl32 soundcore crc_itu_t snd_page_alloc asus_atk0110 i2c_i801 iTCO_wdt iTCO_vendor_support pcspkr hwmon raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: scsi_wait_scan] Dec 5 11:27:41 mail kernel: Pid: 0, comm: swapper Not tainted 2.6.32-00573-g6cf4d67 #2 Dec 5 11:27:41 mail kernel: Call Trace: Dec 5 11:27:41 mail kernel: <IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94 Dec 5 11:27:41 mail kernel: [<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43 Dec 5 11:27:41 mail kernel: [<ffffffff8127b1fd>] check_sync+0xbd/0x426 Dec 5 11:27:41 mail kernel: [<ffffffff813ca254>] ? __netdev_alloc_skb+0x34/0x50 Dec 5 11:27:41 mail kernel: [<ffffffff8127b8b9>] debug_dma_sync_single_for_cpu+0x42/0x44 Dec 5 11:27:41 mail kernel: [<ffffffff81278c5f>] ? swiotlb_sync_single+0x2a/0xb6 Dec 5 11:27:41 mail kernel: [<ffffffff81278dbb>] ? swiotlb_sync_single_for_cpu+0xc/0xe Dec 5 11:27:41 mail kernel: [<ffffffffa01cee83>] sky2_poll+0x4c6/0xae1 [sky2] Dec 5 11:27:41 mail kernel: [<ffffffff8146bb1a>] ? _spin_unlock_irqrestore+0x29/0x41 Dec 5 11:27:41 mail kernel: [<ffffffff813d0ef2>] net_rx_action+0xb5/0x1f3 Dec 5 11:27:41 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd Dec 5 11:27:41 mail kernel: [<ffffffff810a2dea>] ? handle_IRQ_event+0x119/0x12b Dec 5 11:27:41 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 Dec 5 11:27:41 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 Dec 5 11:27:41 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c Dec 5 11:27:41 mail kernel: [<ffffffff8146fb75>] do_IRQ+0xa5/0xbc Dec 5 11:27:41 mail kernel: [<ffffffff81012613>] ret_from_intr+0x0/0x16 Dec 5 11:27:41 mail kernel: <EOI> [<ffffffff812c6b3e>] ? acpi_idle_enter_bm+0x256/0x28a Dec 5 11:27:41 mail kernel: [<ffffffff812c6b37>] ? acpi_idle_enter_bm+0x24f/0x28a Dec 5 11:27:41 mail kernel: [<ffffffff813a6280>] ? cpuidle_idle_call+0x9e/0xfa Dec 5 11:27:41 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 Dec 5 11:27:41 mail kernel: [<ffffffff814651f2>] ? start_secondary+0x201/0x242 Dec 5 11:27:41 mail kernel: ---[ end trace ff8485ac82013a8d ]--- On 12/22/2009 01:50 PM, Michael Breuer wrote: > I updated my firmware (Marvell/Yukon)) from .12 to .14. Going to leave > my kernel version unchanged for the next week or so and see if this > recurs. > On 12/21/2009 7:09 PM, Michael Breuer wrote: >> Just an FYI - I got one of these today as well - also 2.6.32 (last >> git before 2.6.33 rc1). I also had two similar events previously - >> one 2.6.31, and one 2.6.32 rc7. >> >> I was doing some research today and came across a fair sampling of >> error reports all suggesting that new firmware is a solution. >> Gigabyte (not my MB) has firmware posted, but I haven't yet attempted >> to apply it (nor am I sure how to test it after doing so). >> >> For what it's worth, my three recent events were heavy smb usage - >> windows system backing up. Today's event did not generate an oops as >> my watchdog kicked in and rebooted the system after the kernel >> reported a sky2 error. >> >> The last messages before watchdog: >> Dec 20 05:33:58 mail kernel: DRHD: handling fault status reg 2 >> Dec 20 05:33:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] >> fault addr ffe1a7ffe000 >> Dec 20 05:33:58 mail kernel: DMAR:[fault reason 06] PTE Read access >> is not set >> Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x80000000 >> Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: PCI hardware error >> (0x2010) >> >> During the previous event I had vt-d disabled - it was enabled this >> time. >> >> >> On 12/21/2009 6:52 PM, Berck E. Nash wrote: >>> Since 2.6.32, I've been getting kernel panics under heavy network load >>> (bittorrent usage). >>> >>> [ 8673.345873] sky2 eth0: receiver hang detected >>> [ 8673.350368] sky2 eth0: disabling interface >>> [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at >>> 0000000000000010 >>> [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 >>> [sky2] >>> [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 >>> [ 8673.373838] Oops: 0002 [#1] SMP >>> [ 8673.373838] last sysfs file: >>> /sys/devices/platform/coretemp.1/temp1_label >>> [ 8673.383900] CPU 0 >>> [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl >>> auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer >>> hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 >>> snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec >>> snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss >>> snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common >>> snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd >>> v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 >>> usbcore tveeprom >>> [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 >>> #3 P5W DH Deluxe >>> [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] >>> sky2_xmit_frame+0x321/0x5d8 [sky2] >>> [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 >>> [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: >>> 0000000000000000 >>> [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: >>> 0000000000000000 >>> [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: >>> 0000000000000000 >>> [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: >>> ffff88009bf3ab48 >>> [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: >>> 00000000955fa87e >>> [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) >>> knlGS:0000000000000000 >>> [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b >>> [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: >>> 00000000000026f0 >>> [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >>> 0000000000000000 >>> [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: >>> 0000000000000400 >>> [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo >>> ffff8800bf8b4000, >>> task ffff8800bf8a8650) >>> [ 8673.539212] Stack: >>> [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 >>> 000000428122a659 >>> [ 8673.548658]<0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 >>> ffffffff813eaf30 >>> [ 8673.556929]<0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 >>> ffffffff81231de8 >>> [ 8673.565432] Call Trace: >>> [ 8673.565432]<IRQ> >>> [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >>> [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >>> [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >>> [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >>> [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >>> [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >>> [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >>> [ 8673.609833]<EOI> >>> [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b >>> [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >>> [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >>> [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 >>> [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >>> [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >>> [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >>> [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 >>> c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 >>> 4b 20<4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b >>> [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 >>> [sky2] >>> [ 8673.674902] RSP<ffff880001603df8> >>> [ 8673.674902] CR2: 0000000000000010 >>> [ 8673.682956] hda-intel: IRQ timing workaround is activated for card >>> #0. Suggest a bigger bdl_pos_adj. >>> [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- >>> [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt >>> [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D >>> 2.6.32.1 #3 >>> [ 8673.683021] Call Trace: >>> [ 8673.683023]<IRQ> [<ffffffff8129839d>] panic+0x75/0x11c >>> [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e >>> [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd >>> [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 >>> [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 >>> [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 >>> [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 >>> [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 >>> [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 >>> [sky2] >>> [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 >>> [sky2] >>> [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >>> [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >>> [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >>> [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >>> [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >>> [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >>> [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >>> [ 8673.683079]<EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b >>> [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >>> [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >>> [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 >>> [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >>> [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >>> [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >>> >>> Note that sometimes I get the receiver hang and things continue without >>> the crash: >>> >>> [ 1219.346571] sky2 eth0: receiver hang detected >>> [ 1219.351145] sky2 eth0: disabling interface >>> [ 1219.357389] sky2 eth0: enabling interface >>> [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow >>> control both >>> >>> Please CC as I'm not subscribed. >>> >>> Thanks, >>> Berck Nash >>> -- >>> To unsubscribe from this list > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-23 22:54 ` sky2 panic in 2.6.32.1 under load (new oops) Michael Breuer @ 2009-12-24 7:01 ` Andrew Morton 2009-12-24 19:18 ` Michael Breuer ` (2 more replies) 0 siblings, 3 replies; 145+ messages in thread From: Andrew Morton @ 2009-12-24 7:01 UTC (permalink / raw) To: Michael Breuer; +Cc: Berck E. Nash, linux-kernel, netdev, Stephen Hemminger cc's added again. On Wed, 23 Dec 2009 17:54:27 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > Ok - not the firmware. Ran another Windows backup and sky2 went down. > > Nothing in dmesg.old - have oops in syslog. System became unresponsive > and watchdog kicked in after a minute. > > Also note that I have a similar oops with VT-D disabled (posted here on > 12/5). I'm attaching the oops from that below this oops for comparison. > That also happened under similar load. > > On the assumption that I can recreate this (although it takes a while) > please let me know how I can help. > > What's in my log (starting with an smbd error about 2 min before the > oops (note: the dchpd is not the system doing the backup). This (nastily wordwrapped) oops appers to be quite different from Berck's one. > Dec 23 17:31:05 mail smbd[20712]: [2009/12/23 17:31:05, 0] > smbd/nttrans.c:2076(call_nt_transact_ioctl) > Dec 23 17:31:05 mail smbd[20712]: call_nt_transact_ioctl(0x900eb): > Currently not implemented. > Dec 23 17:31:34 mail clamd[3449]: SelfCheck: Database status OK. > Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view > External: query (cache) './NS/IN' denied > Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view > External: query (cache) './NS/IN' denied > Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view > External: query (cache) './NS/IN' denied > Dec 23 17:32:54 mail dhcpd: DHCPINFORM from 10.0.0.3 via eth0 > Dec 23 17:32:54 mail dhcpd: DHCPACK to 10.0.0.3 (00:21:5d:cb:a1:74) via eth0 > Dec 23 17:32:58 mail kernel: DRHD: handling fault status reg 2 > Dec 23 17:32:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] > fault addr fff8423fe000 > Dec 23 17:32:58 mail kernel: DMAR:[fault reason 06] PTE Read access is > not set > Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x80000000 > Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) > Dec 23 17:33:06 mail kernel: ------------[ cut here ]------------ > Dec 23 17:33:06 mail kernel: WARNING: at net/sched/sch_generic.c:255 > dev_watchdog+0xf3/0x161() > Dec 23 17:33:06 mail kernel: Hardware name: System Product Name > Dec 23 17:33:06 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit > queue 0 timed out > Dec 23 17:33:06 mail kernel: Modules linked in: ip6table_filter > ip6table_mangle ip6_tables iptable_raw iptable_mangle ipt_MASQUERADE > iptable_nat nf_nat bridge stp appletalk psnap llc nfsd lockd nfs_acl > auth_rpcgss exportfs sunrpc hwmon_vid coretemp acpi_cpufreq sit tunnel4 > ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK > nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm > snd_hda_codec_analog snd_ens1371 gameport snd_hda_intel snd_hda_codec > snd_rawmidi snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_seq_device > snd_pcm snd_timer gspca_spca505 snd gspca_main soundcore snd_page_alloc > videodev firewire_ohci sky2 v4l1_compat v4l2_compat_ioctl32 > firewire_core iTCO_wdt wmi asus_atk0110 iTCO_vendor_support i2c_i801 > hwmon pcspkr crc_itu_t raid456 async_raid6_recov async_pq raid6_pq > async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi > pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit > cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: ip6_tables] > Dec 23 17:33:06 mail kernel: Pid: 0, comm: swapper Tainted: G W > 2.6.32-09281-g1c42cda #6 > Dec 23 17:33:06 mail kernel: Call Trace: > Dec 23 17:33:06 mail kernel: <IRQ> [<ffffffff8104afbd>] > warn_slowpath_common+0x7c/0x94 > Dec 23 17:33:06 mail kernel: [<ffffffff8104b02c>] > warn_slowpath_fmt+0x41/0x43 > Dec 23 17:33:06 mail kernel: [<ffffffff813d096f>] ? netif_tx_lock+0x44/0x6c > Dec 23 17:33:06 mail kernel: [<ffffffff813d0ad7>] dev_watchdog+0xf3/0x161 > Dec 23 17:33:06 mail kernel: [<ffffffff8105610c>] ? cascade+0x6a/0x84 > Dec 23 17:33:06 mail kernel: [<ffffffff81058558>] > run_timer_softirq+0x1c8/0x270 > Dec 23 17:33:06 mail kernel: [<ffffffff81051595>] __do_softirq+0xf8/0x1cd > Dec 23 17:33:06 mail kernel: [<ffffffff810720af>] ? > tick_program_event+0x2a/0x2c > Dec 23 17:33:06 mail kernel: [<ffffffff8100ab1c>] call_softirq+0x1c/0x30 > Dec 23 17:33:06 mail kernel: [<ffffffff8100c2b3>] do_softirq+0x4b/0xa3 > Dec 23 17:33:06 mail kernel: [<ffffffff81051181>] irq_exit+0x4a/0x8c > Dec 23 17:33:06 mail kernel: [<ffffffff8145ed99>] > smp_apic_timer_interrupt+0x86/0x94 > Dec 23 17:33:06 mail kernel: [<ffffffff8100a5d3>] > apic_timer_interrupt+0x13/0x20 > Dec 23 17:33:06 mail kernel: <EOI> [<ffffffff812ae5fa>] ? > acpi_idle_enter_bm+0x256/0x28a > Dec 23 17:33:06 mail kernel: [<ffffffff812ae5f3>] ? > acpi_idle_enter_bm+0x24f/0x28a > Dec 23 17:33:06 mail kernel: [<ffffffff81393308>] > cpuidle_idle_call+0x9e/0xfa > Dec 23 17:33:06 mail kernel: [<ffffffff81008c05>] cpu_idle+0xb4/0xf6 > Dec 23 17:33:06 mail kernel: [<ffffffff814532b2>] > start_secondary+0x201/0x242 > Dec 23 17:33:06 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- > Dec 23 17:33:06 mail kernel: sky2 eth0: tx timeout > Dec 23 17:33:06 mail kernel: sky2 eth0: disabling interface > Dec 23 17:33:06 mail kernel: sky2 eth0: enabling interface > Dec 23 17:33:06 mail NetworkManager: <info> (eth0): carrier now OFF > (device state 1) > Dec 23 17:33:11 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full > duplex, flow control both > Dec 23 17:33:11 mail NetworkManager: <info> (eth0): carrier now ON > (device state 1) > Dec 23 17:33:26 mail kernel: sky2 eth0: tx timeout > Dec 23 17:33:26 mail kernel: sky2 eth0: disabling interface > Dec 23 17:33:26 mail kernel: sky2 eth0: enabling interface > Dec 23 17:33:26 mail NetworkManager: <info> (eth0): carrier now OFF > (device state 1) > Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view > External: query (cache) './NS/IN' denied > Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view > External: query (cache) './NS/IN' denied > Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view > External: query (cache) './NS/IN' denied > Dec 23 17:33:29 mail named[4284]: client 114.80.99.2#44695: view > External: query (cache) './NS/IN' denied > Dec 23 17:33:29 mail named[4284]: client 114.80.99.2#44695: view > External: query (cache) './NS/IN' denied > Dec 23 17:33:29 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full > duplex, flow control both > Dec 23 17:33:29 mail NetworkManager: <info> (eth0): carrier now ON > (device state 1) > Dec 23 17:33:30 mail named[4284]: client 114.80.99.2#44695: view > External: query (cache) './NS/IN' denied > Dec 23 17:33:38 mail named[4284]: error (connection refused) resolving > '121.93.27.216.blackholes.mail-abuse.org/A/IN': 168.61.5.195#53 > Dec 23 17:33:44 mail kernel: sky2 eth0: tx timeout > Dec 23 17:33:44 mail kernel: sky2 eth0: disabling interface > Dec 23 17:33:44 mail kernel: sky2 eth0: enabling interface > Dec 23 17:33:44 mail NetworkManager: <info> (eth0): carrier now OFF > (device state 1) > Dec 23 17:33:49 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full > duplex, flow control both > Dec 23 17:33:49 mail NetworkManager: <info> (eth0): carrier now ON > (device state 1) > Dec 23 17:33:52 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0 > Dec 23 17:33:52 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via > eth0 > Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] > lib/util_sock.c:539(read_fd_with_timeout) > Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] > lib/util_sock.c:1491(get_peer_addr_internal) > Dec 23 17:34:01 mail smbd[24783]: getpeername failed. Error was > Transport endpoint is not connected > Dec 23 17:34:01 mail smbd[24783]: read_fd_with_timeout: client 0.0.0.0 > read error = Connection reset by peer. > Dec 23 17:34:04 mail abrt: Kerneloops: Reported 1 kernel oopses to Abrt > Dec 23 17:34:04 mail abrtd: Directory 'kerneloops-1261607644-1' creation > detected > Dec 23 17:34:04 mail abrtd: Getting local universal unique identification > Dec 23 17:34:04 mail abrtd: New crash, saving > Dec 23 17:34:04 mail abrtd: Activation of plugin 'RunApp' was not > successful: Plugin 'RunApp' is not registered > Dec 23 17:34:04 mail kernel: sky2 eth0: tx timeout > Dec 23 17:34:04 mail kernel: sky2 eth0: disabling interface > Dec 23 17:34:04 mail kernel: sky2 eth0: enabling interface > Dec 23 17:34:04 mail NetworkManager: <info> (eth0): carrier now OFF > (device state 1) > Dec 23 17:34:09 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full > duplex, flow control both > Dec 23 17:34:09 mail NetworkManager: <info> (eth0): carrier now ON > (device state 1) > Dec 23 17:34:11 mail logger: ping of potter failed > Dec 23 17:34:19 mail kernel: sky2 eth0: tx timeout > Dec 23 17:34:19 mail kernel: sky2 eth0: disabling interface > Dec 23 17:34:19 mail kernel: sky2 eth0: enabling interface > Dec 23 17:34:19 mail NetworkManager: <info> (eth0): carrier now OFF > (device state 1) > Dec 23 17:34:22 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full > duplex, flow control both > Dec 23 17:34:22 mail NetworkManager: <info> (eth0): carrier now ON > (device state 1) > Dec 23 17:37:40 mail kernel: imklog 4.4.2, log source = /proc/kmsg started. > > Oops from 12/5 (no vt-d; no intel_iommu, firmware .12) > > Dec 5 11:27:41 mail kernel: ------------[ cut here ]------------ > Dec 5 11:27:41 mail kernel: WARNING: at lib/dma-debug.c:898 > check_sync+0xbd/0x426() > Dec 5 11:27:41 mail kernel: Hardware name: System Product Name > Dec 5 11:27:41 mail kernel: sky2 0000:06:00.0: DMA-API: device driver > tries to sync DMA memory it has not allocated [device > address=0x000000031a57e022] [size=66 bytes] > Dec 5 11:27:41 mail kernel: Modules linked in: bridge stp appletalk > psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc hwmon_vid > coretemp acpi_cpufreq sit tunnel4 ipt_LOG ipt_MASQUERADE iptable_nat > nf_nat iptable_mangle iptable_raw nf_conntrack_netbios_ns > nf_conntrack_ftp nf_conntrack_ipv6 xt_multiport ip6table_filter xt_DSCP > xt_dscp xt_MARK ip6table_mangle ip6_tables ipv6 dm_multipath kvm_intel > kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec > snd_hda_intel snd_hda_codec ac97_bus gspca_spca505 snd_hwdep snd_seq > snd_seq_device firewire_ohci gspca_main snd_pcm videodev firewire_core > sky2 snd_timer v4l1_compat wmi snd v4l2_compat_ioctl32 soundcore > crc_itu_t snd_page_alloc asus_atk0110 i2c_i801 iTCO_wdt > iTCO_vendor_support pcspkr hwmon raid456 async_raid6_recov async_pq > raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi > pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit > cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: scsi_wait_scan] > Dec 5 11:27:41 mail kernel: Pid: 0, comm: swapper Not tainted > 2.6.32-00573-g6cf4d67 #2 > Dec 5 11:27:41 mail kernel: Call Trace: > Dec 5 11:27:41 mail kernel: <IRQ> [<ffffffff8105365a>] > warn_slowpath_common+0x7c/0x94 > Dec 5 11:27:41 mail kernel: [<ffffffff810536c9>] > warn_slowpath_fmt+0x41/0x43 > Dec 5 11:27:41 mail kernel: [<ffffffff8127b1fd>] check_sync+0xbd/0x426 > Dec 5 11:27:41 mail kernel: [<ffffffff813ca254>] ? > __netdev_alloc_skb+0x34/0x50 > Dec 5 11:27:41 mail kernel: [<ffffffff8127b8b9>] > debug_dma_sync_single_for_cpu+0x42/0x44 > Dec 5 11:27:41 mail kernel: [<ffffffff81278c5f>] ? > swiotlb_sync_single+0x2a/0xb6 > Dec 5 11:27:41 mail kernel: [<ffffffff81278dbb>] ? > swiotlb_sync_single_for_cpu+0xc/0xe > Dec 5 11:27:41 mail kernel: [<ffffffffa01cee83>] sky2_poll+0x4c6/0xae1 > [sky2] > Dec 5 11:27:41 mail kernel: [<ffffffff8146bb1a>] ? > _spin_unlock_irqrestore+0x29/0x41 > Dec 5 11:27:41 mail kernel: [<ffffffff813d0ef2>] net_rx_action+0xb5/0x1f3 > Dec 5 11:27:41 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd > Dec 5 11:27:41 mail kernel: [<ffffffff810a2dea>] ? > handle_IRQ_event+0x119/0x12b > Dec 5 11:27:41 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 > Dec 5 11:27:41 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 > Dec 5 11:27:41 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c > Dec 5 11:27:41 mail kernel: [<ffffffff8146fb75>] do_IRQ+0xa5/0xbc > Dec 5 11:27:41 mail kernel: [<ffffffff81012613>] ret_from_intr+0x0/0x16 > Dec 5 11:27:41 mail kernel: <EOI> [<ffffffff812c6b3e>] ? > acpi_idle_enter_bm+0x256/0x28a > Dec 5 11:27:41 mail kernel: [<ffffffff812c6b37>] ? > acpi_idle_enter_bm+0x24f/0x28a > Dec 5 11:27:41 mail kernel: [<ffffffff813a6280>] ? > cpuidle_idle_call+0x9e/0xfa > Dec 5 11:27:41 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 > Dec 5 11:27:41 mail kernel: [<ffffffff814651f2>] ? > start_secondary+0x201/0x242 > Dec 5 11:27:41 mail kernel: ---[ end trace ff8485ac82013a8d ]--- > On 12/22/2009 01:50 PM, Michael Breuer wrote: > > I updated my firmware (Marvell/Yukon)) from .12 to .14. Going to leave > > my kernel version unchanged for the next week or so and see if this > > recurs. > > On 12/21/2009 7:09 PM, Michael Breuer wrote: > >> Just an FYI - I got one of these today as well - also 2.6.32 (last > >> git before 2.6.33 rc1). I also had two similar events previously - > >> one 2.6.31, and one 2.6.32 rc7. > >> > >> I was doing some research today and came across a fair sampling of > >> error reports all suggesting that new firmware is a solution. > >> Gigabyte (not my MB) has firmware posted, but I haven't yet attempted > >> to apply it (nor am I sure how to test it after doing so). > >> > >> For what it's worth, my three recent events were heavy smb usage - > >> windows system backing up. Today's event did not generate an oops as > >> my watchdog kicked in and rebooted the system after the kernel > >> reported a sky2 error. > >> > >> The last messages before watchdog: > >> Dec 20 05:33:58 mail kernel: DRHD: handling fault status reg 2 > >> Dec 20 05:33:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] > >> fault addr ffe1a7ffe000 > >> Dec 20 05:33:58 mail kernel: DMAR:[fault reason 06] PTE Read access > >> is not set > >> Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: error interrupt > >> status=0x80000000 > >> Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: PCI hardware error > >> (0x2010) > >> > >> During the previous event I had vt-d disabled - it was enabled this > >> time. > >> > >> > >> On 12/21/2009 6:52 PM, Berck E. Nash wrote: > >>> Since 2.6.32, I've been getting kernel panics under heavy network load > >>> (bittorrent usage). > >>> > >>> [ 8673.345873] sky2 eth0: receiver hang detected > >>> [ 8673.350368] sky2 eth0: disabling interface > >>> [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at > >>> 0000000000000010 > >>> [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 > >>> [sky2] > >>> [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 > >>> [ 8673.373838] Oops: 0002 [#1] SMP > >>> [ 8673.373838] last sysfs file: > >>> /sys/devices/platform/coretemp.1/temp1_label > >>> [ 8673.383900] CPU 0 > >>> [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl > >>> auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer > >>> hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 > >>> snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec > >>> snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss > >>> snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common > >>> snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd > >>> v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 > >>> usbcore tveeprom > >>> [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 > >>> #3 P5W DH Deluxe > >>> [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] > >>> sky2_xmit_frame+0x321/0x5d8 [sky2] > >>> [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 > >>> [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: > >>> 0000000000000000 > >>> [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: > >>> 0000000000000000 > >>> [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: > >>> 0000000000000000 > >>> [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: > >>> ffff88009bf3ab48 > >>> [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: > >>> 00000000955fa87e > >>> [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) > >>> knlGS:0000000000000000 > >>> [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > >>> [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: > >>> 00000000000026f0 > >>> [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > >>> 0000000000000000 > >>> [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: > >>> 0000000000000400 > >>> [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo > >>> ffff8800bf8b4000, > >>> task ffff8800bf8a8650) > >>> [ 8673.539212] Stack: > >>> [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 > >>> 000000428122a659 > >>> [ 8673.548658]<0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 > >>> ffffffff813eaf30 > >>> [ 8673.556929]<0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 > >>> ffffffff81231de8 > >>> [ 8673.565432] Call Trace: > >>> [ 8673.565432]<IRQ> > >>> [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 > >>> [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 > >>> [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 > >>> [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 > >>> [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e > >>> [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b > >>> [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 > >>> [ 8673.609833]<EOI> > >>> [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b > >>> [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 > >>> [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 > >>> [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 > >>> [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 > >>> [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 > >>> [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 > >>> [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 > >>> c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 > >>> 4b 20<4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b > >>> [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 > >>> [sky2] > >>> [ 8673.674902] RSP<ffff880001603df8> > >>> [ 8673.674902] CR2: 0000000000000010 > >>> [ 8673.682956] hda-intel: IRQ timing workaround is activated for card > >>> #0. Suggest a bigger bdl_pos_adj. > >>> [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- > >>> [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt > >>> [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D > >>> 2.6.32.1 #3 > >>> [ 8673.683021] Call Trace: > >>> [ 8673.683023]<IRQ> [<ffffffff8129839d>] panic+0x75/0x11c > >>> [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e > >>> [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd > >>> [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 > >>> [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 > >>> [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 > >>> [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 > >>> [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 > >>> [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 > >>> [sky2] > >>> [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 > >>> [sky2] > >>> [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 > >>> [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 > >>> [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 > >>> [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 > >>> [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e > >>> [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b > >>> [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 > >>> [ 8673.683079]<EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b > >>> [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 > >>> [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 > >>> [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 > >>> [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 > >>> [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 > >>> [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 > >>> > >>> Note that sometimes I get the receiver hang and things continue without > >>> the crash: > >>> > >>> [ 1219.346571] sky2 eth0: receiver hang detected > >>> [ 1219.351145] sky2 eth0: disabling interface > >>> [ 1219.357389] sky2 eth0: enabling interface > >>> [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow > >>> control both > >>> > >>> Please CC as I'm not subscribed. > >>> > >>> Thanks, > >>> Berck Nash > >>> -- > >>> To unsubscribe from this list > > > > -- > > To unsubscribe from this list: send the line "unsubscribe > > linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-24 7:01 ` Andrew Morton @ 2009-12-24 19:18 ` Michael Breuer 2009-12-24 22:27 ` Stephen Hemminger 2009-12-25 16:28 ` Michael Breuer 2009-12-30 8:07 ` Stephen Hemminger 2 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-24 19:18 UTC (permalink / raw) To: Andrew Morton; +Cc: Berck E. Nash, linux-kernel, netdev, Stephen Hemminger I'll be the first to admit I don't know much about network drivers... but I am wondering why __free_page(frag->page); is called in skb_put_frags when the fragment length is zero. I don't see why that page pointer wouldn't potentially be reused afterwords. On 12/24/2009 2:01 AM, Andrew Morton wrote: > cc's added again. > > On Wed, 23 Dec 2009 17:54:27 -0500 Michael Breuer<mbreuer@majjas.com> wrote: > > >> Ok - not the firmware. Ran another Windows backup and sky2 went down. >> >> Nothing in dmesg.old - have oops in syslog. System became unresponsive >> and watchdog kicked in after a minute. >> >> Also note that I have a similar oops with VT-D disabled (posted here on >> 12/5). I'm attaching the oops from that below this oops for comparison. >> That also happened under similar load. >> >> On the assumption that I can recreate this (although it takes a while) >> please let me know how I can help. >> >> What's in my log (starting with an smbd error about 2 min before the >> oops (note: the dchpd is not the system doing the backup). >> > This (nastily wordwrapped) oops appers to be quite different from > Berck's one. > > > > >> Dec 23 17:31:05 mail smbd[20712]: [2009/12/23 17:31:05, 0] >> smbd/nttrans.c:2076(call_nt_transact_ioctl) >> Dec 23 17:31:05 mail smbd[20712]: call_nt_transact_ioctl(0x900eb): >> Currently not implemented. >> Dec 23 17:31:34 mail clamd[3449]: SelfCheck: Database status OK. >> Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:32:54 mail dhcpd: DHCPINFORM from 10.0.0.3 via eth0 >> Dec 23 17:32:54 mail dhcpd: DHCPACK to 10.0.0.3 (00:21:5d:cb:a1:74) via eth0 >> Dec 23 17:32:58 mail kernel: DRHD: handling fault status reg 2 >> Dec 23 17:32:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] >> fault addr fff8423fe000 >> Dec 23 17:32:58 mail kernel: DMAR:[fault reason 06] PTE Read access is >> not set >> Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x80000000 >> Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) >> Dec 23 17:33:06 mail kernel: ------------[ cut here ]------------ >> Dec 23 17:33:06 mail kernel: WARNING: at net/sched/sch_generic.c:255 >> dev_watchdog+0xf3/0x161() >> Dec 23 17:33:06 mail kernel: Hardware name: System Product Name >> Dec 23 17:33:06 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit >> queue 0 timed out >> Dec 23 17:33:06 mail kernel: Modules linked in: ip6table_filter >> ip6table_mangle ip6_tables iptable_raw iptable_mangle ipt_MASQUERADE >> iptable_nat nf_nat bridge stp appletalk psnap llc nfsd lockd nfs_acl >> auth_rpcgss exportfs sunrpc hwmon_vid coretemp acpi_cpufreq sit tunnel4 >> ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK >> nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm >> snd_hda_codec_analog snd_ens1371 gameport snd_hda_intel snd_hda_codec >> snd_rawmidi snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_seq_device >> snd_pcm snd_timer gspca_spca505 snd gspca_main soundcore snd_page_alloc >> videodev firewire_ohci sky2 v4l1_compat v4l2_compat_ioctl32 >> firewire_core iTCO_wdt wmi asus_atk0110 iTCO_vendor_support i2c_i801 >> hwmon pcspkr crc_itu_t raid456 async_raid6_recov async_pq raid6_pq >> async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi >> pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit >> cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: ip6_tables] >> Dec 23 17:33:06 mail kernel: Pid: 0, comm: swapper Tainted: G W >> 2.6.32-09281-g1c42cda #6 >> Dec 23 17:33:06 mail kernel: Call Trace: >> Dec 23 17:33:06 mail kernel:<IRQ> [<ffffffff8104afbd>] >> warn_slowpath_common+0x7c/0x94 >> Dec 23 17:33:06 mail kernel: [<ffffffff8104b02c>] >> warn_slowpath_fmt+0x41/0x43 >> Dec 23 17:33:06 mail kernel: [<ffffffff813d096f>] ? netif_tx_lock+0x44/0x6c >> Dec 23 17:33:06 mail kernel: [<ffffffff813d0ad7>] dev_watchdog+0xf3/0x161 >> Dec 23 17:33:06 mail kernel: [<ffffffff8105610c>] ? cascade+0x6a/0x84 >> Dec 23 17:33:06 mail kernel: [<ffffffff81058558>] >> run_timer_softirq+0x1c8/0x270 >> Dec 23 17:33:06 mail kernel: [<ffffffff81051595>] __do_softirq+0xf8/0x1cd >> Dec 23 17:33:06 mail kernel: [<ffffffff810720af>] ? >> tick_program_event+0x2a/0x2c >> Dec 23 17:33:06 mail kernel: [<ffffffff8100ab1c>] call_softirq+0x1c/0x30 >> Dec 23 17:33:06 mail kernel: [<ffffffff8100c2b3>] do_softirq+0x4b/0xa3 >> Dec 23 17:33:06 mail kernel: [<ffffffff81051181>] irq_exit+0x4a/0x8c >> Dec 23 17:33:06 mail kernel: [<ffffffff8145ed99>] >> smp_apic_timer_interrupt+0x86/0x94 >> Dec 23 17:33:06 mail kernel: [<ffffffff8100a5d3>] >> apic_timer_interrupt+0x13/0x20 >> Dec 23 17:33:06 mail kernel:<EOI> [<ffffffff812ae5fa>] ? >> acpi_idle_enter_bm+0x256/0x28a >> Dec 23 17:33:06 mail kernel: [<ffffffff812ae5f3>] ? >> acpi_idle_enter_bm+0x24f/0x28a >> Dec 23 17:33:06 mail kernel: [<ffffffff81393308>] >> cpuidle_idle_call+0x9e/0xfa >> Dec 23 17:33:06 mail kernel: [<ffffffff81008c05>] cpu_idle+0xb4/0xf6 >> Dec 23 17:33:06 mail kernel: [<ffffffff814532b2>] >> start_secondary+0x201/0x242 >> Dec 23 17:33:06 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- >> Dec 23 17:33:06 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:33:06 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:33:06 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:33:06 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:33:11 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:33:11 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:33:26 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:33:26 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:33:26 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:33:26 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:29 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:29 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:29 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:33:29 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:33:30 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:38 mail named[4284]: error (connection refused) resolving >> '121.93.27.216.blackholes.mail-abuse.org/A/IN': 168.61.5.195#53 >> Dec 23 17:33:44 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:33:44 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:33:44 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:33:44 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:33:49 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:33:49 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:33:52 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0 >> Dec 23 17:33:52 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via >> eth0 >> Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] >> lib/util_sock.c:539(read_fd_with_timeout) >> Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] >> lib/util_sock.c:1491(get_peer_addr_internal) >> Dec 23 17:34:01 mail smbd[24783]: getpeername failed. Error was >> Transport endpoint is not connected >> Dec 23 17:34:01 mail smbd[24783]: read_fd_with_timeout: client 0.0.0.0 >> read error = Connection reset by peer. >> Dec 23 17:34:04 mail abrt: Kerneloops: Reported 1 kernel oopses to Abrt >> Dec 23 17:34:04 mail abrtd: Directory 'kerneloops-1261607644-1' creation >> detected >> Dec 23 17:34:04 mail abrtd: Getting local universal unique identification >> Dec 23 17:34:04 mail abrtd: New crash, saving >> Dec 23 17:34:04 mail abrtd: Activation of plugin 'RunApp' was not >> successful: Plugin 'RunApp' is not registered >> Dec 23 17:34:04 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:34:04 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:34:04 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:34:04 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:34:09 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:34:09 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:34:11 mail logger: ping of potter failed >> Dec 23 17:34:19 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:34:19 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:34:19 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:34:19 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:34:22 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:34:22 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:37:40 mail kernel: imklog 4.4.2, log source = /proc/kmsg started. >> >> Oops from 12/5 (no vt-d; no intel_iommu, firmware .12) >> >> Dec 5 11:27:41 mail kernel: ------------[ cut here ]------------ >> Dec 5 11:27:41 mail kernel: WARNING: at lib/dma-debug.c:898 >> check_sync+0xbd/0x426() >> Dec 5 11:27:41 mail kernel: Hardware name: System Product Name >> Dec 5 11:27:41 mail kernel: sky2 0000:06:00.0: DMA-API: device driver >> tries to sync DMA memory it has not allocated [device >> address=0x000000031a57e022] [size=66 bytes] >> Dec 5 11:27:41 mail kernel: Modules linked in: bridge stp appletalk >> psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc hwmon_vid >> coretemp acpi_cpufreq sit tunnel4 ipt_LOG ipt_MASQUERADE iptable_nat >> nf_nat iptable_mangle iptable_raw nf_conntrack_netbios_ns >> nf_conntrack_ftp nf_conntrack_ipv6 xt_multiport ip6table_filter xt_DSCP >> xt_dscp xt_MARK ip6table_mangle ip6_tables ipv6 dm_multipath kvm_intel >> kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec >> snd_hda_intel snd_hda_codec ac97_bus gspca_spca505 snd_hwdep snd_seq >> snd_seq_device firewire_ohci gspca_main snd_pcm videodev firewire_core >> sky2 snd_timer v4l1_compat wmi snd v4l2_compat_ioctl32 soundcore >> crc_itu_t snd_page_alloc asus_atk0110 i2c_i801 iTCO_wdt >> iTCO_vendor_support pcspkr hwmon raid456 async_raid6_recov async_pq >> raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi >> pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit >> cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: scsi_wait_scan] >> Dec 5 11:27:41 mail kernel: Pid: 0, comm: swapper Not tainted >> 2.6.32-00573-g6cf4d67 #2 >> Dec 5 11:27:41 mail kernel: Call Trace: >> Dec 5 11:27:41 mail kernel:<IRQ> [<ffffffff8105365a>] >> warn_slowpath_common+0x7c/0x94 >> Dec 5 11:27:41 mail kernel: [<ffffffff810536c9>] >> warn_slowpath_fmt+0x41/0x43 >> Dec 5 11:27:41 mail kernel: [<ffffffff8127b1fd>] check_sync+0xbd/0x426 >> Dec 5 11:27:41 mail kernel: [<ffffffff813ca254>] ? >> __netdev_alloc_skb+0x34/0x50 >> Dec 5 11:27:41 mail kernel: [<ffffffff8127b8b9>] >> debug_dma_sync_single_for_cpu+0x42/0x44 >> Dec 5 11:27:41 mail kernel: [<ffffffff81278c5f>] ? >> swiotlb_sync_single+0x2a/0xb6 >> Dec 5 11:27:41 mail kernel: [<ffffffff81278dbb>] ? >> swiotlb_sync_single_for_cpu+0xc/0xe >> Dec 5 11:27:41 mail kernel: [<ffffffffa01cee83>] sky2_poll+0x4c6/0xae1 >> [sky2] >> Dec 5 11:27:41 mail kernel: [<ffffffff8146bb1a>] ? >> _spin_unlock_irqrestore+0x29/0x41 >> Dec 5 11:27:41 mail kernel: [<ffffffff813d0ef2>] net_rx_action+0xb5/0x1f3 >> Dec 5 11:27:41 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd >> Dec 5 11:27:41 mail kernel: [<ffffffff810a2dea>] ? >> handle_IRQ_event+0x119/0x12b >> Dec 5 11:27:41 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 >> Dec 5 11:27:41 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 >> Dec 5 11:27:41 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c >> Dec 5 11:27:41 mail kernel: [<ffffffff8146fb75>] do_IRQ+0xa5/0xbc >> Dec 5 11:27:41 mail kernel: [<ffffffff81012613>] ret_from_intr+0x0/0x16 >> Dec 5 11:27:41 mail kernel:<EOI> [<ffffffff812c6b3e>] ? >> acpi_idle_enter_bm+0x256/0x28a >> Dec 5 11:27:41 mail kernel: [<ffffffff812c6b37>] ? >> acpi_idle_enter_bm+0x24f/0x28a >> Dec 5 11:27:41 mail kernel: [<ffffffff813a6280>] ? >> cpuidle_idle_call+0x9e/0xfa >> Dec 5 11:27:41 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 >> Dec 5 11:27:41 mail kernel: [<ffffffff814651f2>] ? >> start_secondary+0x201/0x242 >> Dec 5 11:27:41 mail kernel: ---[ end trace ff8485ac82013a8d ]--- >> On 12/22/2009 01:50 PM, Michael Breuer wrote: >> >>> I updated my firmware (Marvell/Yukon)) from .12 to .14. Going to leave >>> my kernel version unchanged for the next week or so and see if this >>> recurs. >>> On 12/21/2009 7:09 PM, Michael Breuer wrote: >>> >>>> Just an FYI - I got one of these today as well - also 2.6.32 (last >>>> git before 2.6.33 rc1). I also had two similar events previously - >>>> one 2.6.31, and one 2.6.32 rc7. >>>> >>>> I was doing some research today and came across a fair sampling of >>>> error reports all suggesting that new firmware is a solution. >>>> Gigabyte (not my MB) has firmware posted, but I haven't yet attempted >>>> to apply it (nor am I sure how to test it after doing so). >>>> >>>> For what it's worth, my three recent events were heavy smb usage - >>>> windows system backing up. Today's event did not generate an oops as >>>> my watchdog kicked in and rebooted the system after the kernel >>>> reported a sky2 error. >>>> >>>> The last messages before watchdog: >>>> Dec 20 05:33:58 mail kernel: DRHD: handling fault status reg 2 >>>> Dec 20 05:33:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] >>>> fault addr ffe1a7ffe000 >>>> Dec 20 05:33:58 mail kernel: DMAR:[fault reason 06] PTE Read access >>>> is not set >>>> Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x80000000 >>>> Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: PCI hardware error >>>> (0x2010) >>>> >>>> During the previous event I had vt-d disabled - it was enabled this >>>> time. >>>> >>>> >>>> On 12/21/2009 6:52 PM, Berck E. Nash wrote: >>>> >>>>> Since 2.6.32, I've been getting kernel panics under heavy network load >>>>> (bittorrent usage). >>>>> >>>>> [ 8673.345873] sky2 eth0: receiver hang detected >>>>> [ 8673.350368] sky2 eth0: disabling interface >>>>> [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at >>>>> 0000000000000010 >>>>> [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 >>>>> [sky2] >>>>> [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 >>>>> [ 8673.373838] Oops: 0002 [#1] SMP >>>>> [ 8673.373838] last sysfs file: >>>>> /sys/devices/platform/coretemp.1/temp1_label >>>>> [ 8673.383900] CPU 0 >>>>> [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl >>>>> auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer >>>>> hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 >>>>> snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec >>>>> snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss >>>>> snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common >>>>> snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd >>>>> v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 >>>>> usbcore tveeprom >>>>> [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 >>>>> #3 P5W DH Deluxe >>>>> [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] >>>>> sky2_xmit_frame+0x321/0x5d8 [sky2] >>>>> [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 >>>>> [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: >>>>> 0000000000000000 >>>>> [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: >>>>> 0000000000000000 >>>>> [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: >>>>> 0000000000000000 >>>>> [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: >>>>> ffff88009bf3ab48 >>>>> [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: >>>>> 00000000955fa87e >>>>> [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) >>>>> knlGS:0000000000000000 >>>>> [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b >>>>> [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: >>>>> 00000000000026f0 >>>>> [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >>>>> 0000000000000000 >>>>> [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: >>>>> 0000000000000400 >>>>> [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo >>>>> ffff8800bf8b4000, >>>>> task ffff8800bf8a8650) >>>>> [ 8673.539212] Stack: >>>>> [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 >>>>> 000000428122a659 >>>>> [ 8673.548658]<0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 >>>>> ffffffff813eaf30 >>>>> [ 8673.556929]<0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 >>>>> ffffffff81231de8 >>>>> [ 8673.565432] Call Trace: >>>>> [ 8673.565432]<IRQ> >>>>> [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >>>>> [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >>>>> [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >>>>> [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >>>>> [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >>>>> [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >>>>> [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >>>>> [ 8673.609833]<EOI> >>>>> [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b >>>>> [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >>>>> [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >>>>> [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 >>>>> [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >>>>> [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >>>>> [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >>>>> [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 >>>>> c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 >>>>> 4b 20<4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b >>>>> [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 >>>>> [sky2] >>>>> [ 8673.674902] RSP<ffff880001603df8> >>>>> [ 8673.674902] CR2: 0000000000000010 >>>>> [ 8673.682956] hda-intel: IRQ timing workaround is activated for card >>>>> #0. Suggest a bigger bdl_pos_adj. >>>>> [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- >>>>> [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt >>>>> [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D >>>>> 2.6.32.1 #3 >>>>> [ 8673.683021] Call Trace: >>>>> [ 8673.683023]<IRQ> [<ffffffff8129839d>] panic+0x75/0x11c >>>>> [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e >>>>> [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd >>>>> [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 >>>>> [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 >>>>> [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 >>>>> [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 >>>>> [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 >>>>> [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 >>>>> [sky2] >>>>> [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 >>>>> [sky2] >>>>> [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >>>>> [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >>>>> [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >>>>> [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >>>>> [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >>>>> [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >>>>> [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >>>>> [ 8673.683079]<EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b >>>>> [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >>>>> [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >>>>> [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 >>>>> [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >>>>> [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >>>>> [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >>>>> >>>>> Note that sometimes I get the receiver hang and things continue without >>>>> the crash: >>>>> >>>>> [ 1219.346571] sky2 eth0: receiver hang detected >>>>> [ 1219.351145] sky2 eth0: disabling interface >>>>> [ 1219.357389] sky2 eth0: enabling interface >>>>> [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow >>>>> control both >>>>> >>>>> Please CC as I'm not subscribed. >>>>> >>>>> Thanks, >>>>> Berck Nash >>>>> -- >>>>> To unsubscribe from this list >>>>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-kernel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> Please read the FAQ at http://www.tux.org/lkml/ >>> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ >> ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-24 19:18 ` Michael Breuer @ 2009-12-24 22:27 ` Stephen Hemminger 0 siblings, 0 replies; 145+ messages in thread From: Stephen Hemminger @ 2009-12-24 22:27 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On Thu, 24 Dec 2009 14:18:12 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > I'll be the first to admit I don't know much about network drivers... > but I am wondering why __free_page(frag->page); is called in > skb_put_frags when the fragment length is zero. I don't see why that > page pointer wouldn't potentially be reused afterwords. When using Jumbo frames, the sky2 driver allocates pages to hold the larger part of the frame. With 9K MTU it will allocate 2 * 4K pages and then 1K for start of frame. Then if frame is <= 1K it will drop the extra pages. Not sure why this broke now, I am not a VM expert. -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-24 7:01 ` Andrew Morton 2009-12-24 19:18 ` Michael Breuer @ 2009-12-25 16:28 ` Michael Breuer 2009-12-25 23:22 ` Stephen Hemminger 2009-12-29 19:15 ` sky2 panic in 2.6.32.1 under load (new oops) Jarek Poplawski 2009-12-30 8:07 ` Stephen Hemminger 2 siblings, 2 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-25 16:28 UTC (permalink / raw) To: Andrew Morton; +Cc: Berck E. Nash, linux-kernel, netdev, Stephen Hemminger More data points - I'm able to reliably recreate this now. While I thought it was coincidence, each and every time I hit this issue there is a DHCP renew event immediately before the first error. The crash occurs while under load - in my case seems that the traffic is actually IPV6 (hadn't noticed that before). I ran nethogs on a remote display - the reported rx rate on the IPV6 smb connection at the time of the lockup was 33889.688 KB/sec on a 1gbit nic. I've got two events like this - don't recall if the earlier one was the exact same # - but it was in the ballpark. On 12/24/2009 2:01 AM, Andrew Morton wrote: > cc's added again. > > On Wed, 23 Dec 2009 17:54:27 -0500 Michael Breuer<mbreuer@majjas.com> wrote: > > >> Ok - not the firmware. Ran another Windows backup and sky2 went down. >> >> Nothing in dmesg.old - have oops in syslog. System became unresponsive >> and watchdog kicked in after a minute. >> >> Also note that I have a similar oops with VT-D disabled (posted here on >> 12/5). I'm attaching the oops from that below this oops for comparison. >> That also happened under similar load. >> >> On the assumption that I can recreate this (although it takes a while) >> please let me know how I can help. >> >> What's in my log (starting with an smbd error about 2 min before the >> oops (note: the dchpd is not the system doing the backup). >> > This (nastily wordwrapped) oops appers to be quite different from > Berck's one. > > > > >> Dec 23 17:31:05 mail smbd[20712]: [2009/12/23 17:31:05, 0] >> smbd/nttrans.c:2076(call_nt_transact_ioctl) >> Dec 23 17:31:05 mail smbd[20712]: call_nt_transact_ioctl(0x900eb): >> Currently not implemented. >> Dec 23 17:31:34 mail clamd[3449]: SelfCheck: Database status OK. >> Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:32:22 mail named[4284]: client 208.78.169.235#32797: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:32:54 mail dhcpd: DHCPINFORM from 10.0.0.3 via eth0 >> Dec 23 17:32:54 mail dhcpd: DHCPACK to 10.0.0.3 (00:21:5d:cb:a1:74) via eth0 >> Dec 23 17:32:58 mail kernel: DRHD: handling fault status reg 2 >> Dec 23 17:32:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] >> fault addr fff8423fe000 >> Dec 23 17:32:58 mail kernel: DMAR:[fault reason 06] PTE Read access is >> not set >> Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x80000000 >> Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) >> Dec 23 17:33:06 mail kernel: ------------[ cut here ]------------ >> Dec 23 17:33:06 mail kernel: WARNING: at net/sched/sch_generic.c:255 >> dev_watchdog+0xf3/0x161() >> Dec 23 17:33:06 mail kernel: Hardware name: System Product Name >> Dec 23 17:33:06 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit >> queue 0 timed out >> Dec 23 17:33:06 mail kernel: Modules linked in: ip6table_filter >> ip6table_mangle ip6_tables iptable_raw iptable_mangle ipt_MASQUERADE >> iptable_nat nf_nat bridge stp appletalk psnap llc nfsd lockd nfs_acl >> auth_rpcgss exportfs sunrpc hwmon_vid coretemp acpi_cpufreq sit tunnel4 >> ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK >> nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm >> snd_hda_codec_analog snd_ens1371 gameport snd_hda_intel snd_hda_codec >> snd_rawmidi snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_seq_device >> snd_pcm snd_timer gspca_spca505 snd gspca_main soundcore snd_page_alloc >> videodev firewire_ohci sky2 v4l1_compat v4l2_compat_ioctl32 >> firewire_core iTCO_wdt wmi asus_atk0110 iTCO_vendor_support i2c_i801 >> hwmon pcspkr crc_itu_t raid456 async_raid6_recov async_pq raid6_pq >> async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi >> pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit >> cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: ip6_tables] >> Dec 23 17:33:06 mail kernel: Pid: 0, comm: swapper Tainted: G W >> 2.6.32-09281-g1c42cda #6 >> Dec 23 17:33:06 mail kernel: Call Trace: >> Dec 23 17:33:06 mail kernel:<IRQ> [<ffffffff8104afbd>] >> warn_slowpath_common+0x7c/0x94 >> Dec 23 17:33:06 mail kernel: [<ffffffff8104b02c>] >> warn_slowpath_fmt+0x41/0x43 >> Dec 23 17:33:06 mail kernel: [<ffffffff813d096f>] ? netif_tx_lock+0x44/0x6c >> Dec 23 17:33:06 mail kernel: [<ffffffff813d0ad7>] dev_watchdog+0xf3/0x161 >> Dec 23 17:33:06 mail kernel: [<ffffffff8105610c>] ? cascade+0x6a/0x84 >> Dec 23 17:33:06 mail kernel: [<ffffffff81058558>] >> run_timer_softirq+0x1c8/0x270 >> Dec 23 17:33:06 mail kernel: [<ffffffff81051595>] __do_softirq+0xf8/0x1cd >> Dec 23 17:33:06 mail kernel: [<ffffffff810720af>] ? >> tick_program_event+0x2a/0x2c >> Dec 23 17:33:06 mail kernel: [<ffffffff8100ab1c>] call_softirq+0x1c/0x30 >> Dec 23 17:33:06 mail kernel: [<ffffffff8100c2b3>] do_softirq+0x4b/0xa3 >> Dec 23 17:33:06 mail kernel: [<ffffffff81051181>] irq_exit+0x4a/0x8c >> Dec 23 17:33:06 mail kernel: [<ffffffff8145ed99>] >> smp_apic_timer_interrupt+0x86/0x94 >> Dec 23 17:33:06 mail kernel: [<ffffffff8100a5d3>] >> apic_timer_interrupt+0x13/0x20 >> Dec 23 17:33:06 mail kernel:<EOI> [<ffffffff812ae5fa>] ? >> acpi_idle_enter_bm+0x256/0x28a >> Dec 23 17:33:06 mail kernel: [<ffffffff812ae5f3>] ? >> acpi_idle_enter_bm+0x24f/0x28a >> Dec 23 17:33:06 mail kernel: [<ffffffff81393308>] >> cpuidle_idle_call+0x9e/0xfa >> Dec 23 17:33:06 mail kernel: [<ffffffff81008c05>] cpu_idle+0xb4/0xf6 >> Dec 23 17:33:06 mail kernel: [<ffffffff814532b2>] >> start_secondary+0x201/0x242 >> Dec 23 17:33:06 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- >> Dec 23 17:33:06 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:33:06 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:33:06 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:33:06 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:33:11 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:33:11 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:33:26 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:33:26 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:33:26 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:33:26 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:28 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:29 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:29 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:29 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:33:29 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:33:30 mail named[4284]: client 114.80.99.2#44695: view >> External: query (cache) './NS/IN' denied >> Dec 23 17:33:38 mail named[4284]: error (connection refused) resolving >> '121.93.27.216.blackholes.mail-abuse.org/A/IN': 168.61.5.195#53 >> Dec 23 17:33:44 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:33:44 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:33:44 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:33:44 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:33:49 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:33:49 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:33:52 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0 >> Dec 23 17:33:52 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via >> eth0 >> Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] >> lib/util_sock.c:539(read_fd_with_timeout) >> Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] >> lib/util_sock.c:1491(get_peer_addr_internal) >> Dec 23 17:34:01 mail smbd[24783]: getpeername failed. Error was >> Transport endpoint is not connected >> Dec 23 17:34:01 mail smbd[24783]: read_fd_with_timeout: client 0.0.0.0 >> read error = Connection reset by peer. >> Dec 23 17:34:04 mail abrt: Kerneloops: Reported 1 kernel oopses to Abrt >> Dec 23 17:34:04 mail abrtd: Directory 'kerneloops-1261607644-1' creation >> detected >> Dec 23 17:34:04 mail abrtd: Getting local universal unique identification >> Dec 23 17:34:04 mail abrtd: New crash, saving >> Dec 23 17:34:04 mail abrtd: Activation of plugin 'RunApp' was not >> successful: Plugin 'RunApp' is not registered >> Dec 23 17:34:04 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:34:04 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:34:04 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:34:04 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:34:09 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:34:09 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:34:11 mail logger: ping of potter failed >> Dec 23 17:34:19 mail kernel: sky2 eth0: tx timeout >> Dec 23 17:34:19 mail kernel: sky2 eth0: disabling interface >> Dec 23 17:34:19 mail kernel: sky2 eth0: enabling interface >> Dec 23 17:34:19 mail NetworkManager:<info> (eth0): carrier now OFF >> (device state 1) >> Dec 23 17:34:22 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >> duplex, flow control both >> Dec 23 17:34:22 mail NetworkManager:<info> (eth0): carrier now ON >> (device state 1) >> Dec 23 17:37:40 mail kernel: imklog 4.4.2, log source = /proc/kmsg started. >> >> Oops from 12/5 (no vt-d; no intel_iommu, firmware .12) >> >> Dec 5 11:27:41 mail kernel: ------------[ cut here ]------------ >> Dec 5 11:27:41 mail kernel: WARNING: at lib/dma-debug.c:898 >> check_sync+0xbd/0x426() >> Dec 5 11:27:41 mail kernel: Hardware name: System Product Name >> Dec 5 11:27:41 mail kernel: sky2 0000:06:00.0: DMA-API: device driver >> tries to sync DMA memory it has not allocated [device >> address=0x000000031a57e022] [size=66 bytes] >> Dec 5 11:27:41 mail kernel: Modules linked in: bridge stp appletalk >> psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc hwmon_vid >> coretemp acpi_cpufreq sit tunnel4 ipt_LOG ipt_MASQUERADE iptable_nat >> nf_nat iptable_mangle iptable_raw nf_conntrack_netbios_ns >> nf_conntrack_ftp nf_conntrack_ipv6 xt_multiport ip6table_filter xt_DSCP >> xt_dscp xt_MARK ip6table_mangle ip6_tables ipv6 dm_multipath kvm_intel >> kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec >> snd_hda_intel snd_hda_codec ac97_bus gspca_spca505 snd_hwdep snd_seq >> snd_seq_device firewire_ohci gspca_main snd_pcm videodev firewire_core >> sky2 snd_timer v4l1_compat wmi snd v4l2_compat_ioctl32 soundcore >> crc_itu_t snd_page_alloc asus_atk0110 i2c_i801 iTCO_wdt >> iTCO_vendor_support pcspkr hwmon raid456 async_raid6_recov async_pq >> raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi >> pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit >> cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: scsi_wait_scan] >> Dec 5 11:27:41 mail kernel: Pid: 0, comm: swapper Not tainted >> 2.6.32-00573-g6cf4d67 #2 >> Dec 5 11:27:41 mail kernel: Call Trace: >> Dec 5 11:27:41 mail kernel:<IRQ> [<ffffffff8105365a>] >> warn_slowpath_common+0x7c/0x94 >> Dec 5 11:27:41 mail kernel: [<ffffffff810536c9>] >> warn_slowpath_fmt+0x41/0x43 >> Dec 5 11:27:41 mail kernel: [<ffffffff8127b1fd>] check_sync+0xbd/0x426 >> Dec 5 11:27:41 mail kernel: [<ffffffff813ca254>] ? >> __netdev_alloc_skb+0x34/0x50 >> Dec 5 11:27:41 mail kernel: [<ffffffff8127b8b9>] >> debug_dma_sync_single_for_cpu+0x42/0x44 >> Dec 5 11:27:41 mail kernel: [<ffffffff81278c5f>] ? >> swiotlb_sync_single+0x2a/0xb6 >> Dec 5 11:27:41 mail kernel: [<ffffffff81278dbb>] ? >> swiotlb_sync_single_for_cpu+0xc/0xe >> Dec 5 11:27:41 mail kernel: [<ffffffffa01cee83>] sky2_poll+0x4c6/0xae1 >> [sky2] >> Dec 5 11:27:41 mail kernel: [<ffffffff8146bb1a>] ? >> _spin_unlock_irqrestore+0x29/0x41 >> Dec 5 11:27:41 mail kernel: [<ffffffff813d0ef2>] net_rx_action+0xb5/0x1f3 >> Dec 5 11:27:41 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd >> Dec 5 11:27:41 mail kernel: [<ffffffff810a2dea>] ? >> handle_IRQ_event+0x119/0x12b >> Dec 5 11:27:41 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 >> Dec 5 11:27:41 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 >> Dec 5 11:27:41 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c >> Dec 5 11:27:41 mail kernel: [<ffffffff8146fb75>] do_IRQ+0xa5/0xbc >> Dec 5 11:27:41 mail kernel: [<ffffffff81012613>] ret_from_intr+0x0/0x16 >> Dec 5 11:27:41 mail kernel:<EOI> [<ffffffff812c6b3e>] ? >> acpi_idle_enter_bm+0x256/0x28a >> Dec 5 11:27:41 mail kernel: [<ffffffff812c6b37>] ? >> acpi_idle_enter_bm+0x24f/0x28a >> Dec 5 11:27:41 mail kernel: [<ffffffff813a6280>] ? >> cpuidle_idle_call+0x9e/0xfa >> Dec 5 11:27:41 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 >> Dec 5 11:27:41 mail kernel: [<ffffffff814651f2>] ? >> start_secondary+0x201/0x242 >> Dec 5 11:27:41 mail kernel: ---[ end trace ff8485ac82013a8d ]--- >> On 12/22/2009 01:50 PM, Michael Breuer wrote: >> >>> I updated my firmware (Marvell/Yukon)) from .12 to .14. Going to leave >>> my kernel version unchanged for the next week or so and see if this >>> recurs. >>> On 12/21/2009 7:09 PM, Michael Breuer wrote: >>> >>>> Just an FYI - I got one of these today as well - also 2.6.32 (last >>>> git before 2.6.33 rc1). I also had two similar events previously - >>>> one 2.6.31, and one 2.6.32 rc7. >>>> >>>> I was doing some research today and came across a fair sampling of >>>> error reports all suggesting that new firmware is a solution. >>>> Gigabyte (not my MB) has firmware posted, but I haven't yet attempted >>>> to apply it (nor am I sure how to test it after doing so). >>>> >>>> For what it's worth, my three recent events were heavy smb usage - >>>> windows system backing up. Today's event did not generate an oops as >>>> my watchdog kicked in and rebooted the system after the kernel >>>> reported a sky2 error. >>>> >>>> The last messages before watchdog: >>>> Dec 20 05:33:58 mail kernel: DRHD: handling fault status reg 2 >>>> Dec 20 05:33:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] >>>> fault addr ffe1a7ffe000 >>>> Dec 20 05:33:58 mail kernel: DMAR:[fault reason 06] PTE Read access >>>> is not set >>>> Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x80000000 >>>> Dec 20 05:33:58 mail kernel: sky2 0000:06:00.0: PCI hardware error >>>> (0x2010) >>>> >>>> During the previous event I had vt-d disabled - it was enabled this >>>> time. >>>> >>>> >>>> On 12/21/2009 6:52 PM, Berck E. Nash wrote: >>>> >>>>> Since 2.6.32, I've been getting kernel panics under heavy network load >>>>> (bittorrent usage). >>>>> >>>>> [ 8673.345873] sky2 eth0: receiver hang detected >>>>> [ 8673.350368] sky2 eth0: disabling interface >>>>> [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at >>>>> 0000000000000010 >>>>> [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 >>>>> [sky2] >>>>> [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 >>>>> [ 8673.373838] Oops: 0002 [#1] SMP >>>>> [ 8673.373838] last sysfs file: >>>>> /sys/devices/platform/coretemp.1/temp1_label >>>>> [ 8673.383900] CPU 0 >>>>> [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl >>>>> auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer >>>>> hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 >>>>> snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec >>>>> snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss >>>>> snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common >>>>> snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd >>>>> v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 >>>>> usbcore tveeprom >>>>> [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 >>>>> #3 P5W DH Deluxe >>>>> [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] >>>>> sky2_xmit_frame+0x321/0x5d8 [sky2] >>>>> [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 >>>>> [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: >>>>> 0000000000000000 >>>>> [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: >>>>> 0000000000000000 >>>>> [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: >>>>> 0000000000000000 >>>>> [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: >>>>> ffff88009bf3ab48 >>>>> [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: >>>>> 00000000955fa87e >>>>> [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) >>>>> knlGS:0000000000000000 >>>>> [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b >>>>> [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: >>>>> 00000000000026f0 >>>>> [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >>>>> 0000000000000000 >>>>> [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: >>>>> 0000000000000400 >>>>> [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo >>>>> ffff8800bf8b4000, >>>>> task ffff8800bf8a8650) >>>>> [ 8673.539212] Stack: >>>>> [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 >>>>> 000000428122a659 >>>>> [ 8673.548658]<0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 >>>>> ffffffff813eaf30 >>>>> [ 8673.556929]<0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 >>>>> ffffffff81231de8 >>>>> [ 8673.565432] Call Trace: >>>>> [ 8673.565432]<IRQ> >>>>> [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >>>>> [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >>>>> [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >>>>> [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >>>>> [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >>>>> [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >>>>> [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >>>>> [ 8673.609833]<EOI> >>>>> [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b >>>>> [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >>>>> [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >>>>> [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 >>>>> [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >>>>> [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >>>>> [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >>>>> [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 >>>>> c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 >>>>> 4b 20<4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b >>>>> [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 >>>>> [sky2] >>>>> [ 8673.674902] RSP<ffff880001603df8> >>>>> [ 8673.674902] CR2: 0000000000000010 >>>>> [ 8673.682956] hda-intel: IRQ timing workaround is activated for card >>>>> #0. Suggest a bigger bdl_pos_adj. >>>>> [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- >>>>> [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt >>>>> [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D >>>>> 2.6.32.1 #3 >>>>> [ 8673.683021] Call Trace: >>>>> [ 8673.683023]<IRQ> [<ffffffff8129839d>] panic+0x75/0x11c >>>>> [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e >>>>> [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd >>>>> [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 >>>>> [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 >>>>> [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 >>>>> [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 >>>>> [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 >>>>> [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 >>>>> [sky2] >>>>> [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 >>>>> [sky2] >>>>> [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >>>>> [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >>>>> [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >>>>> [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >>>>> [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >>>>> [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >>>>> [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >>>>> [ 8673.683079]<EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b >>>>> [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >>>>> [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >>>>> [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 >>>>> [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >>>>> [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >>>>> [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >>>>> >>>>> Note that sometimes I get the receiver hang and things continue without >>>>> the crash: >>>>> >>>>> [ 1219.346571] sky2 eth0: receiver hang detected >>>>> [ 1219.351145] sky2 eth0: disabling interface >>>>> [ 1219.357389] sky2 eth0: enabling interface >>>>> [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow >>>>> control both >>>>> >>>>> Please CC as I'm not subscribed. >>>>> >>>>> Thanks, >>>>> Berck Nash >>>>> -- >>>>> To unsubscribe from this list >>>>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-kernel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> Please read the FAQ at http://www.tux.org/lkml/ >>> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ >> ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-25 16:28 ` Michael Breuer @ 2009-12-25 23:22 ` Stephen Hemminger 2009-12-26 3:23 ` Michael Breuer 2009-12-29 19:15 ` sky2 panic in 2.6.32.1 under load (new oops) Jarek Poplawski 1 sibling, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2009-12-25 23:22 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On Fri, 25 Dec 2009 11:28:55 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > More data points - I'm able to reliably recreate this now. > While I thought it was coincidence, each and every time I hit this issue > there is a DHCP renew event immediately before the first error. > The crash occurs while under load - in my case seems that the traffic is > actually IPV6 (hadn't noticed that before). > I ran nethogs on a remote display - the reported rx rate on the IPV6 smb > connection at the time of the lockup was 33889.688 KB/sec on a 1gbit > nic. I've got two events like this - don't recall if the earlier one was > the exact same # - but it was in the ballpark. > > On 12/24/2009 2:01 AM, Andrew Morton wrote: > > cc's added again. > > > > On Wed, 23 Dec 2009 17:54:27 -0500 Michael Breuer<mbreuer@majjas.com> wrote: > > > > > >> Ok - not the firmware. Ran another Windows backup and sky2 went down. > >> > >> Nothing in dmesg.old - have oops in syslog. System became unresponsive > >> and watchdog kicked in after a minute. > >> > >> Also note that I have a similar oops with VT-D disabled (posted here on > >> 12/5). I'm attaching the oops from that below this oops for comparison. > >> That also happened under similar load. > >> > >> On the assumption that I can recreate this (although it takes a while) > >> please let me know how I can help. > >> > >> What's in my log (starting with an smbd error about 2 min before the > >> oops (note: the dchpd is not the system doing the backup). > >> > > This (nastily wordwrapped) oops appers to be quite different from > > Berck's one. > > > > What is the MTU? > -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-25 23:22 ` Stephen Hemminger @ 2009-12-26 3:23 ` Michael Breuer 2009-12-26 17:57 ` Stephen Hemminger 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-26 3:23 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On 12/25/2009 6:22 PM, Stephen Hemminger wrote: > On Fri, 25 Dec 2009 11:28:55 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> More data points - I'm able to reliably recreate this now. >> While I thought it was coincidence, each and every time I hit this issue >> there is a DHCP renew event immediately before the first error. >> The crash occurs while under load - in my case seems that the traffic is >> actually IPV6 (hadn't noticed that before). >> I ran nethogs on a remote display - the reported rx rate on the IPV6 smb >> connection at the time of the lockup was 33889.688 KB/sec on a 1gbit >> nic. I've got two events like this - don't recall if the earlier one was >> the exact same # - but it was in the ballpark. >> >> On 12/24/2009 2:01 AM, Andrew Morton wrote: >> >>> cc's added again. >>> >>> On Wed, 23 Dec 2009 17:54:27 -0500 Michael Breuer<mbreuer@majjas.com> wrote: >>> >>> >>> >>>> Ok - not the firmware. Ran another Windows backup and sky2 went down. >>>> >>>> Nothing in dmesg.old - have oops in syslog. System became unresponsive >>>> and watchdog kicked in after a minute. >>>> >>>> Also note that I have a similar oops with VT-D disabled (posted here on >>>> 12/5). I'm attaching the oops from that below this oops for comparison. >>>> That also happened under similar load. >>>> >>>> On the assumption that I can recreate this (although it takes a while) >>>> please let me know how I can help. >>>> >>>> What's in my log (starting with an smbd error about 2 min before the >>>> oops (note: the dchpd is not the system doing the backup). >>>> >>>> >>> This (nastily wordwrapped) oops appers to be quite different from >>> Berck's one. >>> >>> >>> > What is the MTU? > 1500 >> > > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-26 3:23 ` Michael Breuer @ 2009-12-26 17:57 ` Stephen Hemminger 2009-12-26 20:37 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2009-12-26 17:57 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On Fri, 25 Dec 2009 22:23:51 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > On 12/25/2009 6:22 PM, Stephen Hemminger wrote: > > On Fri, 25 Dec 2009 11:28:55 -0500 > > Michael Breuer<mbreuer@majjas.com> wrote: > > > > > >> More data points - I'm able to reliably recreate this now. > >> While I thought it was coincidence, each and every time I hit this issue > >> there is a DHCP renew event immediately before the first error. > >> The crash occurs while under load - in my case seems that the traffic is > >> actually IPV6 (hadn't noticed that before). > >> I ran nethogs on a remote display - the reported rx rate on the IPV6 smb > >> connection at the time of the lockup was 33889.688 KB/sec on a 1gbit > >> nic. I've got two events like this - don't recall if the earlier one was > >> the exact same # - but it was in the ballpark. > >> > >> On 12/24/2009 2:01 AM, Andrew Morton wrote: > >> > >>> cc's added again. > >>> > >>> On Wed, 23 Dec 2009 17:54:27 -0500 Michael Breuer<mbreuer@majjas.com> wrote: > >>> > >>> > >>> > >>>> Ok - not the firmware. Ran another Windows backup and sky2 went down. > >>>> > >>>> Nothing in dmesg.old - have oops in syslog. System became unresponsive > >>>> and watchdog kicked in after a minute. > >>>> > >>>> Also note that I have a similar oops with VT-D disabled (posted here on > >>>> 12/5). I'm attaching the oops from that below this oops for comparison. > >>>> That also happened under similar load. > >>>> > >>>> On the assumption that I can recreate this (although it takes a while) > >>>> please let me know how I can help. > >>>> > >>>> What's in my log (starting with an smbd error about 2 min before the > >>>> oops (note: the dchpd is not the system doing the backup). > >>>> > >>>> > >>> This (nastily wordwrapped) oops appers to be quite different from > >>> Berck's one. > >>> > >>> > >>> > > What is the MTU? > > > 1500 > >> It looks like the problem only shows up for packets generated by DHCP, and these come through AF_PACKET. The problem maybe related to how this packets are fragmented into header and page, in a different way than other packets confusing the driver or DMA engine. Does this help? ----- --- a/drivers/net/sky2.c 2009-12-26 09:50:20.869565022 -0800 +++ b/drivers/net/sky2.c 2009-12-26 09:55:54.620645355 -0800 @@ -1616,6 +1616,13 @@ static netdev_tx_t sky2_xmit_frame(struc if (unlikely(tx_avail(sky2) < tx_le_req(skb))) return NETDEV_TX_BUSY; + if (!pskb_may_pull(skb, ETH_HLEN)) { + if (net_ratelimit()) + pr_info(PFX "%s: packet missing ether header (%d)?", + dev->name, skb->len); + goto drop; + } + len = skb_headlen(skb); mapping = pci_map_single(hw->pdev, skb->data, len, PCI_DMA_TODEVICE); @@ -1761,6 +1768,7 @@ mapping_unwind: mapping_error: if (net_ratelimit()) dev_warn(&hw->pdev->dev, "%s: tx mapping error\n", dev->name); +drop: dev_kfree_skb(skb); return NETDEV_TX_OK; } -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-26 17:57 ` Stephen Hemminger @ 2009-12-26 20:37 ` Michael Breuer 2009-12-26 22:05 ` [PATCH] sky2: make sure ethernet header is in transmit skb Stephen Hemminger 2009-12-27 17:03 ` sky2 panic in 2.6.32.1 under load (new oops) Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-26 20:37 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On 12/26/2009 12:57 PM, Stephen Hemminger wrote: > On Fri, 25 Dec 2009 22:23:51 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> On 12/25/2009 6:22 PM, Stephen Hemminger wrote: >> >>> On Fri, 25 Dec 2009 11:28:55 -0500 >>> Michael Breuer<mbreuer@majjas.com> wrote: >>> >>> >>> >>>> More data points - I'm able to reliably recreate this now. >>>> While I thought it was coincidence, each and every time I hit this issue >>>> there is a DHCP renew event immediately before the first error. >>>> The crash occurs while under load - in my case seems that the traffic is >>>> actually IPV6 (hadn't noticed that before). >>>> I ran nethogs on a remote display - the reported rx rate on the IPV6 smb >>>> connection at the time of the lockup was 33889.688 KB/sec on a 1gbit >>>> nic. I've got two events like this - don't recall if the earlier one was >>>> the exact same # - but it was in the ballpark. >>>> >>>> On 12/24/2009 2:01 AM, Andrew Morton wrote: >>>> >>>> >>>>> cc's added again. >>>>> >>>>> On Wed, 23 Dec 2009 17:54:27 -0500 Michael Breuer<mbreuer@majjas.com> wrote: >>>>> >>>>> >>>>> >>>>> >>>>>> Ok - not the firmware. Ran another Windows backup and sky2 went down. >>>>>> >>>>>> Nothing in dmesg.old - have oops in syslog. System became unresponsive >>>>>> and watchdog kicked in after a minute. >>>>>> >>>>>> Also note that I have a similar oops with VT-D disabled (posted here on >>>>>> 12/5). I'm attaching the oops from that below this oops for comparison. >>>>>> That also happened under similar load. >>>>>> >>>>>> On the assumption that I can recreate this (although it takes a while) >>>>>> please let me know how I can help. >>>>>> >>>>>> What's in my log (starting with an smbd error about 2 min before the >>>>>> oops (note: the dchpd is not the system doing the backup). >>>>>> >>>>>> >>>>>> >>>>> This (nastily wordwrapped) oops appers to be quite different from >>>>> Berck's one. >>>>> >>>>> >>>>> >>>>> >>> What is the MTU? >>> >>> >> 1500 >> >>>> >>>> > It looks like the problem only shows up for packets generated by DHCP, > and these come through AF_PACKET. The problem maybe related to how this > packets are fragmented into header and page, in a different way than other > packets confusing the driver or DMA engine. > > Does this help? > ----- > > --- a/drivers/net/sky2.c 2009-12-26 09:50:20.869565022 -0800 > +++ b/drivers/net/sky2.c 2009-12-26 09:55:54.620645355 -0800 > @@ -1616,6 +1616,13 @@ static netdev_tx_t sky2_xmit_frame(struc > if (unlikely(tx_avail(sky2)< tx_le_req(skb))) > return NETDEV_TX_BUSY; > > + if (!pskb_may_pull(skb, ETH_HLEN)) { > + if (net_ratelimit()) > + pr_info(PFX "%s: packet missing ether header (%d)?", > + dev->name, skb->len); > + goto drop; > + } > + > len = skb_headlen(skb); > mapping = pci_map_single(hw->pdev, skb->data, len, PCI_DMA_TODEVICE); > > @@ -1761,6 +1768,7 @@ mapping_unwind: > mapping_error: > if (net_ratelimit()) > dev_warn(&hw->pdev->dev, "%s: tx mapping error\n", dev->name); > +drop: > dev_kfree_skb(skb); > return NETDEV_TX_OK; > } > > > > > That seems to have done the trick! Still one odd message sequence, but no hangs or crashes. The first time I forced a DHCP renew while running at high throughput, I got the same SMB errors I saw in my original error log (pre-crash). This only happened once: Dec 26 15:24:18 mail dhcpd: DHCPACK on 10.0.0.56 to 00:1c:cc:f3:9f:f6 (BLACKBERRY-9542) via eth0 Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] lib/util_sock.c:1564(matchname) Dec 26 15:24:25 mail smbd[8937]: matchname: host name/address mismatch: ::ffff:10.0.0.11 != potter.majjas.com Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] lib/util_sock.c:1685(get_peer_name) Dec 26 15:24:25 mail smbd[8937]: Matchname failed on potter.majjas.com ::ffff:10.0.0.11 Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] smbd/nttrans.c:2076(call_nt_transact_ioctl) Dec 26 15:24:25 mail smbd[8937]: call_nt_transact_ioctl(0x900eb): Currently not implemented. I would discount this, but the same sequence was present in the logs pre-crash as well. I do not see this at all absent the preceding DHCP renew sequence. I also don't see this unless the adapter is under load. ^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH] sky2: make sure ethernet header is in transmit skb 2009-12-26 20:37 ` Michael Breuer @ 2009-12-26 22:05 ` Stephen Hemminger 2009-12-27 3:44 ` David Miller 2009-12-27 17:03 ` sky2 panic in 2.6.32.1 under load (new oops) Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2009-12-26 22:05 UTC (permalink / raw) To: David Miller Cc: Michael Breuer, Andrew Morton, Berck E. Nash, linux-kernel, netdev When DHCP sends a packet using AF_PACKET, it is possible for driver to receive a transmit packet with the Ethernet header off in the skb fragments. The hardware doesn't handle 0 size DMA's well, so it gets confused. Other drivers may have same problem, I really think this ought to be done at higher level. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> --- Patch against net-2.6, but same patch applies to -stable as well (with fuzz). --- a/drivers/net/sky2.c 2009-12-26 13:39:34.289252942 -0800 +++ b/drivers/net/sky2.c 2009-12-26 13:59:44.548877690 -0800 @@ -1654,6 +1654,14 @@ static netdev_tx_t sky2_xmit_frame(struc if (unlikely(tx_avail(sky2) < tx_le_req(skb))) return NETDEV_TX_BUSY; + /* Make sure there is at least some data for first DMA */ + if (!pskb_may_pull(skb, ETH_HLEN)) { + if (net_ratelimit()) + pr_info(PFX "%s: packet missing ether header (%d)?", + dev->name, skb->len); + goto drop; + } + len = skb_headlen(skb); mapping = pci_map_single(hw->pdev, skb->data, len, PCI_DMA_TODEVICE); @@ -1799,6 +1807,7 @@ mapping_unwind: mapping_error: if (net_ratelimit()) dev_warn(&hw->pdev->dev, "%s: tx mapping error\n", dev->name); +drop: dev_kfree_skb(skb); return NETDEV_TX_OK; } ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] sky2: make sure ethernet header is in transmit skb 2009-12-26 22:05 ` [PATCH] sky2: make sure ethernet header is in transmit skb Stephen Hemminger @ 2009-12-27 3:44 ` David Miller 2009-12-27 4:11 ` David Miller 0 siblings, 1 reply; 145+ messages in thread From: David Miller @ 2009-12-27 3:44 UTC (permalink / raw) To: shemminger; +Cc: mbreuer, akpm, flyboy, linux-kernel, netdev From: Stephen Hemminger <shemminger@linux-foundation.org> Date: Sat, 26 Dec 2009 14:05:44 -0800 > Other drivers may have same problem, I really think this ought > to be done at higher level. I tend to agree with you, and I thought we had handled all cases. Let's simply make AF_PACKET linearize the link level header before sending things out to the transmit path. I can work on this if you want. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] sky2: make sure ethernet header is in transmit skb 2009-12-27 3:44 ` David Miller @ 2009-12-27 4:11 ` David Miller 2010-01-04 5:32 ` David Miller 2010-01-05 23:07 ` [PATCH] af_packet: Don't use skb after dev_queue_xmit() Jarek Poplawski 0 siblings, 2 replies; 145+ messages in thread From: David Miller @ 2009-12-27 4:11 UTC (permalink / raw) To: shemminger; +Cc: mbreuer, akpm, flyboy, linux-kernel, netdev From: David Miller <davem@davemloft.net> Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) > From: Stephen Hemminger <shemminger@linux-foundation.org> > Date: Sat, 26 Dec 2009 14:05:44 -0800 > >> Other drivers may have same problem, I really think this ought >> to be done at higher level. > > I tend to agree with you, and I thought we had handled all > cases. Let's simply make AF_PACKET linearize the link > level header before sending things out to the transmit path. > > I can work on this if you want. Actually Stephen, I took a look and I can't see how AF_PACKET can create this situation. It always copies into the linear area of the SKB it allocates for sendmsg() processing. Whether the data comes from sendmsg data or the mmap() ring buffer. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] sky2: make sure ethernet header is in transmit skb 2009-12-27 4:11 ` David Miller @ 2010-01-04 5:32 ` David Miller 2010-01-04 16:40 ` Stephen Hemminger 2010-01-05 23:07 ` [PATCH] af_packet: Don't use skb after dev_queue_xmit() Jarek Poplawski 1 sibling, 1 reply; 145+ messages in thread From: David Miller @ 2010-01-04 5:32 UTC (permalink / raw) To: shemminger; +Cc: mbreuer, akpm, flyboy, linux-kernel, netdev From: David Miller <davem@davemloft.net> Date: Sat, 26 Dec 2009 20:11:07 -0800 (PST) > From: David Miller <davem@davemloft.net> > Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) > >> From: Stephen Hemminger <shemminger@linux-foundation.org> >> Date: Sat, 26 Dec 2009 14:05:44 -0800 >> >>> Other drivers may have same problem, I really think this ought >>> to be done at higher level. >> >> I tend to agree with you, and I thought we had handled all >> cases. Let's simply make AF_PACKET linearize the link >> level header before sending things out to the transmit path. >> >> I can work on this if you want. > > Actually Stephen, I took a look and I can't see how AF_PACKET > can create this situation. > > It always copies into the linear area of the SKB it allocates > for sendmsg() processing. Whether the data comes from sendmsg > data or the mmap() ring buffer. Stephen can you get a backtrace of the code path which triggers this? I want to fix it at a higher level too, but I can't do that until I know where it actually happens. Thanks! ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] sky2: make sure ethernet header is in transmit skb 2010-01-04 5:32 ` David Miller @ 2010-01-04 16:40 ` Stephen Hemminger 2010-01-04 17:02 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2010-01-04 16:40 UTC (permalink / raw) To: David Miller; +Cc: mbreuer, akpm, flyboy, linux-kernel, netdev On Sun, 03 Jan 2010 21:32:58 -0800 (PST) David Miller <davem@davemloft.net> wrote: > From: David Miller <davem@davemloft.net> > Date: Sat, 26 Dec 2009 20:11:07 -0800 (PST) > > > From: David Miller <davem@davemloft.net> > > Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) > > > >> From: Stephen Hemminger <shemminger@linux-foundation.org> > >> Date: Sat, 26 Dec 2009 14:05:44 -0800 > >> > >>> Other drivers may have same problem, I really think this ought > >>> to be done at higher level. > >> > >> I tend to agree with you, and I thought we had handled all > >> cases. Let's simply make AF_PACKET linearize the link > >> level header before sending things out to the transmit path. > >> > >> I can work on this if you want. > > > > Actually Stephen, I took a look and I can't see how AF_PACKET > > can create this situation. > > > > It always copies into the linear area of the SKB it allocates > > for sendmsg() processing. Whether the data comes from sendmsg > > data or the mmap() ring buffer. > > Stephen can you get a backtrace of the code path which triggers > this? I want to fix it at a higher level too, but I can't > do that until I know where it actually happens. Ignore it, the problem is outside the sky2 driver in some other place causing corrupt skb's. I never reproduced this (with added BUG_ON and WARN_ON), only seen by Michael. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] sky2: make sure ethernet header is in transmit skb 2010-01-04 16:40 ` Stephen Hemminger @ 2010-01-04 17:02 ` Michael Breuer 0 siblings, 0 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-04 17:02 UTC (permalink / raw) To: Stephen Hemminger; +Cc: David Miller, akpm, flyboy, linux-kernel, netdev On 1/4/2010 11:40 AM, Stephen Hemminger wrote: > On Sun, 03 Jan 2010 21:32:58 -0800 (PST) > David Miller<davem@davemloft.net> wrote: > > >> From: David Miller<davem@davemloft.net> >> Date: Sat, 26 Dec 2009 20:11:07 -0800 (PST) >> >> >>> From: David Miller<davem@davemloft.net> >>> Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) >>> >>> >>>> From: Stephen Hemminger<shemminger@linux-foundation.org> >>>> Date: Sat, 26 Dec 2009 14:05:44 -0800 >>>> >>>> >>>>> Other drivers may have same problem, I really think this ought >>>>> to be done at higher level. >>>>> >>>> I tend to agree with you, and I thought we had handled all >>>> cases. Let's simply make AF_PACKET linearize the link >>>> level header before sending things out to the transmit path. >>>> >>>> I can work on this if you want. >>>> >>> Actually Stephen, I took a look and I can't see how AF_PACKET >>> can create this situation. >>> >>> It always copies into the linear area of the SKB it allocates >>> for sendmsg() processing. Whether the data comes from sendmsg >>> data or the mmap() ring buffer. >>> >> Stephen can you get a backtrace of the code path which triggers >> this? I want to fix it at a higher level too, but I can't >> do that until I know where it actually happens. >> > Ignore it, the problem is outside the sky2 driver in some other > place causing corrupt skb's. I never reproduced this (with added BUG_ON > and WARN_ON), only seen by Michael. > I've posted several oops with explanations: http://lkml.org/lkml/2009/12/5/60 http://lkml.org/lkml/2009/12/21/268 http://lkml.org/lkml/2009/12/23/316 In a nutshell, my system was hanging (sometimes with a viewable oops, sometimes not (unrelated KMS issues). The hangs (with watchdog reboot) happened when under load and when any attached device sent in a DHCP request/offer. The hang was 100% reproducible when running Microsoft Backup vrom a win7 box via SAMBA onto the affected server. Stephen's patch stopped the hang and oops. Please let me know what I can do to help. Under the same load conditions (but no longer associated with DHCP) I'm now seeing multiple soft interrupt errors coming from sky2. This seems to be a race condition somewhere as it only occurs when a mingetty is run on tty1 prior to generating load on the sky2 driver. It could be a wild goose chase, but I think something is getting corrupted by either devpts or console when mingetty issues a vhangup on pts0. ^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2009-12-27 4:11 ` David Miller 2010-01-04 5:32 ` David Miller @ 2010-01-05 23:07 ` Jarek Poplawski 2010-01-05 23:16 ` Michael Breuer 2010-01-06 2:36 ` Michael Breuer 1 sibling, 2 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-05 23:07 UTC (permalink / raw) To: David Miller; +Cc: shemminger, mbreuer, akpm, flyboy, linux-kernel, netdev David Miller wrote, On 12/27/2009 05:11 AM: > From: David Miller <davem@davemloft.net> > Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) > >> From: Stephen Hemminger <shemminger@linux-foundation.org> >> Date: Sat, 26 Dec 2009 14:05:44 -0800 >> >>> Other drivers may have same problem, I really think this ought >>> to be done at higher level. >> I tend to agree with you, and I thought we had handled all >> cases. Let's simply make AF_PACKET linearize the link >> level header before sending things out to the transmit path. >> >> I can work on this if you want. > > Actually Stephen, I took a look and I can't see how AF_PACKET > can create this situation. > > It always copies into the linear area of the SKB it allocates > for sendmsg() processing. Whether the data comes from sendmsg > data or the mmap() ring buffer. Actually, I think there is a bug in this place, but of course this might be unconnected. Anyway, Michael, could you try this patch? BTW, did you try with CONFIG_PACKET_MMAP disabled? Thanks, Jarek P. -----------------> Changing an skb after dev_queue_xmit() is illegal. And since it's inconsistent to treat specially net_xmit_errno() non-zero return, while ignoring other dev_queue_xmit() errors, there is no reason to break the loop in tpacket_snd() in this case. With debugging by: Stephen Hemminger <shemminger@linux-foundation.org> Reported-by: Michael Breuer <mbreuer@majjas.com> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> --- net/packet/af_packet.c | 8 +++----- 1 files changed, 3 insertions(+), 5 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index e0516a2..984a1fa 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1021,8 +1021,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) status = TP_STATUS_SEND_REQUEST; err = dev_queue_xmit(skb); - if (unlikely(err > 0 && (err = net_xmit_errno(err)) != 0)) - goto out_xmit; + if (unlikely(err > 0)) + err = net_xmit_errno(err); + packet_increment_head(&po->tx_ring); len_sum += tp_len; } while (likely((ph != NULL) || @@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) err = len_sum; goto out_put; -out_xmit: - skb->destructor = sock_wfree; - atomic_dec(&po->tx_ring.pending); out_status: __packet_set_status(po, ph, status); kfree_skb(skb); ^ permalink raw reply related [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-05 23:07 ` [PATCH] af_packet: Don't use skb after dev_queue_xmit() Jarek Poplawski @ 2010-01-05 23:16 ` Michael Breuer 2010-01-05 23:29 ` Jarek Poplawski 2010-01-06 2:36 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-05 23:16 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On 1/5/2010 6:07 PM, Jarek Poplawski wrote: > David Miller wrote, On 12/27/2009 05:11 AM: > > >> From: David Miller<davem@davemloft.net> >> Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) >> >> >>> From: Stephen Hemminger<shemminger@linux-foundation.org> >>> Date: Sat, 26 Dec 2009 14:05:44 -0800 >>> >>> >>>> Other drivers may have same problem, I really think this ought >>>> to be done at higher level. >>>> >>> I tend to agree with you, and I thought we had handled all >>> cases. Let's simply make AF_PACKET linearize the link >>> level header before sending things out to the transmit path. >>> >>> I can work on this if you want. >>> >> Actually Stephen, I took a look and I can't see how AF_PACKET >> can create this situation. >> >> It always copies into the linear area of the SKB it allocates >> for sendmsg() processing. Whether the data comes from sendmsg >> data or the mmap() ring buffer. >> > Actually, I think there is a bug in this place, but of course this > might be unconnected. Anyway, Michael, could you try this patch? > BTW, did you try with CONFIG_PACKET_MMAP disabled? > I did not try with CONFIG_PACKET_MMAP disabled. Please let me know which permutations of the following would be most valuable to test: Davids patch in/out The attached patch in/out CONFIG_PACKET_MMAP on/off I'm thinking your patch in, david's out, with and without CONFIG_PACKET_MMAP. > Thanks, > Jarek P. > -----------------> > > Changing an skb after dev_queue_xmit() is illegal. And since it's > inconsistent to treat specially net_xmit_errno() non-zero return, > while ignoring other dev_queue_xmit() errors, there is no reason > to break the loop in tpacket_snd() in this case. > > With debugging by: Stephen Hemminger<shemminger@linux-foundation.org> > > Reported-by: Michael Breuer<mbreuer@majjas.com> > Signed-off-by: Jarek Poplawski<jarkao2@gmail.com> > --- > > net/packet/af_packet.c | 8 +++----- > 1 files changed, 3 insertions(+), 5 deletions(-) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index e0516a2..984a1fa 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -1021,8 +1021,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > > status = TP_STATUS_SEND_REQUEST; > err = dev_queue_xmit(skb); > - if (unlikely(err> 0&& (err = net_xmit_errno(err)) != 0)) > - goto out_xmit; > + if (unlikely(err> 0)) > + err = net_xmit_errno(err); > + > packet_increment_head(&po->tx_ring); > len_sum += tp_len; > } while (likely((ph != NULL) || > @@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > err = len_sum; > goto out_put; > > -out_xmit: > - skb->destructor = sock_wfree; > - atomic_dec(&po->tx_ring.pending); > out_status: > __packet_set_status(po, ph, status); > kfree_skb(skb); > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-05 23:16 ` Michael Breuer @ 2010-01-05 23:29 ` Jarek Poplawski 0 siblings, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-05 23:29 UTC (permalink / raw) To: Michael Breuer Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On Tue, Jan 05, 2010 at 06:16:58PM -0500, Michael Breuer wrote: > On 1/5/2010 6:07 PM, Jarek Poplawski wrote: > >David Miller wrote, On 12/27/2009 05:11 AM: > > > >>From: David Miller<davem@davemloft.net> > >>Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) > >> > >>>From: Stephen Hemminger<shemminger@linux-foundation.org> > >>>Date: Sat, 26 Dec 2009 14:05:44 -0800 > >>> > >>>>Other drivers may have same problem, I really think this ought > >>>>to be done at higher level. > >>>I tend to agree with you, and I thought we had handled all > >>>cases. Let's simply make AF_PACKET linearize the link > >>>level header before sending things out to the transmit path. > >>> > >>>I can work on this if you want. > >>Actually Stephen, I took a look and I can't see how AF_PACKET > >>can create this situation. > >> > >>It always copies into the linear area of the SKB it allocates > >>for sendmsg() processing. Whether the data comes from sendmsg > >>data or the mmap() ring buffer. > >Actually, I think there is a bug in this place, but of course this > >might be unconnected. Anyway, Michael, could you try this patch? > >BTW, did you try with CONFIG_PACKET_MMAP disabled? > I did not try with CONFIG_PACKET_MMAP disabled. > > Please let me know which permutations of the following would be most > valuable to test: > Davids patch in/out > The attached patch in/out > CONFIG_PACKET_MMAP on/off > > I'm thinking your patch in, david's out, with and without > CONFIG_PACKET_MMAP. I'm not sure about David's (except do it first ;-), but try my patch with CONFIG_PACKET_MMAP on; if there is no change, CONFIG_PACKET_MMAP off, without any patches. Thanks, Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-05 23:07 ` [PATCH] af_packet: Don't use skb after dev_queue_xmit() Jarek Poplawski 2010-01-05 23:16 ` Michael Breuer @ 2010-01-06 2:36 ` Michael Breuer 2010-01-06 7:22 ` Jarek Poplawski 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-06 2:36 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On 1/5/2010 6:07 PM, Jarek Poplawski wrote: > David Miller wrote, On 12/27/2009 05:11 AM: > > >> From: David Miller<davem@davemloft.net> >> Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) >> >> >>> From: Stephen Hemminger<shemminger@linux-foundation.org> >>> Date: Sat, 26 Dec 2009 14:05:44 -0800 >>> >>> >>>> Other drivers may have same problem, I really think this ought >>>> to be done at higher level. >>>> >>> I tend to agree with you, and I thought we had handled all >>> cases. Let's simply make AF_PACKET linearize the link >>> level header before sending things out to the transmit path. >>> >>> I can work on this if you want. >>> >> Actually Stephen, I took a look and I can't see how AF_PACKET >> can create this situation. >> >> It always copies into the linear area of the SKB it allocates >> for sendmsg() processing. Whether the data comes from sendmsg >> data or the mmap() ring buffer. >> > Actually, I think there is a bug in this place, but of course this > might be unconnected. Anyway, Michael, could you try this patch? > BTW, did you try with CONFIG_PACKET_MMAP disabled? > > Thanks, > Jarek P. > -----------------> > > Changing an skb after dev_queue_xmit() is illegal. And since it's > inconsistent to treat specially net_xmit_errno() non-zero return, > while ignoring other dev_queue_xmit() errors, there is no reason > to break the loop in tpacket_snd() in this case. > > With debugging by: Stephen Hemminger<shemminger@linux-foundation.org> > > Reported-by: Michael Breuer<mbreuer@majjas.com> > Signed-off-by: Jarek Poplawski<jarkao2@gmail.com> > --- > > net/packet/af_packet.c | 8 +++----- > 1 files changed, 3 insertions(+), 5 deletions(-) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index e0516a2..984a1fa 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -1021,8 +1021,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > > status = TP_STATUS_SEND_REQUEST; > err = dev_queue_xmit(skb); > - if (unlikely(err> 0&& (err = net_xmit_errno(err)) != 0)) > - goto out_xmit; > + if (unlikely(err> 0)) > + err = net_xmit_errno(err); > + > packet_increment_head(&po->tx_ring); > len_sum += tp_len; > } while (likely((ph != NULL) || > @@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > err = len_sum; > goto out_put; > > -out_xmit: > - skb->destructor = sock_wfree; > - atomic_dec(&po->tx_ring.pending); > out_status: > __packet_set_status(po, ph, status); > kfree_skb(skb); > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > This patch solves the original reported oops - as did Steven's patch of 12/26: [PATCH] sky2: make sure ethernet header is in transmit skb (I ran without Steven's patch and with this patch). Oddly, with this patch vs. Steven's - I'm getting software interrupt errors sporadically while not under load - with Steven's I get the frequently while under load (as per nethogs). For example: Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x40000008 Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=124...125 report=125 done=125 Jan 5 21:29:00 mail kernel: 124: 0xb38de0be(5374) Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x8 Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=126...127 report=126 done=127 Jan 5 21:29:00 mail kernel: 126: 0xb38d80be(9014) I also believe (can't prove yet) that my load test is slower with this patch vs. the sky2 patch. Is it possible that this patch is causing the transmission to momentarily halt such that the overall utilization appears low at the time I see the interrupt errors, vs. the other patch which perhaps doesn't interrupt the traffic flow quite so much? I haven't run yet with CONFIG_PACKET_MMAP disabled. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 2:36 ` Michael Breuer @ 2010-01-06 7:22 ` Jarek Poplawski 2010-01-06 9:15 ` [PATCH alt.2] " Jarek Poplawski 2010-01-06 19:49 ` [PATCH] " Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-06 7:22 UTC (permalink / raw) To: Michael Breuer Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On Tue, Jan 05, 2010 at 09:36:28PM -0500, Michael Breuer wrote: > On 1/5/2010 6:07 PM, Jarek Poplawski wrote: >> David Miller wrote, On 12/27/2009 05:11 AM: >> >> >>> From: David Miller<davem@davemloft.net> >>> Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) >>> >>> >>>> From: Stephen Hemminger<shemminger@linux-foundation.org> >>>> Date: Sat, 26 Dec 2009 14:05:44 -0800 >>>> >>>> >>>>> Other drivers may have same problem, I really think this ought >>>>> to be done at higher level. >>>>> >>>> I tend to agree with you, and I thought we had handled all >>>> cases. Let's simply make AF_PACKET linearize the link >>>> level header before sending things out to the transmit path. >>>> >>>> I can work on this if you want. >>>> >>> Actually Stephen, I took a look and I can't see how AF_PACKET >>> can create this situation. >>> >>> It always copies into the linear area of the SKB it allocates >>> for sendmsg() processing. Whether the data comes from sendmsg >>> data or the mmap() ring buffer. >>> >> Actually, I think there is a bug in this place, but of course this >> might be unconnected. Anyway, Michael, could you try this patch? >> BTW, did you try with CONFIG_PACKET_MMAP disabled? >> >> Thanks, >> Jarek P. >> -----------------> >> >> Changing an skb after dev_queue_xmit() is illegal. And since it's >> inconsistent to treat specially net_xmit_errno() non-zero return, >> while ignoring other dev_queue_xmit() errors, there is no reason >> to break the loop in tpacket_snd() in this case. >> >> With debugging by: Stephen Hemminger<shemminger@linux-foundation.org> >> >> Reported-by: Michael Breuer<mbreuer@majjas.com> >> Signed-off-by: Jarek Poplawski<jarkao2@gmail.com> >> --- >> >> net/packet/af_packet.c | 8 +++----- >> 1 files changed, 3 insertions(+), 5 deletions(-) >> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >> index e0516a2..984a1fa 100644 >> --- a/net/packet/af_packet.c >> +++ b/net/packet/af_packet.c >> @@ -1021,8 +1021,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >> >> status = TP_STATUS_SEND_REQUEST; >> err = dev_queue_xmit(skb); >> - if (unlikely(err> 0&& (err = net_xmit_errno(err)) != 0)) >> - goto out_xmit; >> + if (unlikely(err> 0)) >> + err = net_xmit_errno(err); >> + >> packet_increment_head(&po->tx_ring); >> len_sum += tp_len; >> } while (likely((ph != NULL) || >> @@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >> err = len_sum; >> goto out_put; >> >> -out_xmit: >> - skb->destructor = sock_wfree; >> - atomic_dec(&po->tx_ring.pending); >> out_status: >> __packet_set_status(po, ph, status); >> kfree_skb(skb); >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ >> > This patch solves the original reported oops - as did Steven's patch of > 12/26: [PATCH] sky2: make sure ethernet header is in transmit skb (I ran > without Steven's patch and with this patch). > > Oddly, with this patch vs. Steven's - I'm getting software interrupt > errors sporadically while not under load - with Steven's I get the > frequently while under load (as per nethogs). For example: > Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x40000008 > Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=124...125 report=125 > done=125 > Jan 5 21:29:00 mail kernel: 124: 0xb38de0be(5374) > Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 > Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x8 > Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=126...127 report=126 > done=127 > Jan 5 21:29:00 mail kernel: 126: 0xb38d80be(9014) > > I also believe (can't prove yet) that my load test is slower with this > patch vs. the sky2 patch. > > Is it possible that this patch is causing the transmission to > momentarily halt such that the overall utilization appears low at the > time I see the interrupt errors, vs. the other patch which perhaps > doesn't interrupt the traffic flow quite so much? Yes, without this patch xmit could be stopped earlier on some kind of errors, with retransmit of the last message possible. On the other hand, other dev_queue_xmit() (negative) errors, are ignored. So this place could be still improved by adding proper err handling (or removing getting err return from dev_queue_xmit() at all). Anyway, I think this patch should be a safe proposal for stable. If so, David, please add Michael's "Tested-by". > > I haven't run yet with CONFIG_PACKET_MMAP disabled. > This should behave similarly as MMAP with this patch or maybe even better in case of errors. Thanks, Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH alt.2] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 7:22 ` Jarek Poplawski @ 2010-01-06 9:15 ` Jarek Poplawski 2010-01-06 14:49 ` Stephen Hemminger 2010-01-06 19:49 ` [PATCH] " Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-06 9:15 UTC (permalink / raw) To: Michael Breuer Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On Wed, Jan 06, 2010 at 07:22:08AM +0000, Jarek Poplawski wrote: > On Tue, Jan 05, 2010 at 09:36:28PM -0500, Michael Breuer wrote: > > Is it possible that this patch is causing the transmission to > > momentarily halt such that the overall utilization appears low at the > > time I see the interrupt errors, vs. the other patch which perhaps > > doesn't interrupt the traffic flow quite so much? > > Yes, without this patch xmit could be stopped earlier on some kind of > errors, with retransmit of the last message possible. On the other > hand, other dev_queue_xmit() (negative) errors, are ignored. So this > place could be still improved by adding proper err handling (or > removing getting err return from dev_queue_xmit() at all). On the other hand, returning just this one net_xmit_errno() would be consistent with non-MMAP sendmsg, so here is an alternative version for testing. Thanks, Jarek P. -----------------> (alternative #2) tpacket_snd() can change and kfree an skb after dev_queue_xmit(), which is illegal. With debugging by: Stephen Hemminger <shemminger@linux-foundation.org> Reported-by: Michael Breuer <mbreuer@majjas.com> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> --- net/packet/af_packet.c | 8 +++----- 1 files changed, 3 insertions(+), 5 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index e0516a2..aba2049 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1021,9 +1021,10 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) status = TP_STATUS_SEND_REQUEST; err = dev_queue_xmit(skb); - if (unlikely(err > 0 && (err = net_xmit_errno(err)) != 0)) - goto out_xmit; packet_increment_head(&po->tx_ring); + if (unlikely(err > 0 && (err = net_xmit_errno(err)) != 0)) + goto out_put; + len_sum += tp_len; } while (likely((ph != NULL) || ((!(msg->msg_flags & MSG_DONTWAIT)) && @@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) err = len_sum; goto out_put; -out_xmit: - skb->destructor = sock_wfree; - atomic_dec(&po->tx_ring.pending); out_status: __packet_set_status(po, ph, status); kfree_skb(skb); ^ permalink raw reply related [flat|nested] 145+ messages in thread
* Re: [PATCH alt.2] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 9:15 ` [PATCH alt.2] " Jarek Poplawski @ 2010-01-06 14:49 ` Stephen Hemminger 2010-01-06 19:40 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2010-01-06 14:49 UTC (permalink / raw) To: Jarek Poplawski Cc: Michael Breuer, David Miller, akpm, flyboy, linux-kernel, netdev On Wed, 6 Jan 2010 09:15:32 +0000 Jarek Poplawski <jarkao2@gmail.com> wrote: > tpacket_snd() can change and kfree an skb after dev_queue_xmit(), > which is illegal. > > With debugging by: Stephen Hemminger <shemminger@linux-foundation.org> > > Reported-by: Michael Breuer <mbreuer@majjas.com> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Good catch. Acked-by: Stephen Hemminger <shemminger@vyatta.com> -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH alt.2] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 14:49 ` Stephen Hemminger @ 2010-01-06 19:40 ` Jarek Poplawski 0 siblings, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-06 19:40 UTC (permalink / raw) To: Stephen Hemminger Cc: Michael Breuer, David Miller, akpm, flyboy, linux-kernel, netdev On Wed, Jan 06, 2010 at 06:49:57AM -0800, Stephen Hemminger wrote: > On Wed, 6 Jan 2010 09:15:32 +0000 > Jarek Poplawski <jarkao2@gmail.com> wrote: > > > tpacket_snd() can change and kfree an skb after dev_queue_xmit(), > > which is illegal. > > > > With debugging by: Stephen Hemminger <shemminger@linux-foundation.org> > > > > Reported-by: Michael Breuer <mbreuer@majjas.com> > > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> > > Good catch. > Acked-by: Stephen Hemminger <shemminger@vyatta.com> First of all, we're incredibly lucky that such a helpful guy as Michael had misfortune to hit so helpfully broken piece of hardware ;-) Thanks, Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 7:22 ` Jarek Poplawski 2010-01-06 9:15 ` [PATCH alt.2] " Jarek Poplawski @ 2010-01-06 19:49 ` Michael Breuer 2010-01-06 20:22 ` Jarek Poplawski 2010-01-06 21:10 ` Stephen Hemminger 1 sibling, 2 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-06 19:49 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On 1/6/2010 2:22 AM, Jarek Poplawski wrote: > On Tue, Jan 05, 2010 at 09:36:28PM -0500, Michael Breuer wrote: > >> On 1/5/2010 6:07 PM, Jarek Poplawski wrote: >> >>> David Miller wrote, On 12/27/2009 05:11 AM: >>> >>> >>> >>>> From: David Miller<davem@davemloft.net> >>>> Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) >>>> >>>> >>>> >>>>> From: Stephen Hemminger<shemminger@linux-foundation.org> >>>>> Date: Sat, 26 Dec 2009 14:05:44 -0800 >>>>> >>>>> >>>>> >>>>>> Other drivers may have same problem, I really think this ought >>>>>> to be done at higher level. >>>>>> >>>>>> >>>>> I tend to agree with you, and I thought we had handled all >>>>> cases. Let's simply make AF_PACKET linearize the link >>>>> level header before sending things out to the transmit path. >>>>> >>>>> I can work on this if you want. >>>>> >>>>> >>>> Actually Stephen, I took a look and I can't see how AF_PACKET >>>> can create this situation. >>>> >>>> It always copies into the linear area of the SKB it allocates >>>> for sendmsg() processing. Whether the data comes from sendmsg >>>> data or the mmap() ring buffer. >>>> >>>> >>> Actually, I think there is a bug in this place, but of course this >>> might be unconnected. Anyway, Michael, could you try this patch? >>> BTW, did you try with CONFIG_PACKET_MMAP disabled? >>> >>> Thanks, >>> Jarek P. >>> -----------------> >>> >>> Changing an skb after dev_queue_xmit() is illegal. And since it's >>> inconsistent to treat specially net_xmit_errno() non-zero return, >>> while ignoring other dev_queue_xmit() errors, there is no reason >>> to break the loop in tpacket_snd() in this case. >>> >>> With debugging by: Stephen Hemminger<shemminger@linux-foundation.org> >>> >>> Reported-by: Michael Breuer<mbreuer@majjas.com> >>> Signed-off-by: Jarek Poplawski<jarkao2@gmail.com> >>> --- >>> >>> net/packet/af_packet.c | 8 +++----- >>> 1 files changed, 3 insertions(+), 5 deletions(-) >>> >>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >>> index e0516a2..984a1fa 100644 >>> --- a/net/packet/af_packet.c >>> +++ b/net/packet/af_packet.c >>> @@ -1021,8 +1021,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >>> >>> status = TP_STATUS_SEND_REQUEST; >>> err = dev_queue_xmit(skb); >>> - if (unlikely(err> 0&& (err = net_xmit_errno(err)) != 0)) >>> - goto out_xmit; >>> + if (unlikely(err> 0)) >>> + err = net_xmit_errno(err); >>> + >>> packet_increment_head(&po->tx_ring); >>> len_sum += tp_len; >>> } while (likely((ph != NULL) || >>> @@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >>> err = len_sum; >>> goto out_put; >>> >>> -out_xmit: >>> - skb->destructor = sock_wfree; >>> - atomic_dec(&po->tx_ring.pending); >>> out_status: >>> __packet_set_status(po, ph, status); >>> kfree_skb(skb); >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> Please read the FAQ at http://www.tux.org/lkml/ >>> >>> >> This patch solves the original reported oops - as did Steven's patch of >> 12/26: [PATCH] sky2: make sure ethernet header is in transmit skb (I ran >> without Steven's patch and with this patch). >> >> Oddly, with this patch vs. Steven's - I'm getting software interrupt >> errors sporadically while not under load - with Steven's I get the >> frequently while under load (as per nethogs). For example: >> Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x40000008 >> Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=124...125 report=125 >> done=125 >> Jan 5 21:29:00 mail kernel: 124: 0xb38de0be(5374) >> Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 >> Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x8 >> Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=126...127 report=126 >> done=127 >> Jan 5 21:29:00 mail kernel: 126: 0xb38d80be(9014) >> >> I also believe (can't prove yet) that my load test is slower with this >> patch vs. the sky2 patch. >> >> Is it possible that this patch is causing the transmission to >> momentarily halt such that the overall utilization appears low at the >> time I see the interrupt errors, vs. the other patch which perhaps >> doesn't interrupt the traffic flow quite so much? >> > Yes, without this patch xmit could be stopped earlier on some kind of > errors, with retransmit of the last message possible. On the other > hand, other dev_queue_xmit() (negative) errors, are ignored. So this > place could be still improved by adding proper err handling (or > removing getting err return from dev_queue_xmit() at all). > > Anyway, I think this patch should be a safe proposal for stable. If > so, David, please add Michael's "Tested-by". > > >> I haven't run yet with CONFIG_PACKET_MMAP disabled. >> >> > This should behave similarly as MMAP with this patch or maybe even > better in case of errors. > > Thanks, > Jarek P. > This patch at first behaved similarly to the previous one - seemed to be running a bit better... until the adapter went down :( This is the syslog output at the time the network failed: Jan 6 14:11:01 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Jan 6 14:11:01 mail kernel: sky2 software interrupt status 0x40000008 Jan 6 14:11:01 mail kernel: sky2 Tx ring pending=125...126 report=125 done=126 Jan 6 14:11:01 mail kernel: 125: 0xc9c340be(8310) <network dropped as seen by clients here> Jan 6 14:12:51 mail kernel: DRHD: handling fault status reg 2 Jan 6 14:12:51 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr fffa9ebfe000 Jan 6 14:12:51 mail kernel: DMAR:[fault reason 06] PTE Read access is not set Jan 6 14:12:51 mail kernel: sky2 0000:06:00.0: error interrupt status=0x80000000 Jan 6 14:12:51 mail kernel: sky2 software interrupt status 0x80000000 Jan 6 14:12:51 mail kernel: sky2 Tx ring pending=68...69 report=68 done=68 Jan 6 14:12:51 mail kernel: 68: 0x9ebfe4ca(105) Jan 6 14:12:51 mail kernel: sky2 hardware interrupt status 0xd000000 Jan 6 14:12:51 mail kernel: sky2 Tx ring pending=68...69 report=68 done=68 Jan 6 14:12:51 mail kernel: 68: 0x9ebfe4ca(105) Jan 6 14:12:51 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) Jan 6 14:12:51 mail smbd[6281]: [2010/01/06 14:12:51, 0] lib/util_sock.c:539(read_fd_with_timeout) Jan 6 14:12:51 mail smbd[6281]: [2010/01/06 14:12:51, 0] lib/util_sock.c:1491(get_peer_addr_internal) Jan 6 14:12:51 mail smbd[6281]: getpeername failed. Error was Transport endpoint is not connected Jan 6 14:12:51 mail smbd[6281]: read_fd_with_timeout: client 0.0.0.0 read error = Connection timed out. Much of the same... followed by sky2's unsuccessful attempts to reset the adapter - starting with NETDEV WATCHDOG oops: Jan 6 14:13:41 mail kernel: ------------[ cut here ]------------ Jan 6 14:13:41 mail kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164() Jan 6 14:13:41 mail kernel: Hardware name: System Product Name Jan 6 14:13:41 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out Jan 6 14:13:41 mail kernel: Modules linked in: ip6table_filter ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat nf_nat iptable_mangle iptable_raw bridge stp appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog gspca_spca505 snd_hda_intel snd_hda_codec snd_ens1371 gspca_main gameport snd_rawmidi snd_ac97_codec ac97_bus snd_hwdep snd_seq videodev snd_seq_device v4l1_compat v4l2_compat_ioctl32 wmi snd_pcm pcspkr i2c_i801 asus_atk0110 hwmon firewire_ohci snd_timer firewire_core snd iTCO_wdt crc_itu_t sky2 soundcore iTCO_vendor_support snd_page_alloc fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfil Jan 6 14:13:41 mail kernel: lrect [last unloaded: microcode] Jan 6 14:13:41 mail kernel: Pid: 25, comm: ksoftirqd/7 Tainted: G W 2.6.32-00839-g92238e5-dirty #40 Jan 6 14:13:41 mail kernel: Call Trace: Jan 6 14:13:41 mail kernel: <IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94 Jan 6 14:13:41 mail kernel: [<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43 Jan 6 14:13:41 mail kernel: [<ffffffff813e139f>] ? netif_tx_lock+0x44/0x6c Jan 6 14:13:41 mail kernel: [<ffffffff813e1507>] dev_watchdog+0xf3/0x164 Jan 6 14:13:41 mail kernel: [<ffffffff8106e8a4>] ? __queue_work+0x3a/0x42 Jan 6 14:13:41 mail kernel: [<ffffffff8106316b>] run_timer_softirq+0x1c8/0x270 Jan 6 14:13:41 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd Jan 6 14:13:41 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 Jan 6 14:13:41 mail kernel: <EOI> [<ffffffff810143a3>] do_softirq+0x4b/0xa6 Jan 6 14:13:41 mail kernel: [<ffffffff8105a730>] ksoftirqd+0x65/0xdc Jan 6 14:13:41 mail kernel: [<ffffffff8105a6cb>] ? ksoftirqd+0x0/0xdc Jan 6 14:13:41 mail kernel: [<ffffffff8107275c>] kthread+0x7f/0x87 Jan 6 14:13:41 mail kernel: [<ffffffff81012d1a>] child_rip+0xa/0x20 Jan 6 14:13:41 mail kernel: [<ffffffff8101267d>] ? restore_args+0x0/0x30 Jan 6 14:13:41 mail kernel: [<ffffffff810726dd>] ? kthread+0x0/0x87 Jan 6 14:13:41 mail kernel: [<ffffffff81012d10>] ? child_rip+0x0/0x20 Jan 6 14:13:41 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- Jan 6 14:13:41 mail kernel: sky2 eth0: tx timeout Jan 6 14:13:41 mail kernel: sky2 eth0: transmit ring 68 .. 27 report=68 done=68 Jan 6 14:13:41 mail kernel: sky2 eth0: disabling interface Jan 6 14:13:41 mail kernel: sky2 eth0: enabling interface <unrelated stuff> Jan 6 14:14:29 mail kernel: sky2 eth0: tx timeout Jan 6 14:14:29 mail kernel: sky2 eth0: transmit ring 2 .. 89 report=2 done=2 Jan 6 14:14:29 mail kernel: sky2 eth0: disabling interface Jan 6 14:14:29 mail kernel: sky2 eth0: enabling interface Jan 6 14:14:29 mail NetworkManager: <info> (eth0): carrier now OFF (device state 1) Jan 6 14:14:32 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both Jan 6 14:14:32 mail NetworkManager: <info> (eth0): carrier now ON (device state 1) Jan 6 14:14:34 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0 Jan 6 14:14:34 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via eth0 <unrelated stuff> Jan 6 14:15:17 mail kernel: sky2 eth0: tx timeout Jan 6 14:15:17 mail kernel: sky2 eth0: transmit ring 2 .. 89 report=2 done=2 Jan 6 14:15:17 mail kernel: sky2 eth0: disabling interface Jan 6 14:15:17 mail kernel: sky2 eth0: enabling interface This sequence - without the NETDEV WATCHDOG oops, and with differening DHCP client request/responses repeats until I rebooted the system. Note: networkmanager is not managing the interface. Also note that my external link (eth1) remained up and fully functional. Observations leading up the the failure: 1) Peak Receive rates about 20% higher than with the prior patch (still lower than with the earlier sky2 patch). 2) Peak rates sustained for longer than the prior patch. 3) Interrupt errors occurred mostly during periods of high receive rates - not at all while quiesced as seen with the prior patch. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 19:49 ` [PATCH] " Michael Breuer @ 2010-01-06 20:22 ` Jarek Poplawski 2010-01-06 20:33 ` Michael Breuer 2010-01-06 21:10 ` Stephen Hemminger 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-06 20:22 UTC (permalink / raw) To: Michael Breuer Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On Wed, Jan 06, 2010 at 02:49:38PM -0500, Michael Breuer wrote: > On 1/6/2010 2:22 AM, Jarek Poplawski wrote: > >On Tue, Jan 05, 2010 at 09:36:28PM -0500, Michael Breuer wrote: > >>On 1/5/2010 6:07 PM, Jarek Poplawski wrote: > >>>David Miller wrote, On 12/27/2009 05:11 AM: > >>> > >>> > >>>>From: David Miller<davem@davemloft.net> > >>>>Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) > >>>> > >>>> > >>>>>From: Stephen Hemminger<shemminger@linux-foundation.org> > >>>>>Date: Sat, 26 Dec 2009 14:05:44 -0800 > >>>>> > >>>>> > >>>>>>Other drivers may have same problem, I really think this ought > >>>>>>to be done at higher level. > >>>>>> > >>>>>I tend to agree with you, and I thought we had handled all > >>>>>cases. Let's simply make AF_PACKET linearize the link > >>>>>level header before sending things out to the transmit path. > >>>>> > >>>>>I can work on this if you want. > >>>>> > >>>>Actually Stephen, I took a look and I can't see how AF_PACKET > >>>>can create this situation. > >>>> > >>>>It always copies into the linear area of the SKB it allocates > >>>>for sendmsg() processing. Whether the data comes from sendmsg > >>>>data or the mmap() ring buffer. > >>>> > >>>Actually, I think there is a bug in this place, but of course this > >>>might be unconnected. Anyway, Michael, could you try this patch? > >>>BTW, did you try with CONFIG_PACKET_MMAP disabled? > >>> > >>>Thanks, > >>>Jarek P. > >>>-----------------> > >>> > >>>Changing an skb after dev_queue_xmit() is illegal. And since it's > >>>inconsistent to treat specially net_xmit_errno() non-zero return, > >>>while ignoring other dev_queue_xmit() errors, there is no reason > >>>to break the loop in tpacket_snd() in this case. > >>> > >>>With debugging by: Stephen Hemminger<shemminger@linux-foundation.org> > >>> > >>>Reported-by: Michael Breuer<mbreuer@majjas.com> > >>>Signed-off-by: Jarek Poplawski<jarkao2@gmail.com> > >>>--- > >>> > >>> net/packet/af_packet.c | 8 +++----- > >>> 1 files changed, 3 insertions(+), 5 deletions(-) > >>> > >>>diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > >>>index e0516a2..984a1fa 100644 > >>>--- a/net/packet/af_packet.c > >>>+++ b/net/packet/af_packet.c > >>>@@ -1021,8 +1021,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > >>> > >>> status = TP_STATUS_SEND_REQUEST; > >>> err = dev_queue_xmit(skb); > >>>- if (unlikely(err> 0&& (err = net_xmit_errno(err)) != 0)) > >>>- goto out_xmit; > >>>+ if (unlikely(err> 0)) > >>>+ err = net_xmit_errno(err); > >>>+ > >>> packet_increment_head(&po->tx_ring); > >>> len_sum += tp_len; > >>> } while (likely((ph != NULL) || > >>>@@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > >>> err = len_sum; > >>> goto out_put; > >>> > >>>-out_xmit: > >>>- skb->destructor = sock_wfree; > >>>- atomic_dec(&po->tx_ring.pending); > >>> out_status: > >>> __packet_set_status(po, ph, status); > >>> kfree_skb(skb); > >>>-- > >>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >>>the body of a message to majordomo@vger.kernel.org > >>>More majordomo info at http://vger.kernel.org/majordomo-info.html > >>>Please read the FAQ at http://www.tux.org/lkml/ > >>> > >>This patch solves the original reported oops - as did Steven's patch of > >>12/26: [PATCH] sky2: make sure ethernet header is in transmit skb (I ran > >>without Steven's patch and with this patch). > >> > >>Oddly, with this patch vs. Steven's - I'm getting software interrupt > >>errors sporadically while not under load - with Steven's I get the > >>frequently while under load (as per nethogs). For example: > >>Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt > >>status=0x40000008 > >>Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x40000008 > >>Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=124...125 report=125 > >>done=125 > >>Jan 5 21:29:00 mail kernel: 124: 0xb38de0be(5374) > >>Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 > >>Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x8 > >>Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=126...127 report=126 > >>done=127 > >>Jan 5 21:29:00 mail kernel: 126: 0xb38d80be(9014) > >> > >>I also believe (can't prove yet) that my load test is slower with this > >>patch vs. the sky2 patch. > >> > >>Is it possible that this patch is causing the transmission to > >>momentarily halt such that the overall utilization appears low at the > >>time I see the interrupt errors, vs. the other patch which perhaps > >>doesn't interrupt the traffic flow quite so much? > >Yes, without this patch xmit could be stopped earlier on some kind of > >errors, with retransmit of the last message possible. On the other > >hand, other dev_queue_xmit() (negative) errors, are ignored. So this > >place could be still improved by adding proper err handling (or > >removing getting err return from dev_queue_xmit() at all). > > > >Anyway, I think this patch should be a safe proposal for stable. If > >so, David, please add Michael's "Tested-by". > > > >>I haven't run yet with CONFIG_PACKET_MMAP disabled. > >> > >This should behave similarly as MMAP with this patch or maybe even > >better in case of errors. > > > >Thanks, > >Jarek P. > This patch at first behaved similarly to the previous one - seemed > to be running a bit better... until the adapter went down :( I'm not sure: do you mean this patch above vs previous one by Stephen, or did you manage to try my "alernative #2" patch already? BTW, I forgot to mention, and maybe it doesn't matter here, but it would be better to (always) use my sky2 patch from Berck Nash's thread. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 20:22 ` Jarek Poplawski @ 2010-01-06 20:33 ` Michael Breuer 2010-01-06 21:09 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-06 20:33 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On 1/6/2010 3:22 PM, Jarek Poplawski wrote: > On Wed, Jan 06, 2010 at 02:49:38PM -0500, Michael Breuer wrote: > >> On 1/6/2010 2:22 AM, Jarek Poplawski wrote: >> >>> On Tue, Jan 05, 2010 at 09:36:28PM -0500, Michael Breuer wrote: >>> >>>> On 1/5/2010 6:07 PM, Jarek Poplawski wrote: >>>> >>>>> David Miller wrote, On 12/27/2009 05:11 AM: >>>>> >>>>> >>>>> >>>>>> From: David Miller<davem@davemloft.net> >>>>>> Date: Sat, 26 Dec 2009 19:44:18 -0800 (PST) >>>>>> >>>>>> >>>>>> >>>>>>> From: Stephen Hemminger<shemminger@linux-foundation.org> >>>>>>> Date: Sat, 26 Dec 2009 14:05:44 -0800 >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Other drivers may have same problem, I really think this ought >>>>>>>> to be done at higher level. >>>>>>>> >>>>>>>> >>>>>>> I tend to agree with you, and I thought we had handled all >>>>>>> cases. Let's simply make AF_PACKET linearize the link >>>>>>> level header before sending things out to the transmit path. >>>>>>> >>>>>>> I can work on this if you want. >>>>>>> >>>>>>> >>>>>> Actually Stephen, I took a look and I can't see how AF_PACKET >>>>>> can create this situation. >>>>>> >>>>>> It always copies into the linear area of the SKB it allocates >>>>>> for sendmsg() processing. Whether the data comes from sendmsg >>>>>> data or the mmap() ring buffer. >>>>>> >>>>>> >>>>> Actually, I think there is a bug in this place, but of course this >>>>> might be unconnected. Anyway, Michael, could you try this patch? >>>>> BTW, did you try with CONFIG_PACKET_MMAP disabled? >>>>> >>>>> Thanks, >>>>> Jarek P. >>>>> -----------------> >>>>> >>>>> Changing an skb after dev_queue_xmit() is illegal. And since it's >>>>> inconsistent to treat specially net_xmit_errno() non-zero return, >>>>> while ignoring other dev_queue_xmit() errors, there is no reason >>>>> to break the loop in tpacket_snd() in this case. >>>>> >>>>> With debugging by: Stephen Hemminger<shemminger@linux-foundation.org> >>>>> >>>>> Reported-by: Michael Breuer<mbreuer@majjas.com> >>>>> Signed-off-by: Jarek Poplawski<jarkao2@gmail.com> >>>>> --- >>>>> >>>>> net/packet/af_packet.c | 8 +++----- >>>>> 1 files changed, 3 insertions(+), 5 deletions(-) >>>>> >>>>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >>>>> index e0516a2..984a1fa 100644 >>>>> --- a/net/packet/af_packet.c >>>>> +++ b/net/packet/af_packet.c >>>>> @@ -1021,8 +1021,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >>>>> >>>>> status = TP_STATUS_SEND_REQUEST; >>>>> err = dev_queue_xmit(skb); >>>>> - if (unlikely(err> 0&& (err = net_xmit_errno(err)) != 0)) >>>>> - goto out_xmit; >>>>> + if (unlikely(err> 0)) >>>>> + err = net_xmit_errno(err); >>>>> + >>>>> packet_increment_head(&po->tx_ring); >>>>> len_sum += tp_len; >>>>> } while (likely((ph != NULL) || >>>>> @@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >>>>> err = len_sum; >>>>> goto out_put; >>>>> >>>>> -out_xmit: >>>>> - skb->destructor = sock_wfree; >>>>> - atomic_dec(&po->tx_ring.pending); >>>>> out_status: >>>>> __packet_set_status(po, ph, status); >>>>> kfree_skb(skb); >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> Please read the FAQ at http://www.tux.org/lkml/ >>>>> >>>>> >>>> This patch solves the original reported oops - as did Steven's patch of >>>> 12/26: [PATCH] sky2: make sure ethernet header is in transmit skb (I ran >>>> without Steven's patch and with this patch). >>>> >>>> Oddly, with this patch vs. Steven's - I'm getting software interrupt >>>> errors sporadically while not under load - with Steven's I get the >>>> frequently while under load (as per nethogs). For example: >>>> Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x40000008 >>>> Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x40000008 >>>> Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=124...125 report=125 >>>> done=125 >>>> Jan 5 21:29:00 mail kernel: 124: 0xb38de0be(5374) >>>> Jan 5 21:29:00 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 >>>> Jan 5 21:29:00 mail kernel: sky2 software interrupt status 0x8 >>>> Jan 5 21:29:00 mail kernel: sky2 Tx ring pending=126...127 report=126 >>>> done=127 >>>> Jan 5 21:29:00 mail kernel: 126: 0xb38d80be(9014) >>>> >>>> I also believe (can't prove yet) that my load test is slower with this >>>> patch vs. the sky2 patch. >>>> >>>> Is it possible that this patch is causing the transmission to >>>> momentarily halt such that the overall utilization appears low at the >>>> time I see the interrupt errors, vs. the other patch which perhaps >>>> doesn't interrupt the traffic flow quite so much? >>>> >>> Yes, without this patch xmit could be stopped earlier on some kind of >>> errors, with retransmit of the last message possible. On the other >>> hand, other dev_queue_xmit() (negative) errors, are ignored. So this >>> place could be still improved by adding proper err handling (or >>> removing getting err return from dev_queue_xmit() at all). >>> >>> Anyway, I think this patch should be a safe proposal for stable. If >>> so, David, please add Michael's "Tested-by". >>> >>> >>>> I haven't run yet with CONFIG_PACKET_MMAP disabled. >>>> >>>> >>> This should behave similarly as MMAP with this patch or maybe even >>> better in case of errors. >>> >>> Thanks, >>> Jarek P. >>> >> This patch at first behaved similarly to the previous one - seemed >> to be running a bit better... until the adapter went down :( >> > I'm not sure: do you mean this patch above vs previous one by Stephen, > or did you manage to try my "alernative #2" patch already? > > BTW, I forgot to mention, and maybe it doesn't matter here, but it > would be better to (always) use my sky2 patch from Berck Nash's > thread. > > Jarek P. > This was using "alternative #2" patch. I didn't get the hang with alternative #1. Your sky2 patch from Berck Nash's thread was included in both cases; Stephen's was not. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 20:33 ` Michael Breuer @ 2010-01-06 21:09 ` Jarek Poplawski 2010-01-06 21:32 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-06 21:09 UTC (permalink / raw) To: Michael Breuer Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On Wed, Jan 06, 2010 at 03:33:05PM -0500, Michael Breuer wrote: > On 1/6/2010 3:22 PM, Jarek Poplawski wrote: > >On Wed, Jan 06, 2010 at 02:49:38PM -0500, Michael Breuer wrote: > >>On 1/6/2010 2:22 AM, Jarek Poplawski wrote: > >>>On Tue, Jan 05, 2010 at 09:36:28PM -0500, Michael Breuer wrote: > >>>>On 1/5/2010 6:07 PM, Jarek Poplawski wrote: > >>>>>-----------------> > >>>>> > >>>>>Changing an skb after dev_queue_xmit() is illegal. And since it's > >>>>>inconsistent to treat specially net_xmit_errno() non-zero return, > >>>>>while ignoring other dev_queue_xmit() errors, there is no reason > >>>>>to break the loop in tpacket_snd() in this case. > >>>>> > >>>>>With debugging by: Stephen Hemminger<shemminger@linux-foundation.org> > >>>>> > >>>>>Reported-by: Michael Breuer<mbreuer@majjas.com> > >>>>>Signed-off-by: Jarek Poplawski<jarkao2@gmail.com> > >>>>>--- > >>>>> > >>>>> net/packet/af_packet.c | 8 +++----- > >>>>> 1 files changed, 3 insertions(+), 5 deletions(-) > >>>>> > >>>>>diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > >>>>>index e0516a2..984a1fa 100644 > >>>>>--- a/net/packet/af_packet.c > >>>>>+++ b/net/packet/af_packet.c > >>>>>@@ -1021,8 +1021,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > >>>>> > >>>>> status = TP_STATUS_SEND_REQUEST; > >>>>> err = dev_queue_xmit(skb); > >>>>>- if (unlikely(err> 0&& (err = net_xmit_errno(err)) != 0)) > >>>>>- goto out_xmit; > >>>>>+ if (unlikely(err> 0)) > >>>>>+ err = net_xmit_errno(err); > >>>>>+ > >>>>> packet_increment_head(&po->tx_ring); > >>>>> len_sum += tp_len; > >>>>> } while (likely((ph != NULL) || > >>>>>@@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > >>>>> err = len_sum; > >>>>> goto out_put; > >>>>> > >>>>>-out_xmit: > >>>>>- skb->destructor = sock_wfree; > >>>>>- atomic_dec(&po->tx_ring.pending); > >>>>> out_status: > >>>>> __packet_set_status(po, ph, status); > >>>>> kfree_skb(skb); > >>>>>-- ... > >>This patch at first behaved similarly to the previous one - seemed > >>to be running a bit better... until the adapter went down :( > >I'm not sure: do you mean this patch above vs previous one by Stephen, > >or did you manage to try my "alernative #2" patch already? > > > >BTW, I forgot to mention, and maybe it doesn't matter here, but it > >would be better to (always) use my sky2 patch from Berck Nash's > >thread. > > > >Jarek P. > This was using "alternative #2" patch. I didn't get the hang with > alternative #1. Your sky2 patch from Berck Nash's thread was > included in both cases; Stephen's was not. OK, so I guess "alternative #1" (above) seems safer to recommend for now (as I assumed earlier). On the other hand, we really don't know if it's only because it's because it's nicer for your hardware (or still some other bug around), so as before: let David choose ;-) BTW, I think you could still use Stephen's patch too (there might be still something more like this). There was also mentioned this network manager again. I might be wrong, but IMHO there could be some interaction even if it doesn't use this device; so could/did you try to disable it entirely? Thanks for testing! Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 21:09 ` Jarek Poplawski @ 2010-01-06 21:32 ` Michael Breuer 0 siblings, 0 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-06 21:32 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, shemminger, akpm, flyboy, linux-kernel, netdev On 1/6/2010 4:09 PM, Jarek Poplawski wrote: > On Wed, Jan 06, 2010 at 03:33:05PM -0500, Michael Breuer wrote: > >> On 1/6/2010 3:22 PM, Jarek Poplawski wrote: >> >>> On Wed, Jan 06, 2010 at 02:49:38PM -0500, Michael Breuer wrote: >>> >>>> On 1/6/2010 2:22 AM, Jarek Poplawski wrote: >>>> >>>>> On Tue, Jan 05, 2010 at 09:36:28PM -0500, Michael Breuer wrote: >>>>> >>>>>> On 1/5/2010 6:07 PM, Jarek Poplawski wrote: >>>>>> >>>>>>> -----------------> >>>>>>> >>>>>>> Changing an skb after dev_queue_xmit() is illegal. And since it's >>>>>>> inconsistent to treat specially net_xmit_errno() non-zero return, >>>>>>> while ignoring other dev_queue_xmit() errors, there is no reason >>>>>>> to break the loop in tpacket_snd() in this case. >>>>>>> >>>>>>> With debugging by: Stephen Hemminger<shemminger@linux-foundation.org> >>>>>>> >>>>>>> Reported-by: Michael Breuer<mbreuer@majjas.com> >>>>>>> Signed-off-by: Jarek Poplawski<jarkao2@gmail.com> >>>>>>> --- >>>>>>> >>>>>>> net/packet/af_packet.c | 8 +++----- >>>>>>> 1 files changed, 3 insertions(+), 5 deletions(-) >>>>>>> >>>>>>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >>>>>>> index e0516a2..984a1fa 100644 >>>>>>> --- a/net/packet/af_packet.c >>>>>>> +++ b/net/packet/af_packet.c >>>>>>> @@ -1021,8 +1021,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >>>>>>> >>>>>>> status = TP_STATUS_SEND_REQUEST; >>>>>>> err = dev_queue_xmit(skb); >>>>>>> - if (unlikely(err> 0&& (err = net_xmit_errno(err)) != 0)) >>>>>>> - goto out_xmit; >>>>>>> + if (unlikely(err> 0)) >>>>>>> + err = net_xmit_errno(err); >>>>>>> + >>>>>>> packet_increment_head(&po->tx_ring); >>>>>>> len_sum += tp_len; >>>>>>> } while (likely((ph != NULL) || >>>>>>> @@ -1033,9 +1034,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >>>>>>> err = len_sum; >>>>>>> goto out_put; >>>>>>> >>>>>>> -out_xmit: >>>>>>> - skb->destructor = sock_wfree; >>>>>>> - atomic_dec(&po->tx_ring.pending); >>>>>>> out_status: >>>>>>> __packet_set_status(po, ph, status); >>>>>>> kfree_skb(skb); >>>>>>> -- >>>>>>> > ... > >>>> This patch at first behaved similarly to the previous one - seemed >>>> to be running a bit better... until the adapter went down :( >>>> >>> I'm not sure: do you mean this patch above vs previous one by Stephen, >>> or did you manage to try my "alernative #2" patch already? >>> >>> BTW, I forgot to mention, and maybe it doesn't matter here, but it >>> would be better to (always) use my sky2 patch from Berck Nash's >>> thread. >>> >>> Jarek P. >>> >> This was using "alternative #2" patch. I didn't get the hang with >> alternative #1. Your sky2 patch from Berck Nash's thread was >> included in both cases; Stephen's was not. >> > OK, so I guess "alternative #1" (above) seems safer to recommend for > now (as I assumed earlier). > > On the other hand, we really don't know if it's only because it's > because it's nicer for your hardware (or still some other bug around), > so as before: let David choose ;-) > > BTW, I think you could still use Stephen's patch too (there might be > still something more like this). There was also mentioned this network > manager again. I might be wrong, but IMHO there could be some > interaction even if it doesn't use this device; so could/did you try > to disable it entirely? > > Thanks for testing! > Jarek P. > > > Just reran without the network manager - no change. Going to rerun with Stephen's new patch, alternative #1, and the patch from Berck Nash's thread. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 19:49 ` [PATCH] " Michael Breuer 2010-01-06 20:22 ` Jarek Poplawski @ 2010-01-06 21:10 ` Stephen Hemminger 2010-01-06 21:20 ` Michael Breuer 2010-01-06 23:26 ` Michael Breuer 1 sibling, 2 replies; 145+ messages in thread From: Stephen Hemminger @ 2010-01-06 21:10 UTC (permalink / raw) To: Michael Breuer Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On Wed, 06 Jan 2010 14:49:38 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > This patch at first behaved similarly to the previous one - seemed to be > running a bit better... until the adapter went down :( > > This is the syslog output at the time the network failed: > Jan 6 14:11:01 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Jan 6 14:11:01 mail kernel: sky2 software interrupt status 0x40000008 Could you go back to baseline sky2 driver. The display code might be buggy. These bits indicate an error in the MAC. The interrupt source enabled is Transmit FIFO underrun. Looking at how vendor driver handles this. It looks like the Yukon EC_U chip doesn't really do Jumbo frames correctly. Maybe not enough internal buffering to ensure that the whole packet is in the chip. Of course, none of this is in the chip manual. Does this help -------------- --- a/drivers/net/sky2.c 2010-01-06 12:48:43.012318966 -0800 +++ b/drivers/net/sky2.c 2010-01-06 13:05:31.273987255 -0800 @@ -792,33 +792,21 @@ static void sky2_set_tx_stfwd(struct sky { struct net_device *dev = hw->dev[port]; - if ( (hw->chip_id == CHIP_ID_YUKON_EX && - hw->chip_rev != CHIP_REV_YU_EX_A0) || - hw->chip_id >= CHIP_ID_YUKON_FE_P) { - /* Yukon-Extreme B0 and further Extreme devices */ - /* enable Store & Forward mode for TX */ - - if (dev->mtu <= ETH_DATA_LEN) - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), - TX_JUMBO_DIS | TX_STFW_ENA); - - else - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), - TX_JUMBO_ENA| TX_STFW_ENA); - } else { - if (dev->mtu <= ETH_DATA_LEN) - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); - else { - /* set Tx GMAC FIFO Almost Empty Threshold */ - sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), - (ECU_JUMBO_WM << 16) | ECU_AE_THR); - - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_DIS); - - /* Can't do offload because of lack of store/forward */ - dev->features &= ~(NETIF_F_TSO | NETIF_F_SG | NETIF_F_ALL_CSUM); - } - } + if ( (hw->chip_id == CHIP_ID_YUKON_EX && hw->chip_rev != CHIP_REV_YU_EX_A0) || + hw->chip_id >= CHIP_ID_YUKON_FE_P) { + /* Yukon-Extreme B0 and further Extreme devices */ + /* enable Store & Forward mode for TX */ + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); + } else if (dev->mtu > ETH_DATA_LEN) { + /* set Tx GMAC FIFO Almost Empty Threshold */ + sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), + (ECU_JUMBO_WM << 16) | ECU_AE_THR); + /* disable Store & Forward mode for TX */ + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_DIS); + } else { + /* enable Store & Forward mode for TX */ + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); + } } static void sky2_mac_init(struct sky2_hw *hw, unsigned port) @@ -2185,11 +2173,16 @@ static int sky2_change_mtu(struct net_de if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) return -EINVAL; + /* MTU > 1500 on yukon FE and FE+ not allowed */ if (new_mtu > ETH_DATA_LEN && (hw->chip_id == CHIP_ID_YUKON_FE || hw->chip_id == CHIP_ID_YUKON_FE_P)) return -EINVAL; + /* TSO on Yukon Ultra and MTU > 1500 not supported */ + if (new_mtu > ETH_DATA_LEN && hw->chip_id == CHIP_ID_YUKON_EC_U) + dev->features &= ~NETIF_F_TSO; + if (!netif_running(dev)) { dev->mtu = new_mtu; return 0; @@ -2233,6 +2226,15 @@ static int sky2_change_mtu(struct net_de if (err) dev_close(dev); else { + /* WA for dev. #4.209 */ + if (hw->chip_id == CHIP_ID_YUKON_EC_U && + hw->chip_rev == CHIP_REV_YU_EC_U_A1) { + /* enable/disable Store & Forward mode for TX */ + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), + sky2->speed != SPEED_1000 + ? TX_STFW_ENA : TX_STFW_DIS); + } + gma_write16(hw, port, GM_GP_CTRL, ctl); netif_wake_queue(dev); --- a/drivers/net/sky2.h 2010-01-06 12:48:48.632247424 -0800 +++ b/drivers/net/sky2.h 2010-01-06 12:59:57.322078964 -0800 @@ -1901,8 +1901,8 @@ enum { TX_VLAN_TAG_ON = 1<<25,/* enable VLAN tagging */ TX_VLAN_TAG_OFF = 1<<24,/* disable VLAN tagging */ - TX_JUMBO_ENA = 1<<23,/* PCI Jumbo Mode enable (Yukon-EC Ultra) */ - TX_JUMBO_DIS = 1<<22,/* PCI Jumbo Mode enable (Yukon-EC Ultra) */ + TX_PCI_JUM_ENA = 1<<23,/* Enable PCI Jumbo Mode (Yukon-EC Ultra) */ + TX_PCI_JUM_DIS = 1<<22,/* Disable PCI Jumbo Mode (Yukon-EC Ultra) */ GMF_WSP_TST_ON = 1<<18,/* Write Shadow Pointer Test On */ GMF_WSP_TST_OFF = 1<<17,/* Write Shadow Pointer Test Off */ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 21:10 ` Stephen Hemminger @ 2010-01-06 21:20 ` Michael Breuer 2010-01-06 23:26 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-06 21:20 UTC (permalink / raw) To: Stephen Hemminger Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On 1/6/2010 4:10 PM, Stephen Hemminger wrote: > On Wed, 06 Jan 2010 14:49:38 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> This patch at first behaved similarly to the previous one - seemed to be >> running a bit better... until the adapter went down :( >> >> This is the syslog output at the time the network failed: >> Jan 6 14:11:01 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Jan 6 14:11:01 mail kernel: sky2 software interrupt status 0x40000008 >> > Could you go back to baseline sky2 driver. The display code might be buggy. > These bits indicate an error in the MAC. The interrupt source enabled > is Transmit FIFO underrun. > > Looking at how vendor driver handles this. > It looks like the Yukon EC_U chip doesn't really do Jumbo frames correctly. > Maybe not enough internal buffering to ensure that the whole packet > is in the chip. Of course, none of this is in the chip manual. > > Does this help > -------------- > --- a/drivers/net/sky2.c 2010-01-06 12:48:43.012318966 -0800 > +++ b/drivers/net/sky2.c 2010-01-06 13:05:31.273987255 -0800 > @@ -792,33 +792,21 @@ static void sky2_set_tx_stfwd(struct sky > { > struct net_device *dev = hw->dev[port]; > > - if ( (hw->chip_id == CHIP_ID_YUKON_EX&& > - hw->chip_rev != CHIP_REV_YU_EX_A0) || > - hw->chip_id>= CHIP_ID_YUKON_FE_P) { > - /* Yukon-Extreme B0 and further Extreme devices */ > - /* enable Store& Forward mode for TX */ > - > - if (dev->mtu<= ETH_DATA_LEN) > - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), > - TX_JUMBO_DIS | TX_STFW_ENA); > - > - else > - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), > - TX_JUMBO_ENA| TX_STFW_ENA); > - } else { > - if (dev->mtu<= ETH_DATA_LEN) > - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); > - else { > - /* set Tx GMAC FIFO Almost Empty Threshold */ > - sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), > - (ECU_JUMBO_WM<< 16) | ECU_AE_THR); > - > - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_DIS); > - > - /* Can't do offload because of lack of store/forward */ > - dev->features&= ~(NETIF_F_TSO | NETIF_F_SG | NETIF_F_ALL_CSUM); > - } > - } > + if ( (hw->chip_id == CHIP_ID_YUKON_EX&& hw->chip_rev != CHIP_REV_YU_EX_A0) || > + hw->chip_id>= CHIP_ID_YUKON_FE_P) { > + /* Yukon-Extreme B0 and further Extreme devices */ > + /* enable Store& Forward mode for TX */ > + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); > + } else if (dev->mtu> ETH_DATA_LEN) { > + /* set Tx GMAC FIFO Almost Empty Threshold */ > + sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), > + (ECU_JUMBO_WM<< 16) | ECU_AE_THR); > + /* disable Store& Forward mode for TX */ > + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_DIS); > + } else { > + /* enable Store& Forward mode for TX */ > + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); > + } > } > > static void sky2_mac_init(struct sky2_hw *hw, unsigned port) > @@ -2185,11 +2173,16 @@ static int sky2_change_mtu(struct net_de > if (new_mtu< ETH_ZLEN || new_mtu> ETH_JUMBO_MTU) > return -EINVAL; > > + /* MTU> 1500 on yukon FE and FE+ not allowed */ > if (new_mtu> ETH_DATA_LEN&& > (hw->chip_id == CHIP_ID_YUKON_FE || > hw->chip_id == CHIP_ID_YUKON_FE_P)) > return -EINVAL; > > + /* TSO on Yukon Ultra and MTU> 1500 not supported */ > + if (new_mtu> ETH_DATA_LEN&& hw->chip_id == CHIP_ID_YUKON_EC_U) > + dev->features&= ~NETIF_F_TSO; > + > if (!netif_running(dev)) { > dev->mtu = new_mtu; > return 0; > @@ -2233,6 +2226,15 @@ static int sky2_change_mtu(struct net_de > if (err) > dev_close(dev); > else { > + /* WA for dev. #4.209 */ > + if (hw->chip_id == CHIP_ID_YUKON_EC_U&& > + hw->chip_rev == CHIP_REV_YU_EC_U_A1) { > + /* enable/disable Store& Forward mode for TX */ > + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), > + sky2->speed != SPEED_1000 > + ? TX_STFW_ENA : TX_STFW_DIS); > + } > + > gma_write16(hw, port, GM_GP_CTRL, ctl); > > netif_wake_queue(dev); > --- a/drivers/net/sky2.h 2010-01-06 12:48:48.632247424 -0800 > +++ b/drivers/net/sky2.h 2010-01-06 12:59:57.322078964 -0800 > @@ -1901,8 +1901,8 @@ enum { > TX_VLAN_TAG_ON = 1<<25,/* enable VLAN tagging */ > TX_VLAN_TAG_OFF = 1<<24,/* disable VLAN tagging */ > > - TX_JUMBO_ENA = 1<<23,/* PCI Jumbo Mode enable (Yukon-EC Ultra) */ > - TX_JUMBO_DIS = 1<<22,/* PCI Jumbo Mode enable (Yukon-EC Ultra) */ > + TX_PCI_JUM_ENA = 1<<23,/* Enable PCI Jumbo Mode (Yukon-EC Ultra) */ > + TX_PCI_JUM_DIS = 1<<22,/* Disable PCI Jumbo Mode (Yukon-EC Ultra) */ > > GMF_WSP_TST_ON = 1<<18,/* Write Shadow Pointer Test On */ > GMF_WSP_TST_OFF = 1<<17,/* Write Shadow Pointer Test Off */ > I'll try this a bit later today. However, early on, I saw the same issues with MTU=1500. Also, maybe I'm missing something, but I can only recreate the issue with a high receive rate. Given the interaction with DHCP, for example, I'm thinking that there is some precondition that is as yet unknown. May be buggy hardware, or perhaps a race condition resulting in a corrupt i/o buffer somewhere. I'm wondering whether there's some useful place to insert some diagnostics on the RX side - at least we can see if there are any consistent events on the RX side preceding the TX error. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 21:10 ` Stephen Hemminger 2010-01-06 21:20 ` Michael Breuer @ 2010-01-06 23:26 ` Michael Breuer 2010-01-07 2:42 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-06 23:26 UTC (permalink / raw) To: Stephen Hemminger Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On 1/6/2010 4:10 PM, Stephen Hemminger wrote: > On Wed, 06 Jan 2010 14:49:38 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> This patch at first behaved similarly to the previous one - seemed to be >> running a bit better... until the adapter went down :( >> >> This is the syslog output at the time the network failed: >> Jan 6 14:11:01 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Jan 6 14:11:01 mail kernel: sky2 software interrupt status 0x40000008 >> > Could you go back to baseline sky2 driver. The display code might be buggy. > These bits indicate an error in the MAC. The interrupt source enabled > is Transmit FIFO underrun. > > Looking at how vendor driver handles this. > It looks like the Yukon EC_U chip doesn't really do Jumbo frames correctly. > Maybe not enough internal buffering to ensure that the whole packet > is in the chip. Of course, none of this is in the chip manual. > > Does this help > -------------- > --- a/drivers/net/sky2.c 2010-01-06 12:48:43.012318966 -0800 > +++ b/drivers/net/sky2.c 2010-01-06 13:05:31.273987255 -0800 > @@ -792,33 +792,21 @@ static void sky2_set_tx_stfwd(struct sky > { > struct net_device *dev = hw->dev[port]; > > - if ( (hw->chip_id == CHIP_ID_YUKON_EX&& > - hw->chip_rev != CHIP_REV_YU_EX_A0) || > - hw->chip_id>= CHIP_ID_YUKON_FE_P) { > - /* Yukon-Extreme B0 and further Extreme devices */ > - /* enable Store& Forward mode for TX */ > - > - if (dev->mtu<= ETH_DATA_LEN) > - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), > - TX_JUMBO_DIS | TX_STFW_ENA); > - > - else > - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), > - TX_JUMBO_ENA| TX_STFW_ENA); > - } else { > - if (dev->mtu<= ETH_DATA_LEN) > - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); > - else { > - /* set Tx GMAC FIFO Almost Empty Threshold */ > - sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), > - (ECU_JUMBO_WM<< 16) | ECU_AE_THR); > - > - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_DIS); > - > - /* Can't do offload because of lack of store/forward */ > - dev->features&= ~(NETIF_F_TSO | NETIF_F_SG | NETIF_F_ALL_CSUM); > - } > - } > + if ( (hw->chip_id == CHIP_ID_YUKON_EX&& hw->chip_rev != CHIP_REV_YU_EX_A0) || > + hw->chip_id>= CHIP_ID_YUKON_FE_P) { > + /* Yukon-Extreme B0 and further Extreme devices */ > + /* enable Store& Forward mode for TX */ > + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); > + } else if (dev->mtu> ETH_DATA_LEN) { > + /* set Tx GMAC FIFO Almost Empty Threshold */ > + sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), > + (ECU_JUMBO_WM<< 16) | ECU_AE_THR); > + /* disable Store& Forward mode for TX */ > + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_DIS); > + } else { > + /* enable Store& Forward mode for TX */ > + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); > + } > } > > static void sky2_mac_init(struct sky2_hw *hw, unsigned port) > @@ -2185,11 +2173,16 @@ static int sky2_change_mtu(struct net_de > if (new_mtu< ETH_ZLEN || new_mtu> ETH_JUMBO_MTU) > return -EINVAL; > > + /* MTU> 1500 on yukon FE and FE+ not allowed */ > if (new_mtu> ETH_DATA_LEN&& > (hw->chip_id == CHIP_ID_YUKON_FE || > hw->chip_id == CHIP_ID_YUKON_FE_P)) > return -EINVAL; > > + /* TSO on Yukon Ultra and MTU> 1500 not supported */ > + if (new_mtu> ETH_DATA_LEN&& hw->chip_id == CHIP_ID_YUKON_EC_U) > + dev->features&= ~NETIF_F_TSO; > + > if (!netif_running(dev)) { > dev->mtu = new_mtu; > return 0; > @@ -2233,6 +2226,15 @@ static int sky2_change_mtu(struct net_de > if (err) > dev_close(dev); > else { > + /* WA for dev. #4.209 */ > + if (hw->chip_id == CHIP_ID_YUKON_EC_U&& > + hw->chip_rev == CHIP_REV_YU_EC_U_A1) { > + /* enable/disable Store& Forward mode for TX */ > + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), > + sky2->speed != SPEED_1000 > + ? TX_STFW_ENA : TX_STFW_DIS); > + } > + > gma_write16(hw, port, GM_GP_CTRL, ctl); > > netif_wake_queue(dev); > --- a/drivers/net/sky2.h 2010-01-06 12:48:48.632247424 -0800 > +++ b/drivers/net/sky2.h 2010-01-06 12:59:57.322078964 -0800 > @@ -1901,8 +1901,8 @@ enum { > TX_VLAN_TAG_ON = 1<<25,/* enable VLAN tagging */ > TX_VLAN_TAG_OFF = 1<<24,/* disable VLAN tagging */ > > - TX_JUMBO_ENA = 1<<23,/* PCI Jumbo Mode enable (Yukon-EC Ultra) */ > - TX_JUMBO_DIS = 1<<22,/* PCI Jumbo Mode enable (Yukon-EC Ultra) */ > + TX_PCI_JUM_ENA = 1<<23,/* Enable PCI Jumbo Mode (Yukon-EC Ultra) */ > + TX_PCI_JUM_DIS = 1<<22,/* Disable PCI Jumbo Mode (Yukon-EC Ultra) */ > > GMF_WSP_TST_ON = 1<<18,/* Write Shadow Pointer Test On */ > GMF_WSP_TST_OFF = 1<<17,/* Write Shadow Pointer Test Off */ > Ok ... results - and maybe some more clues... Running with this patch; Jarek's "alternative 1", and the patch from the other thread. Not so good. No reported errors (sky2, etc.) - however with mtu=9000, lots of stuff broke: XDMCP; http via MASQ/netfilter, ssh connections intermittently (when large frames involved perhaps), etc. Tried to change mtu to 1500 on the fly, got a bunch of errors - and network watchdog kicked in. Have now rebooted with the same patches and mtu=1500. ... with mtu=1500, Everything is again working (i.e., XDMCP, netfilter, etc.) Load test with mtu=1500 went well for a while - high throughput sustained for a few minutes - then similar crash as before... but no interrup error messages this time until after the oops: <nothing of note before this> Jan 6 18:17:54 mail kernel: DRHD: handling fault status reg 2 Jan 6 18:17:54 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr 1bbfe000 Jan 6 18:17:54 mail kernel: DMAR:[fault reason 06] PTE Read access is not set Jan 6 18:17:54 mail kernel: sky2 0000:06:00.0: error interrupt status=0x80000000 Jan 6 18:17:54 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) Jan 6 18:18:04 mail kernel: ------------[ cut here ]------------ Jan 6 18:18:04 mail kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164() Jan 6 18:18:04 mail kernel: Hardware name: System Product Name Jan 6 18:18:04 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out Jan 6 18:18:04 mail kernel: Modules linked in: ip6table_filter ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat nf_nat iptable_mangle iptable_raw bridge stp appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec snd_hda_intel snd_hda_codec ac97_bus snd_hwdep snd_seq snd_seq_device gspca_spca505 gspca_main videodev v4l1_compat snd_pcm v4l2_compat_ioctl32 pcspkr asus_atk0110 hwmon i2c_i801 iTCO_wdt firewire_ohci iTCO_vendor_support firewire_core crc_itu_t snd_timer snd sky2 soundcore wmi snd_page_alloc fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfil Jan 6 18:18:04 mail kernel: lrect [last unloaded: microcode] Jan 6 18:18:04 mail kernel: Pid: 0, comm: swapper Tainted: G W 2.6.32-00840-gec8257c-dirty #41 Jan 6 18:18:04 mail kernel: Call Trace: Jan 6 18:18:04 mail kernel: <IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94 Jan 6 18:18:04 mail kernel: [<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43 Jan 6 18:18:04 mail kernel: [<ffffffff813e12bf>] ? netif_tx_lock+0x44/0x6c Jan 6 18:18:04 mail kernel: [<ffffffff813e1427>] dev_watchdog+0xf3/0x164 Jan 6 18:18:04 mail kernel: [<ffffffff81077696>] ? sched_clock_cpu+0x47/0xd1 Jan 6 18:18:04 mail kernel: [<ffffffff8106316b>] run_timer_softirq+0x1c8/0x270 Jan 6 18:18:04 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd Jan 6 18:18:04 mail kernel: [<ffffffff8107ef33>] ? tick_program_event+0x2a/0x2c Jan 6 18:18:04 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 Jan 6 18:18:04 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 Jan 6 18:18:04 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c Jan 6 18:18:04 mail kernel: [<ffffffff8146dd32>] smp_apic_timer_interrupt+0x86/0x94 Jan 6 18:18:04 mail kernel: [<ffffffff810127e3>] apic_timer_interrupt+0x13/0x20 Jan 6 18:18:04 mail kernel: <EOI> [<ffffffff812c4a06>] ? acpi_idle_enter_c1+0xb2/0xd0 Jan 6 18:18:04 mail kernel: [<ffffffff812c49ff>] ? acpi_idle_enter_c1+0xab/0xd0 Jan 6 18:18:04 mail kernel: [<ffffffff813a43b8>] ? cpuidle_idle_call+0x9e/0xfa Jan 6 18:18:04 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 Jan 6 18:18:04 mail kernel: [<ffffffff81463312>] ? start_secondary+0x201/0x242 Jan 6 18:18:04 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- Jan 6 18:18:04 mail kernel: sky2 eth0: tx timeout Jan 6 18:18:04 mail kernel: sky2 eth0: transmit ring 21 .. 108 report=21 done=21 Jan 6 18:18:04 mail kernel: sky2 eth0: disabling interface Jan 6 18:18:04 mail kernel: sky2 eth0: enabling interface <eth0 dead after this> ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-06 23:26 ` Michael Breuer @ 2010-01-07 2:42 ` Michael Breuer 2010-01-07 4:00 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 2:42 UTC (permalink / raw) To: Stephen Hemminger Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On 1/6/2010 6:26 PM, Michael Breuer wrote: > On 1/6/2010 4:10 PM, Stephen Hemminger wrote: >> On Wed, 06 Jan 2010 14:49:38 -0500 >> Michael Breuer<mbreuer@majjas.com> wrote: >> >>> This patch at first behaved similarly to the previous one - seemed >>> to be >>> running a bit better... until the adapter went down :( >>> >>> This is the syslog output at the time the network failed: >>> Jan 6 14:11:01 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x40000008 >>> Jan 6 14:11:01 mail kernel: sky2 software interrupt status 0x40000008 >> Could you go back to baseline sky2 driver. The display code might be >> buggy. >> These bits indicate an error in the MAC. The interrupt source enabled >> is Transmit FIFO underrun. >> >> Looking at how vendor driver handles this. >> It looks like the Yukon EC_U chip doesn't really do Jumbo frames >> correctly. >> Maybe not enough internal buffering to ensure that the whole packet >> is in the chip. Of course, none of this is in the chip manual. >> >> Does this help >> -------------- >> --- a/drivers/net/sky2.c 2010-01-06 12:48:43.012318966 -0800 >> +++ b/drivers/net/sky2.c 2010-01-06 13:05:31.273987255 -0800 >> @@ -792,33 +792,21 @@ static void sky2_set_tx_stfwd(struct sky >> { >> struct net_device *dev = hw->dev[port]; >> >> - if ( (hw->chip_id == CHIP_ID_YUKON_EX&& >> - hw->chip_rev != CHIP_REV_YU_EX_A0) || >> - hw->chip_id>= CHIP_ID_YUKON_FE_P) { >> - /* Yukon-Extreme B0 and further Extreme devices */ >> - /* enable Store& Forward mode for TX */ >> - >> - if (dev->mtu<= ETH_DATA_LEN) >> - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), >> - TX_JUMBO_DIS | TX_STFW_ENA); >> - >> - else >> - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), >> - TX_JUMBO_ENA| TX_STFW_ENA); >> - } else { >> - if (dev->mtu<= ETH_DATA_LEN) >> - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); >> - else { >> - /* set Tx GMAC FIFO Almost Empty Threshold */ >> - sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), >> - (ECU_JUMBO_WM<< 16) | ECU_AE_THR); >> - >> - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_DIS); >> - >> - /* Can't do offload because of lack of store/forward */ >> - dev->features&= ~(NETIF_F_TSO | NETIF_F_SG | >> NETIF_F_ALL_CSUM); >> - } >> - } >> + if ( (hw->chip_id == CHIP_ID_YUKON_EX&& hw->chip_rev != >> CHIP_REV_YU_EX_A0) || >> + hw->chip_id>= CHIP_ID_YUKON_FE_P) { >> + /* Yukon-Extreme B0 and further Extreme devices */ >> + /* enable Store& Forward mode for TX */ >> + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); >> + } else if (dev->mtu> ETH_DATA_LEN) { >> + /* set Tx GMAC FIFO Almost Empty Threshold */ >> + sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), >> + (ECU_JUMBO_WM<< 16) | ECU_AE_THR); >> + /* disable Store& Forward mode for TX */ >> + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_DIS); >> + } else { >> + /* enable Store& Forward mode for TX */ >> + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); >> + } >> } >> >> static void sky2_mac_init(struct sky2_hw *hw, unsigned port) >> @@ -2185,11 +2173,16 @@ static int sky2_change_mtu(struct net_de >> if (new_mtu< ETH_ZLEN || new_mtu> ETH_JUMBO_MTU) >> return -EINVAL; >> >> + /* MTU> 1500 on yukon FE and FE+ not allowed */ >> if (new_mtu> ETH_DATA_LEN&& >> (hw->chip_id == CHIP_ID_YUKON_FE || >> hw->chip_id == CHIP_ID_YUKON_FE_P)) >> return -EINVAL; >> >> + /* TSO on Yukon Ultra and MTU> 1500 not supported */ >> + if (new_mtu> ETH_DATA_LEN&& hw->chip_id == CHIP_ID_YUKON_EC_U) >> + dev->features&= ~NETIF_F_TSO; >> + >> if (!netif_running(dev)) { >> dev->mtu = new_mtu; >> return 0; >> @@ -2233,6 +2226,15 @@ static int sky2_change_mtu(struct net_de >> if (err) >> dev_close(dev); >> else { >> + /* WA for dev. #4.209 */ >> + if (hw->chip_id == CHIP_ID_YUKON_EC_U&& >> + hw->chip_rev == CHIP_REV_YU_EC_U_A1) { >> + /* enable/disable Store& Forward mode for TX */ >> + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), >> + sky2->speed != SPEED_1000 >> + ? TX_STFW_ENA : TX_STFW_DIS); >> + } >> + >> gma_write16(hw, port, GM_GP_CTRL, ctl); >> >> netif_wake_queue(dev); >> --- a/drivers/net/sky2.h 2010-01-06 12:48:48.632247424 -0800 >> +++ b/drivers/net/sky2.h 2010-01-06 12:59:57.322078964 -0800 >> @@ -1901,8 +1901,8 @@ enum { >> TX_VLAN_TAG_ON = 1<<25,/* enable VLAN tagging */ >> TX_VLAN_TAG_OFF = 1<<24,/* disable VLAN tagging */ >> >> - TX_JUMBO_ENA = 1<<23,/* PCI Jumbo Mode enable (Yukon-EC >> Ultra) */ >> - TX_JUMBO_DIS = 1<<22,/* PCI Jumbo Mode enable (Yukon-EC >> Ultra) */ >> + TX_PCI_JUM_ENA = 1<<23,/* Enable PCI Jumbo Mode (Yukon-EC >> Ultra) */ >> + TX_PCI_JUM_DIS = 1<<22,/* Disable PCI Jumbo Mode (Yukon-EC >> Ultra) */ >> >> GMF_WSP_TST_ON = 1<<18,/* Write Shadow Pointer Test On */ >> GMF_WSP_TST_OFF = 1<<17,/* Write Shadow Pointer Test Off */ > Ok ... results - and maybe some more clues... > > Running with this patch; Jarek's "alternative 1", and the patch from > the other thread. Not so good. > > No reported errors (sky2, etc.) - however with mtu=9000, lots of stuff > broke: XDMCP; http via MASQ/netfilter, ssh connections intermittently > (when large frames involved perhaps), etc. Tried to change mtu to 1500 > on the fly, got a bunch of errors - and network watchdog kicked in. > Have now rebooted with the same patches and mtu=1500. > ... with mtu=1500, Everything is again working (i.e., XDMCP, > netfilter, etc.) > Load test with mtu=1500 went well for a while - high throughput > sustained for a few minutes - then similar crash as before... but no > interrup error messages this time until after the oops: > <nothing of note before this> > Jan 6 18:17:54 mail kernel: DRHD: handling fault status reg 2 > Jan 6 18:17:54 mail kernel: DMAR:[DMA Read] Request device [06:00.0] > fault addr 1bbfe000 > Jan 6 18:17:54 mail kernel: DMAR:[fault reason 06] PTE Read access is > not set > Jan 6 18:17:54 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x80000000 > Jan 6 18:17:54 mail kernel: sky2 0000:06:00.0: PCI hardware error > (0x2010) > Jan 6 18:18:04 mail kernel: ------------[ cut here ]------------ > Jan 6 18:18:04 mail kernel: WARNING: at net/sched/sch_generic.c:261 > dev_watchdog+0xf3/0x164() > Jan 6 18:18:04 mail kernel: Hardware name: System Product Name > Jan 6 18:18:04 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit > queue 0 timed out > Jan 6 18:18:04 mail kernel: Modules linked in: ip6table_filter > ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat nf_nat > iptable_mangle iptable_raw bridge stp appletalk psnap llc nfsd lockd > nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq > sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP > xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath > kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi > snd_ac97_codec snd_hda_intel snd_hda_codec ac97_bus snd_hwdep snd_seq > snd_seq_device gspca_spca505 gspca_main videodev v4l1_compat snd_pcm > v4l2_compat_ioctl32 pcspkr asus_atk0110 hwmon i2c_i801 iTCO_wdt > firewire_ohci iTCO_vendor_support firewire_core crc_itu_t snd_timer > snd sky2 soundcore wmi snd_page_alloc fbcon tileblit font bitblit > softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor > async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau > ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core > cfbimgblt cfbfil > Jan 6 18:18:04 mail kernel: lrect [last unloaded: microcode] > Jan 6 18:18:04 mail kernel: Pid: 0, comm: swapper Tainted: G > W 2.6.32-00840-gec8257c-dirty #41 > Jan 6 18:18:04 mail kernel: Call Trace: > Jan 6 18:18:04 mail kernel: <IRQ> [<ffffffff8105365a>] > warn_slowpath_common+0x7c/0x94 > Jan 6 18:18:04 mail kernel: [<ffffffff810536c9>] > warn_slowpath_fmt+0x41/0x43 > Jan 6 18:18:04 mail kernel: [<ffffffff813e12bf>] ? > netif_tx_lock+0x44/0x6c > Jan 6 18:18:04 mail kernel: [<ffffffff813e1427>] dev_watchdog+0xf3/0x164 > Jan 6 18:18:04 mail kernel: [<ffffffff81077696>] ? > sched_clock_cpu+0x47/0xd1 > Jan 6 18:18:04 mail kernel: [<ffffffff8106316b>] > run_timer_softirq+0x1c8/0x270 > Jan 6 18:18:04 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd > Jan 6 18:18:04 mail kernel: [<ffffffff8107ef33>] ? > tick_program_event+0x2a/0x2c > Jan 6 18:18:04 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 > Jan 6 18:18:04 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 > Jan 6 18:18:04 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c > Jan 6 18:18:04 mail kernel: [<ffffffff8146dd32>] > smp_apic_timer_interrupt+0x86/0x94 > Jan 6 18:18:04 mail kernel: [<ffffffff810127e3>] > apic_timer_interrupt+0x13/0x20 > Jan 6 18:18:04 mail kernel: <EOI> [<ffffffff812c4a06>] ? > acpi_idle_enter_c1+0xb2/0xd0 > Jan 6 18:18:04 mail kernel: [<ffffffff812c49ff>] ? > acpi_idle_enter_c1+0xab/0xd0 > Jan 6 18:18:04 mail kernel: [<ffffffff813a43b8>] ? > cpuidle_idle_call+0x9e/0xfa > Jan 6 18:18:04 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 > Jan 6 18:18:04 mail kernel: [<ffffffff81463312>] ? > start_secondary+0x201/0x242 > Jan 6 18:18:04 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- > Jan 6 18:18:04 mail kernel: sky2 eth0: tx timeout > Jan 6 18:18:04 mail kernel: sky2 eth0: transmit ring 21 .. 108 > report=21 done=21 > Jan 6 18:18:04 mail kernel: sky2 eth0: disabling interface > Jan 6 18:18:04 mail kernel: sky2 eth0: enabling interface > <eth0 dead after this> Walked through the code based on Jarek's patches... came upon NET_CLS_ACT. At least in some cases (sch_cbq.c for example), the net transmit error could be returned from here... after releasing the skb. A quick scan of the various files in net/sched suggests that with NET_CLS_ACT the skb may or may not have been freed in the event of an error. If I have time later I'll see if I can bypass NET_CLS_ACT and see whether this is even relevant. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 2:42 ` Michael Breuer @ 2010-01-07 4:00 ` Michael Breuer 2010-01-07 4:53 ` Stephen Hemminger 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 4:00 UTC (permalink / raw) To: Stephen Hemminger Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On 1/6/2010 9:42 PM, Michael Breuer wrote: > On 1/6/2010 6:26 PM, Michael Breuer wrote: >> On 1/6/2010 4:10 PM, Stephen Hemminger wrote: >>> On Wed, 06 Jan 2010 14:49:38 -0500 >>> Michael Breuer<mbreuer@majjas.com> wrote: >>> >>>> This patch at first behaved similarly to the previous one - seemed >>>> to be >>>> running a bit better... until the adapter went down :( >>>> >>>> This is the syslog output at the time the network failed: >>>> Jan 6 14:11:01 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x40000008 >>>> Jan 6 14:11:01 mail kernel: sky2 software interrupt status 0x40000008 >>> Could you go back to baseline sky2 driver. The display code might >>> be buggy. >>> These bits indicate an error in the MAC. The interrupt source enabled >>> is Transmit FIFO underrun. >>> >>> Looking at how vendor driver handles this. >>> It looks like the Yukon EC_U chip doesn't really do Jumbo frames >>> correctly. >>> Maybe not enough internal buffering to ensure that the whole packet >>> is in the chip. Of course, none of this is in the chip manual. >>> >>> Does this help >>> -------------- >>> --- a/drivers/net/sky2.c 2010-01-06 12:48:43.012318966 -0800 >>> +++ b/drivers/net/sky2.c 2010-01-06 13:05:31.273987255 -0800 >>> @@ -792,33 +792,21 @@ static void sky2_set_tx_stfwd(struct sky >>> { >>> struct net_device *dev = hw->dev[port]; >>> >>> - if ( (hw->chip_id == CHIP_ID_YUKON_EX&& >>> - hw->chip_rev != CHIP_REV_YU_EX_A0) || >>> - hw->chip_id>= CHIP_ID_YUKON_FE_P) { >>> - /* Yukon-Extreme B0 and further Extreme devices */ >>> - /* enable Store& Forward mode for TX */ >>> - >>> - if (dev->mtu<= ETH_DATA_LEN) >>> - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), >>> - TX_JUMBO_DIS | TX_STFW_ENA); >>> - >>> - else >>> - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), >>> - TX_JUMBO_ENA| TX_STFW_ENA); >>> - } else { >>> - if (dev->mtu<= ETH_DATA_LEN) >>> - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), >>> TX_STFW_ENA); >>> - else { >>> - /* set Tx GMAC FIFO Almost Empty Threshold */ >>> - sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), >>> - (ECU_JUMBO_WM<< 16) | ECU_AE_THR); >>> - >>> - sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), >>> TX_STFW_DIS); >>> - >>> - /* Can't do offload because of lack of store/forward */ >>> - dev->features&= ~(NETIF_F_TSO | NETIF_F_SG | >>> NETIF_F_ALL_CSUM); >>> - } >>> - } >>> + if ( (hw->chip_id == CHIP_ID_YUKON_EX&& hw->chip_rev != >>> CHIP_REV_YU_EX_A0) || >>> + hw->chip_id>= CHIP_ID_YUKON_FE_P) { >>> + /* Yukon-Extreme B0 and further Extreme devices */ >>> + /* enable Store& Forward mode for TX */ >>> + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); >>> + } else if (dev->mtu> ETH_DATA_LEN) { >>> + /* set Tx GMAC FIFO Almost Empty Threshold */ >>> + sky2_write32(hw, SK_REG(port, TX_GMF_AE_THR), >>> + (ECU_JUMBO_WM<< 16) | ECU_AE_THR); >>> + /* disable Store& Forward mode for TX */ >>> + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_DIS); >>> + } else { >>> + /* enable Store& Forward mode for TX */ >>> + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), TX_STFW_ENA); >>> + } >>> } >>> >>> static void sky2_mac_init(struct sky2_hw *hw, unsigned port) >>> @@ -2185,11 +2173,16 @@ static int sky2_change_mtu(struct net_de >>> if (new_mtu< ETH_ZLEN || new_mtu> ETH_JUMBO_MTU) >>> return -EINVAL; >>> >>> + /* MTU> 1500 on yukon FE and FE+ not allowed */ >>> if (new_mtu> ETH_DATA_LEN&& >>> (hw->chip_id == CHIP_ID_YUKON_FE || >>> hw->chip_id == CHIP_ID_YUKON_FE_P)) >>> return -EINVAL; >>> >>> + /* TSO on Yukon Ultra and MTU> 1500 not supported */ >>> + if (new_mtu> ETH_DATA_LEN&& hw->chip_id == CHIP_ID_YUKON_EC_U) >>> + dev->features&= ~NETIF_F_TSO; >>> + >>> if (!netif_running(dev)) { >>> dev->mtu = new_mtu; >>> return 0; >>> @@ -2233,6 +2226,15 @@ static int sky2_change_mtu(struct net_de >>> if (err) >>> dev_close(dev); >>> else { >>> + /* WA for dev. #4.209 */ >>> + if (hw->chip_id == CHIP_ID_YUKON_EC_U&& >>> + hw->chip_rev == CHIP_REV_YU_EC_U_A1) { >>> + /* enable/disable Store& Forward mode for TX */ >>> + sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T), >>> + sky2->speed != SPEED_1000 >>> + ? TX_STFW_ENA : TX_STFW_DIS); >>> + } >>> + >>> gma_write16(hw, port, GM_GP_CTRL, ctl); >>> >>> netif_wake_queue(dev); >>> --- a/drivers/net/sky2.h 2010-01-06 12:48:48.632247424 -0800 >>> +++ b/drivers/net/sky2.h 2010-01-06 12:59:57.322078964 -0800 >>> @@ -1901,8 +1901,8 @@ enum { >>> TX_VLAN_TAG_ON = 1<<25,/* enable VLAN tagging */ >>> TX_VLAN_TAG_OFF = 1<<24,/* disable VLAN tagging */ >>> >>> - TX_JUMBO_ENA = 1<<23,/* PCI Jumbo Mode enable (Yukon-EC >>> Ultra) */ >>> - TX_JUMBO_DIS = 1<<22,/* PCI Jumbo Mode enable (Yukon-EC >>> Ultra) */ >>> + TX_PCI_JUM_ENA = 1<<23,/* Enable PCI Jumbo Mode (Yukon-EC >>> Ultra) */ >>> + TX_PCI_JUM_DIS = 1<<22,/* Disable PCI Jumbo Mode (Yukon-EC >>> Ultra) */ >>> >>> GMF_WSP_TST_ON = 1<<18,/* Write Shadow Pointer Test On */ >>> GMF_WSP_TST_OFF = 1<<17,/* Write Shadow Pointer Test Off */ >> Ok ... results - and maybe some more clues... >> >> Running with this patch; Jarek's "alternative 1", and the patch from >> the other thread. Not so good. >> >> No reported errors (sky2, etc.) - however with mtu=9000, lots of >> stuff broke: XDMCP; http via MASQ/netfilter, ssh connections >> intermittently (when large frames involved perhaps), etc. Tried to >> change mtu to 1500 on the fly, got a bunch of errors - and network >> watchdog kicked in. Have now rebooted with the same patches and >> mtu=1500. >> ... with mtu=1500, Everything is again working (i.e., XDMCP, >> netfilter, etc.) >> Load test with mtu=1500 went well for a while - high throughput >> sustained for a few minutes - then similar crash as before... but no >> interrup error messages this time until after the oops: >> <nothing of note before this> >> Jan 6 18:17:54 mail kernel: DRHD: handling fault status reg 2 >> Jan 6 18:17:54 mail kernel: DMAR:[DMA Read] Request device [06:00.0] >> fault addr 1bbfe000 >> Jan 6 18:17:54 mail kernel: DMAR:[fault reason 06] PTE Read access >> is not set >> Jan 6 18:17:54 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x80000000 >> Jan 6 18:17:54 mail kernel: sky2 0000:06:00.0: PCI hardware error >> (0x2010) >> Jan 6 18:18:04 mail kernel: ------------[ cut here ]------------ >> Jan 6 18:18:04 mail kernel: WARNING: at net/sched/sch_generic.c:261 >> dev_watchdog+0xf3/0x164() >> Jan 6 18:18:04 mail kernel: Hardware name: System Product Name >> Jan 6 18:18:04 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit >> queue 0 timed out >> Jan 6 18:18:04 mail kernel: Modules linked in: ip6table_filter >> ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat nf_nat >> iptable_mangle iptable_raw bridge stp appletalk psnap llc nfsd lockd >> nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq >> sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP >> xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath >> kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi >> snd_ac97_codec snd_hda_intel snd_hda_codec ac97_bus snd_hwdep snd_seq >> snd_seq_device gspca_spca505 gspca_main videodev v4l1_compat snd_pcm >> v4l2_compat_ioctl32 pcspkr asus_atk0110 hwmon i2c_i801 iTCO_wdt >> firewire_ohci iTCO_vendor_support firewire_core crc_itu_t snd_timer >> snd sky2 soundcore wmi snd_page_alloc fbcon tileblit font bitblit >> softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor >> async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell >> nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea >> i2c_core cfbimgblt cfbfil >> Jan 6 18:18:04 mail kernel: lrect [last unloaded: microcode] >> Jan 6 18:18:04 mail kernel: Pid: 0, comm: swapper Tainted: G >> W 2.6.32-00840-gec8257c-dirty #41 >> Jan 6 18:18:04 mail kernel: Call Trace: >> Jan 6 18:18:04 mail kernel: <IRQ> [<ffffffff8105365a>] >> warn_slowpath_common+0x7c/0x94 >> Jan 6 18:18:04 mail kernel: [<ffffffff810536c9>] >> warn_slowpath_fmt+0x41/0x43 >> Jan 6 18:18:04 mail kernel: [<ffffffff813e12bf>] ? >> netif_tx_lock+0x44/0x6c >> Jan 6 18:18:04 mail kernel: [<ffffffff813e1427>] >> dev_watchdog+0xf3/0x164 >> Jan 6 18:18:04 mail kernel: [<ffffffff81077696>] ? >> sched_clock_cpu+0x47/0xd1 >> Jan 6 18:18:04 mail kernel: [<ffffffff8106316b>] >> run_timer_softirq+0x1c8/0x270 >> Jan 6 18:18:04 mail kernel: [<ffffffff8105ae3b>] >> __do_softirq+0xf8/0x1cd >> Jan 6 18:18:04 mail kernel: [<ffffffff8107ef33>] ? >> tick_program_event+0x2a/0x2c >> Jan 6 18:18:04 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 >> Jan 6 18:18:04 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 >> Jan 6 18:18:04 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c >> Jan 6 18:18:04 mail kernel: [<ffffffff8146dd32>] >> smp_apic_timer_interrupt+0x86/0x94 >> Jan 6 18:18:04 mail kernel: [<ffffffff810127e3>] >> apic_timer_interrupt+0x13/0x20 >> Jan 6 18:18:04 mail kernel: <EOI> [<ffffffff812c4a06>] ? >> acpi_idle_enter_c1+0xb2/0xd0 >> Jan 6 18:18:04 mail kernel: [<ffffffff812c49ff>] ? >> acpi_idle_enter_c1+0xab/0xd0 >> Jan 6 18:18:04 mail kernel: [<ffffffff813a43b8>] ? >> cpuidle_idle_call+0x9e/0xfa >> Jan 6 18:18:04 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 >> Jan 6 18:18:04 mail kernel: [<ffffffff81463312>] ? >> start_secondary+0x201/0x242 >> Jan 6 18:18:04 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- >> Jan 6 18:18:04 mail kernel: sky2 eth0: tx timeout >> Jan 6 18:18:04 mail kernel: sky2 eth0: transmit ring 21 .. 108 >> report=21 done=21 >> Jan 6 18:18:04 mail kernel: sky2 eth0: disabling interface >> Jan 6 18:18:04 mail kernel: sky2 eth0: enabling interface >> <eth0 dead after this> > Walked through the code based on Jarek's patches... came upon > NET_CLS_ACT. At least in some cases (sch_cbq.c for example), the net > transmit error could be returned from here... after releasing the skb. > A quick scan of the various files in net/sched suggests that with > NET_CLS_ACT the skb may or may not have been freed in the event of an > error. If I have time later I'll see if I can bypass NET_CLS_ACT and > see whether this is even relevant. Ok - so rerunning with Jarek's alternative #2 patch (the one that doesn't re-free the skb after the net_dev_enqueue error) and all else as above (your mtu patch on a clean 2.6.32.2 sky2.c, plus the devtpts inode patch) with an MTU of 1500 I can run without errors (including the interrupt errors previously reported). Changing MTU to 9000, everything basically breaks - Can't use X11 (local or remote - get X11 screen after gdm login locally, but then goes back to greeter; remote gets no greeter); ssh sessions hang; etc. This time I was able to reset the MTU back to 1500 without a reboot - but I did have to ifconfig eth0 down and then up. Looking at the sk98lin code, it looks to me like they do a bit more work with existing buffers before completing the MTU switch. Note that even doing this, X11 did not work (it did with the old mtu change code). Tried changing to mtu 4500 - same effect as 9000... but when I switched back to 1500, ksoftirqd started spinning using 100% of one core. Running with these patches and 1500 MTU seems stable, but the network is running rather slow. Latency is OK, but peak transfer rates seem to be running about 20% of what I saw while receiving the interrupt errors with earlier patches. I'll leave this running overnight to confirm that at least the errors and hangs are resolved with this patch set. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 4:00 ` Michael Breuer @ 2010-01-07 4:53 ` Stephen Hemminger 2010-01-07 5:10 ` Michael Breuer 2010-01-07 5:32 ` Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Stephen Hemminger @ 2010-01-07 4:53 UTC (permalink / raw) To: Michael Breuer Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On Wed, 06 Jan 2010 23:00:34 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > Changing MTU to 9000, everything basically breaks - Can't use X11 (local > or remote - get X11 screen after gdm login locally, but then goes back > to greeter; remote gets no greeter); ssh sessions hang; etc. This time I > was able to reset the MTU back to 1500 without a reboot - but I did have > to ifconfig eth0 down and then up. Looking at the sk98lin code, it looks > to me like they do a bit more work with existing buffers before > completing the MTU switch. Note that even doing this, X11 did not work > (it did with the old mtu change code). Tried changing to mtu 4500 - same > effect as 9000... but when I switched back to 1500, ksoftirqd started > spinning using 100% of one core. The problem is that patch was enabling scatter-gather and checksum offload that won't work on EC_U hardware with 9K MTU. At least, it never worked for me when I tested it. So because of that it really doesn't change anything for the better on that chip version. What version chip is on that motherboard? Mine is: Yukon-2 EC Ultra chip revision 3 which corresponds to B0 step. Another possibility is the PHY register which controls number of ticks of buffering. The default is zero, which gives the most buffering (good), but the firmware could be reprogramming it (bad). In general, the driver doesn't fiddle with bits that are already set correctly, because sometimes vendors need to tweak PCI timing in firmware/BIOS. It seems the firmware on this chip is just a bunch of register setups done on power on. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 4:53 ` Stephen Hemminger @ 2010-01-07 5:10 ` Michael Breuer 2010-01-07 5:32 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-07 5:10 UTC (permalink / raw) To: Stephen Hemminger Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On 1/6/2010 11:53 PM, Stephen Hemminger wrote: > On Wed, 06 Jan 2010 23:00:34 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> Changing MTU to 9000, everything basically breaks - Can't use X11 (local >> or remote - get X11 screen after gdm login locally, but then goes back >> to greeter; remote gets no greeter); ssh sessions hang; etc. This time I >> was able to reset the MTU back to 1500 without a reboot - but I did have >> to ifconfig eth0 down and then up. Looking at the sk98lin code, it looks >> to me like they do a bit more work with existing buffers before >> completing the MTU switch. Note that even doing this, X11 did not work >> (it did with the old mtu change code). Tried changing to mtu 4500 - same >> effect as 9000... but when I switched back to 1500, ksoftirqd started >> spinning using 100% of one core. >> > The problem is that patch was enabling scatter-gather and checksum offload > that won't work on EC_U hardware with 9K MTU. At least, it never worked > for me when I tested it. So because of that it really doesn't change anything > for the better on that chip version. > > What version chip is on that motherboard? Mine is: > Yukon-2 EC Ultra chip revision 3 > which corresponds to B0 step. > > Another possibility is the PHY register which controls number of ticks > of buffering. The default is zero, which gives the most buffering (good), > but the firmware could be reprogramming it (bad). In general, the driver > doesn't fiddle with bits that are already set correctly, because sometimes > vendors need to tweak PCI timing in firmware/BIOS. It seems the firmware on this > chip is just a bunch of register setups done on power on. > So at this point, things are working as mentioned - but really slow... at least an order of magnitude slower than with the other set of patches. The other set also generated errors and was not stable :( But, that set worked with mtu=9000, this set doesn't seem to work with anything over 1500. The slowdown may also be (based on earlier testing) attributable to Jarek's alternative #2 patch. As to the chip, I *think* we have the same chip - I'm including lspci -vv - perhaps there is something different. 06:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 14) Subsystem: Giga-byte Technology Device e000 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 256 bytes Interrupt: pin A routed to IRQ 58 Region 0: Memory at fbdfc000 (64-bit, non-prefetchable) [size=16K] Region 2: I/O ports at d800 [size=256] Expansion ROM at fbdc0000 [disabled] [size=128K] Capabilities: [48] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- Capabilities: [50] Vital Product Data Product Name: Marvell Yukon 88E8056 Gigabit Ethernet Controller Read-only fields: [PN] Part number: Yukon 88E8056 [EC] Engineering changes: Rev. 1.4 [MN] Manufacture ID: 4d 61 72 76 65 6c 6c [SN] Serial number: AbCdEfG001C3B [CP] Extended capability: 01 10 cc 03 [RV] Reserved: checksum good, 9 byte(s) reserved Read/write fields: [RW] Read-write area: 121 byte(s) free End Capabilities: [5c] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee00458 Data: 0000 Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <256ns, L1 unlimited ClockPM+ Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [100] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 1f, GenCap- CGenEn- ChkCap- ChkEn- Kernel driver in use: sky2 Kernel modules: sky2 ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 4:53 ` Stephen Hemminger 2010-01-07 5:10 ` Michael Breuer @ 2010-01-07 5:32 ` Michael Breuer 2010-01-07 5:54 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 5:32 UTC (permalink / raw) To: Stephen Hemminger Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On 1/6/2010 11:53 PM, Stephen Hemminger wrote: > On Wed, 06 Jan 2010 23:00:34 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> Changing MTU to 9000, everything basically breaks - Can't use X11 (local >> or remote - get X11 screen after gdm login locally, but then goes back >> to greeter; remote gets no greeter); ssh sessions hang; etc. This time I >> was able to reset the MTU back to 1500 without a reboot - but I did have >> to ifconfig eth0 down and then up. Looking at the sk98lin code, it looks >> to me like they do a bit more work with existing buffers before >> completing the MTU switch. Note that even doing this, X11 did not work >> (it did with the old mtu change code). Tried changing to mtu 4500 - same >> effect as 9000... but when I switched back to 1500, ksoftirqd started >> spinning using 100% of one core. >> > The problem is that patch was enabling scatter-gather and checksum offload > that won't work on EC_U hardware with 9K MTU. At least, it never worked > for me when I tested it. So because of that it really doesn't change anything > for the better on that chip version. > > What version chip is on that motherboard? Mine is: > Yukon-2 EC Ultra chip revision 3 > which corresponds to B0 step. > > Another possibility is the PHY register which controls number of ticks > of buffering. The default is zero, which gives the most buffering (good), > but the firmware could be reprogramming it (bad). In general, the driver > doesn't fiddle with bits that are already set correctly, because sometimes > vendors need to tweak PCI timing in firmware/BIOS. It seems the firmware on this > chip is just a bunch of register setups done on power on. > Also - I'm seeing a huge number of dropped packets (RX) 200-300/second. Probably why this is so slow. Current ifconfig: eth0 Link encap:Ethernet HWaddr 00:26:18:00:1C:3B inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0 inet6 addr: fe80::226:18ff:fe00:1c3b/64 Scope:Link UP BROADCAST RUNNING ALLMULTI MULTICAST MTU:1500 Metric:1 RX packets:26647536 errors:0 dropped:517884 overruns:0 frame:0 TX packets:12112780 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:38960063319 (36.2 GiB) TX bytes:1889879762 (1.7 GiB) Interrupt:18 ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 5:32 ` Michael Breuer @ 2010-01-07 5:54 ` Michael Breuer 2010-01-07 7:20 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 5:54 UTC (permalink / raw) To: Stephen Hemminger Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 12:32 AM, Michael Breuer wrote: > On 1/6/2010 11:53 PM, Stephen Hemminger wrote: >> On Wed, 06 Jan 2010 23:00:34 -0500 >> Michael Breuer<mbreuer@majjas.com> wrote: >> >>> Changing MTU to 9000, everything basically breaks - Can't use X11 >>> (local >>> or remote - get X11 screen after gdm login locally, but then goes back >>> to greeter; remote gets no greeter); ssh sessions hang; etc. This >>> time I >>> was able to reset the MTU back to 1500 without a reboot - but I did >>> have >>> to ifconfig eth0 down and then up. Looking at the sk98lin code, it >>> looks >>> to me like they do a bit more work with existing buffers before >>> completing the MTU switch. Note that even doing this, X11 did not work >>> (it did with the old mtu change code). Tried changing to mtu 4500 - >>> same >>> effect as 9000... but when I switched back to 1500, ksoftirqd started >>> spinning using 100% of one core. >> The problem is that patch was enabling scatter-gather and checksum >> offload >> that won't work on EC_U hardware with 9K MTU. At least, it never worked >> for me when I tested it. So because of that it really doesn't change >> anything >> for the better on that chip version. >> >> What version chip is on that motherboard? Mine is: >> Yukon-2 EC Ultra chip revision 3 >> which corresponds to B0 step. >> >> Another possibility is the PHY register which controls number of ticks >> of buffering. The default is zero, which gives the most buffering >> (good), >> but the firmware could be reprogramming it (bad). In general, the >> driver >> doesn't fiddle with bits that are already set correctly, because >> sometimes >> vendors need to tweak PCI timing in firmware/BIOS. It seems the >> firmware on this >> chip is just a bunch of register setups done on power on. > Also - I'm seeing a huge number of dropped packets (RX) > 200-300/second. Probably why this is so slow. > > Current ifconfig: > eth0 Link encap:Ethernet HWaddr 00:26:18:00:1C:3B > inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0 > inet6 addr: fe80::226:18ff:fe00:1c3b/64 Scope:Link > UP BROADCAST RUNNING ALLMULTI MULTICAST MTU:1500 Metric:1 > RX packets:26647536 errors:0 dropped:517884 overruns:0 frame:0 > TX packets:12112780 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:38960063319 (36.2 GiB) TX bytes:1889879762 (1.7 GiB) > Interrupt:18 > > > Never mind... spoke too soon. Crashed again. Just took longer: Jan 7 00:37:39 mail kernel: DRHD: handling fault status reg 2 Jan 7 00:37:39 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr fff1401fe000 Jan 7 00:37:39 mail kernel: DMAR:[fault reason 06] PTE Read access is not set Jan 7 00:37:39 mail kernel: sky2 0000:06:00.0: error interrupt status=0x80000000 Jan 7 00:37:39 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) Jan 7 00:37:40 mail smbd[4729]: [2010/01/07 00:37:40, 0] lib/util_sock.c:539(read_fd_with_timeout) Jan 7 00:37:40 mail smbd[4729]: [2010/01/07 00:37:40, 0] lib/util_sock.c:1491(get_peer_addr_internal) Jan 7 00:37:40 mail smbd[4729]: getpeername failed. Error was Transport endpoint is not connected Jan 7 00:37:40 mail smbd[4729]: read_fd_with_timeout: client 0.0.0.0 read error = Connection timed out. Jan 7 00:37:40 mail dhcpd: DHCPREQUEST for 10.0.0.32 from 00:26:bb:aa:15:10 (mbitouch) via eth0 Jan 7 00:37:40 mail dhcpd: DHCPACK on 10.0.0.32 to 00:26:bb:aa:15:10 (mbitouch) via eth0 Jan 7 00:37:41 mail dhcpd: DHCPREQUEST for 10.0.0.32 from 00:26:bb:aa:15:10 (mbitouch) via eth0 Jan 7 00:37:41 mail dhcpd: DHCPACK on 10.0.0.32 to 00:26:bb:aa:15:10 (mbitouch) via eth0 Jan 7 00:37:44 mail dhcpd: DHCPREQUEST for 10.0.0.32 from 00:26:bb:aa:15:10 (mbitouch) via eth0 Jan 7 00:37:44 mail dhcpd: DHCPACK on 10.0.0.32 to 00:26:bb:aa:15:10 (mbitouch) via eth0 Jan 7 00:37:47 mail kernel: ------------[ cut here ]------------ Jan 7 00:37:47 mail kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164() Jan 7 00:37:47 mail kernel: Hardware name: System Product Name Jan 7 00:37:47 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out Jan 7 00:37:47 mail kernel: Modules linked in: ip6table_filter ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat nf_nat iptable_mangle iptable_raw bridge stp appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_ens1371 gameport snd_rawmidi gspca_spca505 gspca_main pcspkr snd_ac97_codec snd_hwdep i2c_i801 snd_seq firewire_ohci videodev v4l1_compat snd_seq_device v4l2_compat_ioctl32 ac97_bus firewire_core crc_itu_t iTCO_wdt snd_pcm iTCO_vendor_support wmi snd_timer snd sky2 asus_atk0110 hwmon soundcore snd_page_alloc fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfil Jan 7 00:37:47 mail kernel: lrect [last unloaded: microcode] Jan 7 00:37:47 mail kernel: Pid: 0, comm: swapper Tainted: G W 2.6.32-00840-gec8257c-dirty #44 Jan 7 00:37:47 mail kernel: Call Trace: Jan 7 00:37:47 mail kernel: <IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94 Jan 7 00:37:47 mail kernel: [<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43 Jan 7 00:37:47 mail kernel: [<ffffffff813e12bf>] ? netif_tx_lock+0x44/0x6c Jan 7 00:37:47 mail kernel: [<ffffffff813e1427>] dev_watchdog+0xf3/0x164 Jan 7 00:37:47 mail kernel: [<ffffffff8105f320>] ? cascade+0x6a/0x84 Jan 7 00:37:47 mail kernel: [<ffffffff8106316b>] run_timer_softirq+0x1c8/0x270 Jan 7 00:37:47 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd Jan 7 00:37:47 mail kernel: [<ffffffff8107ef33>] ? tick_program_event+0x2a/0x2c Jan 7 00:37:47 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 Jan 7 00:37:47 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 Jan 7 00:37:47 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c Jan 7 00:37:47 mail kernel: [<ffffffff8146dd32>] smp_apic_timer_interrupt+0x86/0x94 Jan 7 00:37:47 mail kernel: [<ffffffff810127e3>] apic_timer_interrupt+0x13/0x20 Jan 7 00:37:47 mail kernel: <EOI> [<ffffffff812c4c7a>] ? acpi_idle_enter_bm+0x256/0x28a Jan 7 00:37:47 mail kernel: [<ffffffff812c4c73>] ? acpi_idle_enter_bm+0x24f/0x28a Jan 7 00:37:47 mail kernel: [<ffffffff813a43b8>] ? cpuidle_idle_call+0x9e/0xfa Jan 7 00:37:47 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 Jan 7 00:37:47 mail kernel: [<ffffffff81463312>] ? start_secondary+0x201/0x242 Jan 7 00:37:47 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- Jan 7 00:37:47 mail kernel: sky2 eth0: tx timeout Jan 7 00:37:47 mail kernel: sky2 eth0: transmit ring 79 .. 39 report=79 done=79 Jan 7 00:37:47 mail kernel: sky2 eth0: disabling interface Jan 7 00:37:47 mail kernel: sky2 eth0: enabling interface Jan 7 00:37:51 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both JJan 7 00:38:07 mail kernel: sky2 eth0: tx timeout Jan 7 00:38:07 mail kernel: sky2 eth0: transmit ring 3 .. 90 report=3 done=3 Jan 7 00:38:07 mail kernel: sky2 eth0: disabling interface Jan 7 00:38:07 mail kernel: sky2 eth0: enabling interface and so on. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 5:54 ` Michael Breuer @ 2010-01-07 7:20 ` Michael Breuer 2010-01-07 7:47 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 7:20 UTC (permalink / raw) To: Stephen Hemminger Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 12:54 AM, Michael Breuer wrote: > On 1/7/2010 12:32 AM, Michael Breuer wrote: >> On 1/6/2010 11:53 PM, Stephen Hemminger wrote: >>> On Wed, 06 Jan 2010 23:00:34 -0500 >>> Michael Breuer<mbreuer@majjas.com> wrote: >>> >>>> Changing MTU to 9000, everything basically breaks - Can't use X11 >>>> (local >>>> or remote - get X11 screen after gdm login locally, but then goes back >>>> to greeter; remote gets no greeter); ssh sessions hang; etc. This >>>> time I >>>> was able to reset the MTU back to 1500 without a reboot - but I did >>>> have >>>> to ifconfig eth0 down and then up. Looking at the sk98lin code, it >>>> looks >>>> to me like they do a bit more work with existing buffers before >>>> completing the MTU switch. Note that even doing this, X11 did not work >>>> (it did with the old mtu change code). Tried changing to mtu 4500 - >>>> same >>>> effect as 9000... but when I switched back to 1500, ksoftirqd started >>>> spinning using 100% of one core. >>> The problem is that patch was enabling scatter-gather and checksum >>> offload >>> that won't work on EC_U hardware with 9K MTU. At least, it never >>> worked >>> for me when I tested it. So because of that it really doesn't change >>> anything >>> for the better on that chip version. >>> >>> What version chip is on that motherboard? Mine is: >>> Yukon-2 EC Ultra chip revision 3 >>> which corresponds to B0 step. >>> >>> Another possibility is the PHY register which controls number of ticks >>> of buffering. The default is zero, which gives the most buffering >>> (good), >>> but the firmware could be reprogramming it (bad). In general, the >>> driver >>> doesn't fiddle with bits that are already set correctly, because >>> sometimes >>> vendors need to tweak PCI timing in firmware/BIOS. It seems the >>> firmware on this >>> chip is just a bunch of register setups done on power on. >> Also - I'm seeing a huge number of dropped packets (RX) >> 200-300/second. Probably why this is so slow. >> >> Current ifconfig: >> eth0 Link encap:Ethernet HWaddr 00:26:18:00:1C:3B >> inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0 >> inet6 addr: fe80::226:18ff:fe00:1c3b/64 Scope:Link >> UP BROADCAST RUNNING ALLMULTI MULTICAST MTU:1500 Metric:1 >> RX packets:26647536 errors:0 dropped:517884 overruns:0 frame:0 >> TX packets:12112780 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:1000 >> RX bytes:38960063319 (36.2 GiB) TX bytes:1889879762 (1.7 GiB) >> Interrupt:18 >> >> >> > > Never mind... spoke too soon. Crashed again. Just took longer: ... Reapplied a couple of earlier patches - still can't do jumbo frames, but the rx errors are gone and speed has improved. Too early to assure that it's stable. Patches that seem to fix the rx drops (all from Stephen): 1) Patch change to tx_init 2) Patch to lock netif_device_detach 3) Patch to sky2_tx_complete to add netif_device_present test Also in the mix: Jarek's alternative 2 With this set and mtu=1500 all seems good - decent if not stellar throughput; no logged errors; no reported packet loss. As before, will leave running and see if anything falls apart. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 7:20 ` Michael Breuer @ 2010-01-07 7:47 ` Jarek Poplawski 2010-01-07 7:55 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-07 7:47 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 02:20:22AM -0500, Michael Breuer wrote: > ... > Reapplied a couple of earlier patches - still can't do jumbo frames, but > the rx errors are gone and speed has improved. Too early to assure that > it's stable. > > Patches that seem to fix the rx drops (all from Stephen): > 1) Patch change to tx_init > 2) Patch to lock netif_device_detach > 3) Patch to sky2_tx_complete to add netif_device_present test > Also in the mix: Jarek's alternative 2 BTW, the main difference between alt. 1 and 2 is error notification: alternative 2 doesn't hide some (most) of drops, so, dependending on app, there might be more and faster retransmits. (I don't know what apps used by you (other than dhcp) can depend so much on this.) > > With this set and mtu=1500 all seems good - decent if not stellar > throughput; no logged errors; no reported packet loss. As before, will > leave running and see if anything falls apart. Good news! Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 7:47 ` Jarek Poplawski @ 2010-01-07 7:55 ` Michael Breuer 2010-01-07 8:21 ` Jarek Poplawski 2010-01-07 15:05 ` Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-07 7:55 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 2:47 AM, Jarek Poplawski wrote: > On Thu, Jan 07, 2010 at 02:20:22AM -0500, Michael Breuer wrote: > >> ... >> Reapplied a couple of earlier patches - still can't do jumbo frames, but >> the rx errors are gone and speed has improved. Too early to assure that >> it's stable. >> >> Patches that seem to fix the rx drops (all from Stephen): >> 1) Patch change to tx_init >> 2) Patch to lock netif_device_detach >> 3) Patch to sky2_tx_complete to add netif_device_present test >> Also in the mix: Jarek's alternative 2 >> > BTW, the main difference between alt. 1 and 2 is error notification: > alternative 2 doesn't hide some (most) of drops, so, dependending on > app, there might be more and faster retransmits. (I don't know what > apps used by you (other than dhcp) can depend so much on this.) > > Unless I misread the code, I think that in some cases e skb is actually freed if the cfq (among others perhaps) scheduler returns an error on enqueue (flow control perhaps). Thus with alternative 1, it is possible that the skb is acted upon after being freed - this would be consistent with the DMAR errors I saw. >> With this set and mtu=1500 all seems good - decent if not stellar >> throughput; no logged errors; no reported packet loss. As before, will >> leave running and see if anything falls apart. >> > Good news! > > Jarek P. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 7:55 ` Michael Breuer @ 2010-01-07 8:21 ` Jarek Poplawski 2010-01-07 15:03 ` Michael Breuer 2010-01-07 15:05 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-07 8:21 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 02:55:20AM -0500, Michael Breuer wrote: > On 1/7/2010 2:47 AM, Jarek Poplawski wrote: >> On Thu, Jan 07, 2010 at 02:20:22AM -0500, Michael Breuer wrote: >> >>> ... >>> Reapplied a couple of earlier patches - still can't do jumbo frames, but >>> the rx errors are gone and speed has improved. Too early to assure that >>> it's stable. >>> >>> Patches that seem to fix the rx drops (all from Stephen): >>> 1) Patch change to tx_init >>> 2) Patch to lock netif_device_detach >>> 3) Patch to sky2_tx_complete to add netif_device_present test >>> Also in the mix: Jarek's alternative 2 >>> >> BTW, the main difference between alt. 1 and 2 is error notification: >> alternative 2 doesn't hide some (most) of drops, so, dependending on >> app, there might be more and faster retransmits. (I don't know what >> apps used by you (other than dhcp) can depend so much on this.) >> >> > Unless I misread the code, I think that in some cases e skb is actually > freed if the cfq (among others perhaps) scheduler returns an error on > enqueue (flow control perhaps). Thus with alternative 1, it is possible > that the skb is acted upon after being freed - this would be consistent > with the DMAR errors I saw. I can't see your point: could you give some scenario? Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 8:21 ` Jarek Poplawski @ 2010-01-07 15:03 ` Michael Breuer 2010-01-07 17:56 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 15:03 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 3:21 AM, Jarek Poplawski wrote: > On Thu, Jan 07, 2010 at 02:55:20AM -0500, Michael Breuer wrote: > >> On 1/7/2010 2:47 AM, Jarek Poplawski wrote: >> >>> On Thu, Jan 07, 2010 at 02:20:22AM -0500, Michael Breuer wrote: >>> >>> >>>> ... >>>> Reapplied a couple of earlier patches - still can't do jumbo frames, but >>>> the rx errors are gone and speed has improved. Too early to assure that >>>> it's stable. >>>> >>>> Patches that seem to fix the rx drops (all from Stephen): >>>> 1) Patch change to tx_init >>>> 2) Patch to lock netif_device_detach >>>> 3) Patch to sky2_tx_complete to add netif_device_present test >>>> Also in the mix: Jarek's alternative 2 >>>> >>>> >>> BTW, the main difference between alt. 1 and 2 is error notification: >>> alternative 2 doesn't hide some (most) of drops, so, dependending on >>> app, there might be more and faster retransmits. (I don't know what >>> apps used by you (other than dhcp) can depend so much on this.) >>> >>> >>> >> Unless I misread the code, I think that in some cases e skb is actually >> freed if the cfq (among others perhaps) scheduler returns an error on >> enqueue (flow control perhaps). Thus with alternative 1, it is possible >> that the skb is acted upon after being freed - this would be consistent >> with the DMAR errors I saw. >> > I can't see your point: could you give some scenario? > > Jarek P. > With NET_CLS_ACT set, net_dev_enqueue can return an error after freeing the skb. Alternative 1 disregards the error and assumes the skb is still valid. The original code and alternative 2 exit the loop assuming the skb has been freed. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 15:03 ` Michael Breuer @ 2010-01-07 17:56 ` Jarek Poplawski 2010-01-07 18:17 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-07 17:56 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 10:03:51AM -0500, Michael Breuer wrote: > On 1/7/2010 3:21 AM, Jarek Poplawski wrote: > >On Thu, Jan 07, 2010 at 02:55:20AM -0500, Michael Breuer wrote: > >>Unless I misread the code, I think that in some cases e skb is actually > >>freed if the cfq (among others perhaps) scheduler returns an error on > >>enqueue (flow control perhaps). Thus with alternative 1, it is possible > >>that the skb is acted upon after being freed - this would be consistent > >>with the DMAR errors I saw. > >I can't see your point: could you give some scenario? > > > >Jarek P. > With NET_CLS_ACT set, net_dev_enqueue can return an error after > freeing the skb. Alternative 1 disregards the error and assumes the > skb is still valid. The original code and alternative 2 exit the > loop assuming the skb has been freed. Not exactly: alternative 1 disregards the error, and tries to send next skbs if the message was longer. After consuming all the message it returns without err code (at least wrt. dev_queue_xmit). This is quite often practice to skip dev_queue_xmit() return (try to grep in net\). It should never touch any part of an earlier sent skb. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 17:56 ` Jarek Poplawski @ 2010-01-07 18:17 ` Jarek Poplawski 0 siblings, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-07 18:17 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 06:56:10PM +0100, Jarek Poplawski wrote: > [...] (try to grep in net\). [...] Did I say "net\"?! Extremely sorry! ;-) Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 7:55 ` Michael Breuer 2010-01-07 8:21 ` Jarek Poplawski @ 2010-01-07 15:05 ` Michael Breuer 2010-01-07 18:01 ` Jarek Poplawski 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 15:05 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 2:55 AM, Michael Breuer wrote: > On 1/7/2010 2:47 AM, Jarek Poplawski wrote: >> On Thu, Jan 07, 2010 at 02:20:22AM -0500, Michael Breuer wrote: >>> ... >>> Reapplied a couple of earlier patches - still can't do jumbo frames, >>> but >>> the rx errors are gone and speed has improved. Too early to assure that >>> it's stable. >>> >>> Patches that seem to fix the rx drops (all from Stephen): >>> 1) Patch change to tx_init >>> 2) Patch to lock netif_device_detach >>> 3) Patch to sky2_tx_complete to add netif_device_present test >>> Also in the mix: Jarek's alternative 2 >> BTW, the main difference between alt. 1 and 2 is error notification: >> alternative 2 doesn't hide some (most) of drops, so, dependending on >> app, there might be more and faster retransmits. (I don't know what >> apps used by you (other than dhcp) can depend so much on this.) >> > Unless I misread the code, I think that in some cases e skb is > actually freed if the cfq (among others perhaps) scheduler returns an > error on enqueue (flow control perhaps). Thus with alternative 1, it > is possible that the skb is acted upon after being freed - this would > be consistent with the DMAR errors I saw. > >>> With this set and mtu=1500 all seems good - decent if not stellar >>> throughput; no logged errors; no reported packet loss. As before, will >>> leave running and see if anything falls apart. >> Good news! >> >> Jarek P. >> -- >> To unsubscribe from this list: send the line "unsubscribe >> linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > Bad news - crashed about an hour after I wrote this email - under load - same crash as before. Network watchdog... lots of attempts to reset the adapter... then hw watchdog rebooted the system. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 15:05 ` Michael Breuer @ 2010-01-07 18:01 ` Jarek Poplawski 2010-01-07 18:19 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-07 18:01 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 10:05:37AM -0500, Michael Breuer wrote: > Bad news - crashed about an hour after I wrote this email - under > load - same crash as before. Network watchdog... lots of attempts to > reset the adapter... then hw watchdog rebooted the system. It's a pity. Anyway, I'd be still interested in CONFIG_PACKET_MMAP off if you find time. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 18:01 ` Jarek Poplawski @ 2010-01-07 18:19 ` Michael Breuer 2010-01-07 18:35 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 18:19 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 1:01 PM, Jarek Poplawski wrote: > On Thu, Jan 07, 2010 at 10:05:37AM -0500, Michael Breuer wrote: > >> Bad news - crashed about an hour after I wrote this email - under >> load - same crash as before. Network watchdog... lots of attempts to >> reset the adapter... then hw watchdog rebooted the system. >> > It's a pity. Anyway, I'd be still interested in CONFIG_PACKET_MMAP off > if you find time. > > Jarek P. > Ok - any particular patch set to try with? I'm going to start with a clean tree using the latest 2.6.32 from git (tried 2.6.33-rc3, but can't get a usable console... will look at that later.) ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 18:19 ` Michael Breuer @ 2010-01-07 18:35 ` Jarek Poplawski 2010-01-07 18:40 ` Michael Breuer 2010-01-07 18:43 ` Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-07 18:35 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 01:19:41PM -0500, Michael Breuer wrote: > On 1/7/2010 1:01 PM, Jarek Poplawski wrote: > >On Thu, Jan 07, 2010 at 10:05:37AM -0500, Michael Breuer wrote: > >>Bad news - crashed about an hour after I wrote this email - under > >>load - same crash as before. Network watchdog... lots of attempts to > >>reset the adapter... then hw watchdog rebooted the system. > >It's a pity. Anyway, I'd be still interested in CONFIG_PACKET_MMAP off > >if you find time. > > > >Jarek P. > Ok - any particular patch set to try with? I'm going to start with a > clean tree using the latest 2.6.32 from git (tried 2.6.33-rc3, but > can't get a usable console... will look at that later.) My "Berck E. Nash" and Stephen's "pskb_may_pull" sky2 patches. (BTW, could you remind if it worked any better with 2.6.31 or earlier?) Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 18:35 ` Jarek Poplawski @ 2010-01-07 18:40 ` Michael Breuer 2010-01-07 18:43 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-07 18:40 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 1:35 PM, Jarek Poplawski wrote: > On Thu, Jan 07, 2010 at 01:19:41PM -0500, Michael Breuer wrote: > >> On 1/7/2010 1:01 PM, Jarek Poplawski wrote: >> >>> On Thu, Jan 07, 2010 at 10:05:37AM -0500, Michael Breuer wrote: >>> >>>> Bad news - crashed about an hour after I wrote this email - under >>>> load - same crash as before. Network watchdog... lots of attempts to >>>> reset the adapter... then hw watchdog rebooted the system. >>>> >>> It's a pity. Anyway, I'd be still interested in CONFIG_PACKET_MMAP off >>> if you find time. >>> >>> Jarek P. >>> >> Ok - any particular patch set to try with? I'm going to start with a >> clean tree using the latest 2.6.32 from git (tried 2.6.33-rc3, but >> can't get a usable console... will look at that later.) >> > My "Berck E. Nash" and Stephen's "pskb_may_pull" sky2 patches. (BTW, > could you remind if it worked any better with 2.6.31 or earlier?) > > Jarek P. > Never tried earlier than 2.6.31 - new machine. I had a bunch of issues on 2.6.31, including sky2 DMAR errors... so probably not a 2.6.32 regression. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 18:35 ` Jarek Poplawski 2010-01-07 18:40 ` Michael Breuer @ 2010-01-07 18:43 ` Michael Breuer 2010-01-07 18:50 ` Jarek Poplawski 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 18:43 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 1:35 PM, Jarek Poplawski wrote: > On Thu, Jan 07, 2010 at 01:19:41PM -0500, Michael Breuer wrote: > >> On 1/7/2010 1:01 PM, Jarek Poplawski wrote: >> >>> On Thu, Jan 07, 2010 at 10:05:37AM -0500, Michael Breuer wrote: >>> >>>> Bad news - crashed about an hour after I wrote this email - under >>>> load - same crash as before. Network watchdog... lots of attempts to >>>> reset the adapter... then hw watchdog rebooted the system. >>>> >>> It's a pity. Anyway, I'd be still interested in CONFIG_PACKET_MMAP off >>> if you find time. >>> >>> Jarek P. >>> >> Ok - any particular patch set to try with? I'm going to start with a >> clean tree using the latest 2.6.32 from git (tried 2.6.33-rc3, but >> can't get a usable console... will look at that later.) >> > My "Berck E. Nash" and Stephen's "pskb_may_pull" sky2 patches. (BTW, > could you remind if it worked any better with 2.6.31 or earlier?) > > Jarek P. > I'm not sure my crash-and-burn runs yesterday included the pskb_may_pull patch :( Going to rerun with these patches and with and without MMAP. Will also retry both with jumbo frames if possible. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 18:43 ` Michael Breuer @ 2010-01-07 18:50 ` Jarek Poplawski 2010-01-07 19:36 ` Jarek Poplawski 2010-01-07 23:11 ` Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-07 18:50 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 01:43:08PM -0500, Michael Breuer wrote: > On 1/7/2010 1:35 PM, Jarek Poplawski wrote: > >On Thu, Jan 07, 2010 at 01:19:41PM -0500, Michael Breuer wrote: > >>On 1/7/2010 1:01 PM, Jarek Poplawski wrote: > >>>On Thu, Jan 07, 2010 at 10:05:37AM -0500, Michael Breuer wrote: > >>>>Bad news - crashed about an hour after I wrote this email - under > >>>>load - same crash as before. Network watchdog... lots of attempts to > >>>>reset the adapter... then hw watchdog rebooted the system. > >>>It's a pity. Anyway, I'd be still interested in CONFIG_PACKET_MMAP off > >>>if you find time. > >>> > >>>Jarek P. > >>Ok - any particular patch set to try with? I'm going to start with a > >>clean tree using the latest 2.6.32 from git (tried 2.6.33-rc3, but > >>can't get a usable console... will look at that later.) > >My "Berck E. Nash" and Stephen's "pskb_may_pull" sky2 patches. (BTW, > >could you remind if it worked any better with 2.6.31 or earlier?) > > > >Jarek P. > I'm not sure my crash-and-burn runs yesterday included the > pskb_may_pull patch :( > > Going to rerun with these patches and with and without MMAP. Will > also retry both with jumbo frames if possible. If MMAP then some "alternative" too. But first no MMAP. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 18:50 ` Jarek Poplawski @ 2010-01-07 19:36 ` Jarek Poplawski 2010-01-07 19:55 ` Michael Breuer 2010-01-07 23:11 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-07 19:36 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 07:50:40PM +0100, Jarek Poplawski wrote: > > Going to rerun with these patches and with and without MMAP. Will > > also retry both with jumbo frames if possible. > > If MMAP then some "alternative" too. But first no MMAP. Another things IMHO worth to try: a sky2 module parameter "disable_msi=1", and CONFIG_DMAR off. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 19:36 ` Jarek Poplawski @ 2010-01-07 19:55 ` Michael Breuer 2010-01-07 20:22 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 19:55 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 2:36 PM, Jarek Poplawski wrote: > On Thu, Jan 07, 2010 at 07:50:40PM +0100, Jarek Poplawski wrote: > >>> Going to rerun with these patches and with and without MMAP. Will >>> also retry both with jumbo frames if possible. >>> >> If MMAP then some "alternative" too. But first no MMAP. >> > Another things IMHO worth to try: a sky2 module parameter > "disable_msi=1", and CONFIG_DMAR off. > > Jarek P. > Ok - that'd be with or without MMAP enabled? (note that so-far, without MMAP I'm not seeing any errors - throughput is running about half what I was seeing with MMAP enabled (before crashing that is). CPU is also way busier (also to be expected). One other observation - I had been seeing lots of DNS errors - IPV6 related format errors I really didn't think much of it as they were mostly .ru and seemed spam-related, but now I don't see any. Haven't updated bind; doubt the world has changed - so perhaps this is related to the network issue. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 19:55 ` Michael Breuer @ 2010-01-07 20:22 ` Jarek Poplawski 0 siblings, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-07 20:22 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 02:55:22PM -0500, Michael Breuer wrote: > On 1/7/2010 2:36 PM, Jarek Poplawski wrote: > >On Thu, Jan 07, 2010 at 07:50:40PM +0100, Jarek Poplawski wrote: > >>>Going to rerun with these patches and with and without MMAP. Will > >>>also retry both with jumbo frames if possible. > >>If MMAP then some "alternative" too. But first no MMAP. > >Another things IMHO worth to try: a sky2 module parameter > >"disable_msi=1", and CONFIG_DMAR off. > > > >Jarek P. > Ok - that'd be with or without MMAP enabled? (note that so-far, > without MMAP I'm not seeing any errors - throughput is running about > half what I was seeing with MMAP enabled (before crashing that is). > CPU is also way busier (also to be expected). One other observation > - I had been seeing lots of DNS errors - IPV6 related format errors > I really didn't think much of it as they were mostly .ru and seemed > spam-related, but now I don't see any. Haven't updated bind; doubt > the world has changed - so perhaps this is related to the network > issue. MMAP enabled (with some "alternative" patch - to avoid known bugs) should give as earlier the answer if these changes matter. But first let's try longer (if possible) if this "no MMAP" could really heal your hardware. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 18:50 ` Jarek Poplawski 2010-01-07 19:36 ` Jarek Poplawski @ 2010-01-07 23:11 ` Michael Breuer 2010-01-08 7:45 ` Jarek Poplawski 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-07 23:11 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/7/2010 1:50 PM, Jarek Poplawski wrote: > On Thu, Jan 07, 2010 at 01:43:08PM -0500, Michael Breuer wrote: > >> On 1/7/2010 1:35 PM, Jarek Poplawski wrote: >> >>> On Thu, Jan 07, 2010 at 01:19:41PM -0500, Michael Breuer wrote: >>> >>>> On 1/7/2010 1:01 PM, Jarek Poplawski wrote: >>>> >>>>> On Thu, Jan 07, 2010 at 10:05:37AM -0500, Michael Breuer wrote: >>>>> >>>>>> Bad news - crashed about an hour after I wrote this email - under >>>>>> load - same crash as before. Network watchdog... lots of attempts to >>>>>> reset the adapter... then hw watchdog rebooted the system. >>>>>> >>>>> It's a pity. Anyway, I'd be still interested in CONFIG_PACKET_MMAP off >>>>> if you find time. >>>>> >>>>> Jarek P. >>>>> >>>> Ok - any particular patch set to try with? I'm going to start with a >>>> clean tree using the latest 2.6.32 from git (tried 2.6.33-rc3, but >>>> can't get a usable console... will look at that later.) >>>> >>> My "Berck E. Nash" and Stephen's "pskb_may_pull" sky2 patches. (BTW, >>> could you remind if it worked any better with 2.6.31 or earlier?) >>> >>> Jarek P. >>> >> I'm not sure my crash-and-burn runs yesterday included the >> pskb_may_pull patch :( >> >> Going to rerun with these patches and with and without MMAP. Will >> also retry both with jumbo frames if possible. >> > If MMAP then some "alternative" too. But first no MMAP. > > Jarek P. > Results: * no MMAP, mtu=1500, neither alternative patch loaded: adapter crashed: Jan 7 15:44:23 mail kernel: DRHD: handling fault status reg 2 Jan 7 15:44:23 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr fffb7bffe000 Jan 7 15:44:23 mail kernel: DMAR:[fault reason 06] PTE Read access is not set Jan 7 15:44:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x80000000 Jan 7 15:44:23 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) Jan 7 15:44:24 mail smbd[6572]: [2010/01/07 15:44:24, 0] lib/util_sock.c:539(read_fd_with_timeout) Jan 7 15:44:24 mail smbd[6572]: [2010/01/07 15:44:24, 0] lib/util_sock.c:1491(get_peer_addr_internal) Jan 7 15:44:24 mail smbd[6572]: getpeername failed. Error was Transport endpoint is not connected Jan 7 15:44:24 mail smbd[6572]: read_fd_with_timeout: client 0.0.0.0 read error = Connection timed out. Jan 7 15:44:44 mail kernel: ------------[ cut here ]------------ Jan 7 15:44:44 mail kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164() Jan 7 15:44:44 mail kernel: Hardware name: System Product Name Jan 7 15:44:44 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out Jan 7 15:44:44 mail kernel: Modules linked in: ip6table_filter ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat nf_nat iptable_mangle iptable_raw bridge stp appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec snd_hda_intel snd_hda_codec ac97_bus snd_hwdep snd_seq snd_seq_device snd_pcm gspca_spca505 gspca_main firewire_ohci videodev v4l1_compat firewire_core pcspkr v4l2_compat_ioctl32 snd_timer iTCO_wdt i2c_i801 crc_itu_t iTCO_vendor_support snd soundcore snd_page_alloc sky2 wmi asus_atk0110 hwmon fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfil Jan 7 15:44:44 mail kernel: lrect [last unloaded: microcode] Jan 7 15:44:44 mail kernel: Pid: 0, comm: swapper Tainted: G W 2.6.32NOMMAP-00847-g50ebb93-dirty #4 Jan 7 15:44:44 mail kernel: Call Trace: Jan 7 15:44:44 mail kernel: <IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94 Jan 7 15:44:44 mail kernel: [<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43 Jan 7 15:44:44 mail kernel: [<ffffffff813e2dcf>] ? netif_tx_lock+0x44/0x6c Jan 7 15:44:44 mail kernel: [<ffffffff813e2f37>] dev_watchdog+0xf3/0x164 Jan 7 15:44:44 mail kernel: [<ffffffff8106e8a4>] ? __queue_work+0x3a/0x42 Jan 7 15:44:44 mail kernel: [<ffffffff8106316b>] run_timer_softirq+0x1c8/0x270 Jan 7 15:44:44 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd Jan 7 15:44:44 mail kernel: [<ffffffff8107ef33>] ? tick_program_event+0x2a/0x2c Jan 7 15:44:44 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 Jan 7 15:44:44 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 Jan 7 15:44:44 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c Jan 7 15:44:44 mail kernel: [<ffffffff8146e3e2>] smp_apic_timer_interrupt+0x86/0x94 Jan 7 15:44:44 mail kernel: [<ffffffff810127e3>] apic_timer_interrupt+0x13/0x20 Jan 7 15:44:44 mail kernel: <EOI> [<ffffffff812c678a>] ? acpi_idle_enter_bm+0x256/0x28a Jan 7 15:44:44 mail kernel: [<ffffffff812c6783>] ? acpi_idle_enter_bm+0x24f/0x28a Jan 7 15:44:44 mail kernel: [<ffffffff813a5ec8>] ? cpuidle_idle_call+0x9e/0xfa Jan 7 15:44:44 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 Jan 7 15:44:44 mail kernel: [<ffffffff814639c2>] ? start_secondary+0x201/0x242 Jan 7 15:44:44 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- Jan 7 15:44:44 mail kernel: sky2 eth0: tx timeout Jan 7 15:44:44 mail kernel: sky2 eth0: transmit ring 77 .. 36 report=77 done=77 Jan 7 15:44:44 mail kernel: sky2 eth0: disabling interface Jan 7 15:44:44 mail kernel: sky2 eth0: enabling interface --- adapter dead after this --- rebooted. * no MMAP; alternative 1 patch, mtu=1500; no errors; sustained transfer rates about 25% lower than what I saw with mmap enabled...(before MMAP enabled crashed). * no MMAP mtu=9000; ran ok at low transfer rates - when high rates kicked in, got the sky2 interrupt error & things went south: Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 After this, remote connections broke and I rebooted... decided to rerun w/o MMAP again before going back to MMAP and trying those other sky2 options... * Retest of no MMAP + Alternative 1 - just to confirm consistency. Worked - no errors. Only version so far that allows the win7 backup to complete. * MMAP + NO DMAR + disable_msi=1... also works w/o errors... leaving this one running for a while - also completed a backup successfully. Fastest of the lot... about 3x faster than any other version, working or not. I'm leaving this one running for now. Not retesting jumbo for now. Be happy to help dig further. Tentative recommendations: 1) The af alternative patch seems rather necessary. First alternative seems to be working, I'd suggest that be submitted and backported to 2.6.32. 2) Steven's pskb_may_pull patch also ought to be included and backported. 3) Jumbo frame support for yukon2 should probably be disabled until/if fixed. 4) When possible I'll test dmar and disable_msi, and no dmar and no disable_msi. When I first hit issues, I was running without DMAR, but also without the above patches. I suppose the non-working permutations need to be either fixed or invalidated (or well documented). 5) It would be nice if someone with comparable hardware could reproduce these issues. FWIW, I can only recreate the crash running windows backup to a cifs share. Copying large files doesn't seem to do it. Could also be some other interaction going on here that perhaps others aren't running - would be happy to compare notes. Notes: This *could* be coincidental, but maybe not... With MMAP+NO DMAR + disable_msi there are far fewer ... actually almost no... bind error reports... and no bind format error messages. With NOMMAP and alternative one there are a few more bind error messages and one format error message during the several hours that version was up. All other configurations going back perhaps for two weeks have significantly more bind error reports - and all versions show increasing frequency of bind format errors (IPV6 only) in the roughly 10-15 minutes preceding the lockup/crash/interrupt error messages. There are none immediately preceding any crash, but perhaps there is some correlation between the network errors and bind ipv6 packets. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-07 23:11 ` Michael Breuer @ 2010-01-08 7:45 ` Jarek Poplawski 2010-01-08 16:40 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-08 7:45 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Thu, Jan 07, 2010 at 06:11:34PM -0500, Michael Breuer wrote: > Results: > * no MMAP, mtu=1500, neither alternative patch loaded: adapter crashed: > Jan 7 15:44:23 mail kernel: DRHD: handling fault status reg 2 > Jan 7 15:44:23 mail kernel: DMAR:[DMA Read] Request device [06:00.0] > fault addr fffb7bffe000 > Jan 7 15:44:23 mail kernel: DMAR:[fault reason 06] PTE Read access is > not set > Jan 7 15:44:23 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x80000000 > Jan 7 15:44:23 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) > Jan 7 15:44:24 mail smbd[6572]: [2010/01/07 15:44:24, 0] > lib/util_sock.c:539(read_fd_with_timeout) > Jan 7 15:44:24 mail smbd[6572]: [2010/01/07 15:44:24, 0] > lib/util_sock.c:1491(get_peer_addr_internal) > Jan 7 15:44:24 mail smbd[6572]: getpeername failed. Error was > Transport endpoint is not connected > Jan 7 15:44:24 mail smbd[6572]: read_fd_with_timeout: client 0.0.0.0 > read error = Connection timed out. > Jan 7 15:44:44 mail kernel: ------------[ cut here ]------------ > Jan 7 15:44:44 mail kernel: WARNING: at net/sched/sch_generic.c:261 > dev_watchdog+0xf3/0x164() > Jan 7 15:44:44 mail kernel: Hardware name: System Product Name > Jan 7 15:44:44 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit > queue 0 timed out > Jan 7 15:44:44 mail kernel: Modules linked in: ip6table_filter > ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat nf_nat > iptable_mangle iptable_raw bridge stp appletalk psnap llc nfsd lockd > nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit > tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp > xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm > snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec > snd_hda_intel snd_hda_codec ac97_bus snd_hwdep snd_seq snd_seq_device > snd_pcm gspca_spca505 gspca_main firewire_ohci videodev v4l1_compat > firewire_core pcspkr v4l2_compat_ioctl32 snd_timer iTCO_wdt i2c_i801 > crc_itu_t iTCO_vendor_support snd soundcore snd_page_alloc sky2 wmi > asus_atk0110 hwmon fbcon tileblit font bitblit softcursor raid456 > async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx > raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm > agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfil > Jan 7 15:44:44 mail kernel: lrect [last unloaded: microcode] > Jan 7 15:44:44 mail kernel: Pid: 0, comm: swapper Tainted: G W BTW, was there any other oops saved before this one? ... > --- adapter dead after this --- rebooted. > * no MMAP; alternative 1 patch, mtu=1500; no errors; sustained transfer > rates about 25% lower than what I saw with mmap enabled...(before MMAP > enabled crashed). ?? Read below... > * no MMAP mtu=9000; ran ok at low transfer rates - when high rates > kicked in, got the sky2 interrupt error & things went south: > Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > After this, remote connections broke and I rebooted... decided to rerun > w/o MMAP again before going back to MMAP and trying those other sky2 > options... > * Retest of no MMAP + Alternative 1 - just to confirm consistency. > Worked - no errors. Only version so far that allows the win7 backup to > complete. ??? Hmm... Alternative 1 or 2 doesn't even compile into when no MMAP, so it definitely needs re-retesting ;-) > * MMAP + NO DMAR + disable_msi=1... also works w/o errors... leaving > this one running for a while - also completed a backup successfully. > Fastest of the lot... about 3x faster than any other version, working or > not. Very interesting. It would be nice to give it a really long try, and if still true, try MMAP + NO DMAR only. > > I'm leaving this one running for now. Not retesting jumbo for now. Be > happy to help dig further. > > Tentative recommendations: > > 1) The af alternative patch seems rather necessary. First alternative > seems to be working, I'd suggest that be submitted and backported to > 2.6.32. > 2) Steven's pskb_may_pull patch also ought to be included and backported. > 3) Jumbo frame support for yukon2 should probably be disabled until/if > fixed. > 4) When possible I'll test dmar and disable_msi, and no dmar and no > disable_msi. When I first hit issues, I was running without DMAR, but > also without the above patches. I suppose the non-working permutations > need to be either fixed or invalidated (or well documented). > 5) It would be nice if someone with comparable hardware could reproduce > these issues. FWIW, I can only recreate the crash running windows backup > to a cifs share. Copying large files doesn't seem to do it. Could also > be some other interaction going on here that perhaps others aren't > running - would be happy to compare notes. > > Notes: > This *could* be coincidental, but maybe not... > With MMAP+NO DMAR + disable_msi there are far fewer ... actually almost > no... bind error reports... and no bind format error messages. With > NOMMAP and alternative one there are a few more bind error messages and > one format error message during the several hours that version was up. > All other configurations going back perhaps for two weeks have > significantly more bind error reports - and all versions show increasing > frequency of bind format errors (IPV6 only) in the roughly 10-15 minutes > preceding the lockup/crash/interrupt error messages. There are none > immediately preceding any crash, but perhaps there is some correlation > between the network errors and bind ipv6 packets. OK, for now let's make sure this MMAP + NO DMAR + disable_msi is really really working. Thanks, Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-08 7:45 ` Jarek Poplawski @ 2010-01-08 16:40 ` Michael Breuer 2010-01-08 21:29 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-08 16:40 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev, Berck E. Nash On 1/8/2010 2:45 AM, Jarek Poplawski wrote: > On Thu, Jan 07, 2010 at 06:11:34PM -0500, Michael Breuer wrote: > >> Results: >> ... >> Jan 7 15:44:44 mail kernel: Pid: 0, comm: swapper Tainted: G W >> > BTW, was there any other oops saved before this one? > > ... > Nope - just the one. >> --- adapter dead after this --- rebooted. >> * no MMAP; alternative 1 patch, mtu=1500; no errors; sustained transfer >> rates about 25% lower than what I saw with mmap enabled...(before MMAP >> enabled crashed). >> > ?? Read below... > > >> * no MMAP mtu=9000; ran ok at low transfer rates - when high rates >> kicked in, got the sky2 interrupt error& things went south: >> Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> After this, remote connections broke and I rebooted... decided to rerun >> w/o MMAP again before going back to MMAP and trying those other sky2 >> options... >> * Retest of no MMAP + Alternative 1 - just to confirm consistency. >> Worked - no errors. Only version so far that allows the win7 backup to >> complete. >> > ??? Hmm... Alternative 1 or 2 doesn't even compile into when no MMAP, > so it definitely needs re-retesting ;-) > I see your point. I'm pretty sure that run failed miserably. Perhaps something else is going on - some sort if intermittent thing that just got caught there... have one thought - see below. > >> * MMAP + NO DMAR + disable_msi=1... also works w/o errors... leaving >> this one running for a while - also completed a backup successfully. >> Fastest of the lot... about 3x faster than any other version, working or >> not. >> > Very interesting. It would be nice to give it a really long try, and > if still true, try MMAP + NO DMAR only. > Still up - no kernel errors reported. There was a large dropped packet rate (RX) which seems to actually correlate with DNS format error messages (ipv6 only). I spent some time tracking those down. Interestingly, most pointed back to one netblock & one ISP (0qf.ru). I blocked that domain and the errors expectedly dropped - as did the RX dropped packet rate. Since booting this version yesterday eth0 shows 1752944 dropped packets. 1752939 of those happened before I blocked the domain about 8 hours ago. I have run load tests since as well. I think this dns activity is sendmail attempting to validate spam - but not 100% sure yet as I can't correlate the .ru domain with anything sendmail has reported. I'm running a sniffer now hoping to catch the next one. I *think* but can't prove, that these are coming in via sendmail - i.e., bad email, not even spam really - just enough to get a system configured to do dns lookups as part of spam filtering to connect to the server in question. What comes back would seem to be corrupt ipv6 packets. This gets us back somewhat to Berck Nash's reported problem. His report of sky2 failure was due to external attack. Could this be related? Is it possible that absent some set of the patches & config settings in this version that ipv6 bind activity involving corrupt (perhaps intentionally) packets is breaking something? Will try rerunning without disable_msi later (after I catch the dns thing in the sniffer). > >> I'm leaving this one running for now. Not retesting jumbo for now. Be >> happy to help dig further. >> >> Tentative recommendations: >> >> 1) The af alternative patch seems rather necessary. First alternative >> seems to be working, I'd suggest that be submitted and backported to >> 2.6.32. >> 2) Steven's pskb_may_pull patch also ought to be included and backported. >> 3) Jumbo frame support for yukon2 should probably be disabled until/if >> fixed. >> 4) When possible I'll test dmar and disable_msi, and no dmar and no >> disable_msi. When I first hit issues, I was running without DMAR, but >> also without the above patches. I suppose the non-working permutations >> need to be either fixed or invalidated (or well documented). >> 5) It would be nice if someone with comparable hardware could reproduce >> these issues. FWIW, I can only recreate the crash running windows backup >> to a cifs share. Copying large files doesn't seem to do it. Could also >> be some other interaction going on here that perhaps others aren't >> running - would be happy to compare notes. >> >> Notes: >> This *could* be coincidental, but maybe not... >> With MMAP+NO DMAR + disable_msi there are far fewer ... actually almost >> no... bind error reports... and no bind format error messages. With >> NOMMAP and alternative one there are a few more bind error messages and >> one format error message during the several hours that version was up. >> All other configurations going back perhaps for two weeks have >> significantly more bind error reports - and all versions show increasing >> frequency of bind format errors (IPV6 only) in the roughly 10-15 minutes >> preceding the lockup/crash/interrupt error messages. There are none >> immediately preceding any crash, but perhaps there is some correlation >> between the network errors and bind ipv6 packets. >> > OK, for now let's make sure this MMAP + NO DMAR + disable_msi is > really really working. > Still running :) > Thanks, > Jarek P. > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-08 16:40 ` Michael Breuer @ 2010-01-08 21:29 ` Jarek Poplawski 2010-01-08 21:48 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-08 21:29 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Fri, Jan 08, 2010 at 11:40:25AM -0500, Michael Breuer wrote: > On 1/8/2010 2:45 AM, Jarek Poplawski wrote: > >On Thu, Jan 07, 2010 at 06:11:34PM -0500, Michael Breuer wrote: > >??? Hmm... Alternative 1 or 2 doesn't even compile into when no MMAP, > >so it definitely needs re-retesting ;-) > I see your point. I'm pretty sure that run failed miserably. Perhaps > something else is going on - some sort if intermittent thing that > just got caught there... have one thought - see below. It looks like very timing dependent and without MMAP it can work just below limit, which could trigger those dmar or msi bugs. Anyway, it seems to show there is no another serious bug in the MMAP part (i.e. except the one fixed be alternative 1 or 2 patches). > >>* MMAP + NO DMAR + disable_msi=1... also works w/o errors... leaving > >>this one running for a while - also completed a backup successfully. > >>Fastest of the lot... about 3x faster than any other version, working or > >>not. > >Very interesting. It would be nice to give it a really long try, and > >if still true, try MMAP + NO DMAR only. > Still up - no kernel errors reported. There was a large dropped > packet rate (RX) which seems to actually correlate with DNS format > error messages (ipv6 only). I spent some time tracking those down. > Interestingly, most pointed back to one netblock & one ISP (0qf.ru). > I blocked that domain and the errors expectedly dropped - as did the > RX dropped packet rate. Since booting this version yesterday eth0 > shows 1752944 dropped packets. 1752939 of those happened before I > blocked the domain about 8 hours ago. I have run load tests since as > well. > > I think this dns activity is sendmail attempting to validate spam - > but not 100% sure yet as I can't correlate the .ru domain with > anything sendmail has reported. I'm running a sniffer now hoping to > catch the next one. I *think* but can't prove, that these are coming > in via sendmail - i.e., bad email, not even spam really - just > enough to get a system configured to do dns lookups as part of spam > filtering to connect to the server in question. What comes back > would seem to be corrupt ipv6 packets. > > This gets us back somewhat to Berck Nash's reported problem. His > report of sky2 failure was due to external attack. Could this be > related? Is it possible that absent some set of the patches & config > settings in this version that ipv6 bind activity involving corrupt > (perhaps intentionally) packets is breaking something? Berck Nash reported oopses during sky2 TX timeout recovery, which are generally hardware/driver problems, and shouldn't be triggered by ip level bugs, so it should be queried as a separate bug report. > > Will try rerunning without disable_msi later (after I catch the dns > thing in the sniffer). > >>I'm leaving this one running for now. Not retesting jumbo for now. Be > >>happy to help dig further. > >> > >>Tentative recommendations: > >> > >>1) The af alternative patch seems rather necessary. First alternative > >>seems to be working, I'd suggest that be submitted and backported to > >>2.6.32. BTW, don't hurry with that yet, but in the next test, please try alternative 2 again (i.e. with MMAP + no DMAR + disable_msi). > >>2) Steven's pskb_may_pull patch also ought to be included and backported. This patch is very helpful for debugging, but I doubt appropriate for the mainline if it isn't triggered any more after 1) fix. But, please keep it yourself for some time in all these tests (and of course Berck Nash's patch). > >OK, for now let's make sure this MMAP + NO DMAR + disable_msi is > >really really working. > Still running :) Very nice :) Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-08 21:29 ` Jarek Poplawski @ 2010-01-08 21:48 ` Michael Breuer 2010-01-08 22:02 ` Jarek Poplawski 2010-01-09 4:45 ` Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-08 21:48 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/8/2010 4:29 PM, Jarek Poplawski wrote: > On Fri, Jan 08, 2010 at 11:40:25AM -0500, Michael Breuer wrote: > >> On 1/8/2010 2:45 AM, Jarek Poplawski wrote: >> >> ... > Berck Nash reported oopses during sky2 TX timeout recovery, which are > generally hardware/driver problems, and shouldn't be triggered by ip > level bugs, so it should be queried as a separate bug report. > > My thought was that his crash was secondary to the netdev watchdog & subsequent reset that I saw. >> Will try rerunning without disable_msi later (after I catch the dns >> thing in the sniffer). >> >>>> I'm leaving this one running for now. Not retesting jumbo for now. Be >>>> happy to help dig further. >>>> >>>> Tentative recommendations: >>>> >>>> 1) The af alternative patch seems rather necessary. First alternative >>>> seems to be working, I'd suggest that be submitted and backported to >>>> 2.6.32. >>>> > BTW, don't hurry with that yet, but in the next test, please try > alternative 2 again (i.e. with MMAP + no DMAR + disable_msi). > > Will do - still up from yesterday... no more dropped packets... none of the dns errors either. To be expected I suppose as long as I'm trying to sniff it. Assuming no immediate erorrs with alt2, no DMAR + disable_msi I'll report back after it's been up for a while. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-08 21:48 ` Michael Breuer @ 2010-01-08 22:02 ` Jarek Poplawski 2010-01-09 4:45 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-08 22:02 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Fri, Jan 08, 2010 at 04:48:11PM -0500, Michael Breuer wrote: > On 1/8/2010 4:29 PM, Jarek Poplawski wrote: > >On Fri, Jan 08, 2010 at 11:40:25AM -0500, Michael Breuer wrote: > >>On 1/8/2010 2:45 AM, Jarek Poplawski wrote: > >>... > >Berck Nash reported oopses during sky2 TX timeout recovery, which are > >generally hardware/driver problems, and shouldn't be triggered by ip > >level bugs, so it should be queried as a separate bug report. > > > My thought was that his crash was secondary to the netdev watchdog & > subsequent reset that I saw. Yes, the netdev watchdog which triggers on TX timeout. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-08 21:48 ` Michael Breuer 2010-01-08 22:02 ` Jarek Poplawski @ 2010-01-09 4:45 ` Michael Breuer 2010-01-09 5:44 ` Michael Breuer 2010-01-09 12:28 ` Jarek Poplawski 1 sibling, 2 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-09 4:45 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/8/2010 4:48 PM, Michael Breuer wrote: > On 1/8/2010 4:29 PM, Jarek Poplawski wrote: > ... >> BTW, don't hurry with that yet, but in the next test, please try >> alternative 2 again (i.e. with MMAP + no DMAR + disable_msi). >> > Will do - still up from yesterday... no more dropped packets... none > of the dns errors either. To be expected I suppose as long as I'm > trying to sniff it. Assuming no immediate erorrs with alt2, no DMAR + > disable_msi I'll report back after it's been up for a while. > -- Ok - ran alt1 and alt2, both with MMAP, no DMAR and disable_msi. Both seem to behave similarly. No logged errors; large numbers of dropped RX packets. One odd thing: when driving every sort of traffic through, I was able to hose the client adapter (win7) repeatedly by runnnig the win7 backup and connecting Windows Media Player to a Mediatomb stream while also running a remote X11 session. Looks like the SSDP traffic that occurs at the same time as the SMB traffic and X11 traffic takes out the adapter on Win7 - Nforce. Reran with msi enabled, MMAP and no DMAR. Also no errors; much faster, and the Win7 side survives the same conditions that don't work when msi is disabled. Doesn't make sense to me, but it is what it is. Going to leave this up for a while and see if things remain functional. > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-09 4:45 ` Michael Breuer @ 2010-01-09 5:44 ` Michael Breuer 2010-01-09 12:28 ` Jarek Poplawski 1 sibling, 0 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-09 5:44 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/8/2010 11:45 PM, Michael Breuer wrote: > On 1/8/2010 4:48 PM, Michael Breuer wrote: >> On 1/8/2010 4:29 PM, Jarek Poplawski wrote: >> ... >>> BTW, don't hurry with that yet, but in the next test, please try >>> alternative 2 again (i.e. with MMAP + no DMAR + disable_msi). >>> >> Will do - still up from yesterday... no more dropped packets... none >> of the dns errors either. To be expected I suppose as long as I'm >> trying to sniff it. Assuming no immediate erorrs with alt2, no DMAR + >> disable_msi I'll report back after it's been up for a while. >> -- > Ok - ran alt1 and alt2, both with MMAP, no DMAR and disable_msi. Both > seem to behave similarly. No logged errors; large numbers of dropped > RX packets. One odd thing: when driving every sort of traffic through, > I was able to hose the client adapter (win7) repeatedly by runnnig the > win7 backup and connecting Windows Media Player to a Mediatomb stream > while also running a remote X11 session. Looks like the SSDP traffic > that occurs at the same time as the SMB traffic and X11 traffic takes > out the adapter on Win7 - Nforce. > Reran with msi enabled, MMAP and no DMAR. Also no errors; much faster, > and the Win7 side survives the same conditions that don't work when > msi is disabled. Doesn't make sense to me, but it is what it is. > Tracked it down - this appears to be result of Win7 enabling "EnableDeadGWDetect" by default. Long story short, if this is set and a packet sent to the gw is retransmitted more than, 1/2 the value of maxdataretransmissions (default of 5), then the gateway is declared dead and the next one selected. If there is only one gateway, then the default gateway is effectively removed. So under load where the server is the default gateway, this is not a surprising result. With msi enabled, the system responds better, fewer retransmissions, no dropping the link. > Going to leave this up for a while and see if things remain functional. >> To unsubscribe from this list: send the line "unsubscribe >> linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-09 4:45 ` Michael Breuer 2010-01-09 5:44 ` Michael Breuer @ 2010-01-09 12:28 ` Jarek Poplawski 2010-01-09 18:34 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-09 12:28 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Fri, Jan 08, 2010 at 11:45:35PM -0500, Michael Breuer wrote: > On 1/8/2010 4:48 PM, Michael Breuer wrote: > >On 1/8/2010 4:29 PM, Jarek Poplawski wrote: > >... > >>BTW, don't hurry with that yet, but in the next test, please try > >>alternative 2 again (i.e. with MMAP + no DMAR + disable_msi). > >> > >Will do - still up from yesterday... no more dropped packets... > >none of the dns errors either. To be expected I suppose as long as > >I'm trying to sniff it. Assuming no immediate erorrs with alt2, no > >DMAR + disable_msi I'll report back after it's been up for a > >while. > >-- > Ok - ran alt1 and alt2, both with MMAP, no DMAR and disable_msi. > Both seem to behave similarly. No logged errors; Ok - since alt2 introduces less changes and is acked by Stephen already, I'll resend it with your "Tested-by" in a new thread to clear things a bit. > large numbers of dropped RX packets. You might try some tweaking with another sky2 parameter "copybreak" or even editing its "NAPI_WEIGHT" variable. > One odd thing: when driving every sort of > traffic through, I was able to hose the client adapter (win7) > repeatedly by runnnig the win7 backup and connecting Windows Media > Player to a Mediatomb stream while also running a remote X11 > session. Looks like the SSDP traffic that occurs at the same time as > the SMB traffic and X11 traffic takes out the adapter on Win7 - > Nforce. > Reran with msi enabled, MMAP and no DMAR. Also no errors; much > faster, and the Win7 side survives the same conditions that don't > work when msi is disabled. Doesn't make sense to me, but it is what > it is. > > Going to leave this up for a while and see if things remain functional. It looks like DMAR is the main candidate for a new linux-kernel@ or bugzilla bug report. You should also consider reporting these ipv6 problems, especially if you think they can trigger TX timeouts. (In both cases it would be good to try current mainline first, when you get it workable.) Thanks, Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-09 12:28 ` Jarek Poplawski @ 2010-01-09 18:34 ` Michael Breuer 2010-01-13 20:39 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-09 18:34 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/9/2010 7:28 AM, Jarek Poplawski wrote: > On Fri, Jan 08, 2010 at 11:45:35PM -0500, Michael Breuer wrote: > >> On 1/8/2010 4:48 PM, Michael Breuer wrote: >> >>> On 1/8/2010 4:29 PM, Jarek Poplawski wrote: >>> ... >>> >>>> BTW, don't hurry with that yet, but in the next test, please try >>>> alternative 2 again (i.e. with MMAP + no DMAR + disable_msi). >>>> >>>> >>> Will do - still up from yesterday... no more dropped packets... >>> none of the dns errors either. To be expected I suppose as long as >>> I'm trying to sniff it. Assuming no immediate erorrs with alt2, no >>> DMAR + disable_msi I'll report back after it's been up for a >>> while. >>> -- >>> >> Ok - ran alt1 and alt2, both with MMAP, no DMAR and disable_msi. >> Both seem to behave similarly. No logged errors; >> > Ok - since alt2 introduces less changes and is acked by Stephen > already, I'll resend it with your "Tested-by" in a new thread to clear > things a bit. > > Ok - good, and thank you for all your help. >> large numbers of dropped RX packets. >> > You might try some tweaking with another sky2 parameter "copybreak" > or even editing its "NAPI_WEIGHT" variable. > > I'll play around with this after I figure out what's actually being dropped and why. Looking at ethtool, over 99% of RX packets are in the >=1024 bucket. I'll play with weight and perhaps rmem as time permits. >> One odd thing: when driving every sort of >> traffic through, I was able to hose the client adapter (win7) >> repeatedly by runnnig the win7 backup and connecting Windows Media >> Player to a Mediatomb stream while also running a remote X11 >> session. Looks like the SSDP traffic that occurs at the same time as >> the SMB traffic and X11 traffic takes out the adapter on Win7 - >> Nforce. >> Reran with msi enabled, MMAP and no DMAR. Also no errors; much >> faster, and the Win7 side survives the same conditions that don't >> work when msi is disabled. Doesn't make sense to me, but it is what >> it is. >> >> Going to leave this up for a while and see if things remain functional. >> > It looks like DMAR is the main candidate for a new linux-kernel@ or > bugzilla bug report. You should also consider reporting these ipv6 > problems, especially if you think they can trigger TX timeouts. (In > both cases it would be good to try current mainline first, when you > get it workable.) > Ok - I'll get onto mainline and then deal with DMAR. > Thanks, > Jarek P. > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-09 18:34 ` Michael Breuer @ 2010-01-13 20:39 ` Michael Breuer 2010-01-13 21:09 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-13 20:39 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/9/2010 1:34 PM, Michael Breuer wrote: > On 1/9/2010 7:28 AM, Jarek Poplawski wrote: ... >> It looks like DMAR is the main candidate for a new linux-kernel@ or >> bugzilla bug report. You should also consider reporting these ipv6 >> problems, especially if you think they can trigger TX timeouts. (In >> both cases it would be good to try current mainline first, when you >> get it workable.) > Ok - I'll get onto mainline and then deal with DMAR. Just an FYI - 2.6.32.3 with alt 3 af_packet patch & sky2 pskb_may_pull runs OK with DMAR (re)enabled and msi enabled. >> Thanks, >> Jarek P. > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-13 20:39 ` Michael Breuer @ 2010-01-13 21:09 ` Jarek Poplawski 2010-01-13 21:16 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-13 21:09 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Wed, Jan 13, 2010 at 03:39:37PM -0500, Michael Breuer wrote: > Just an FYI - 2.6.32.3 with alt 3 af_packet patch & sky2 > pskb_may_pull runs OK with DMAR (re)enabled and msi enabled. Hmm... What a pity! It was such a useful debugging tool for networking ;-) BTW, I'm not sure if "runs OK" means with or without those DHCP drops & large packets you described. Thanks, Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-13 21:09 ` Jarek Poplawski @ 2010-01-13 21:16 ` Michael Breuer 2010-01-13 21:34 ` Jarek Poplawski 2010-01-17 16:26 ` Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-13 21:16 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/13/2010 4:09 PM, Jarek Poplawski wrote: > On Wed, Jan 13, 2010 at 03:39:37PM -0500, Michael Breuer wrote: > >> Just an FYI - 2.6.32.3 with alt 3 af_packet patch& sky2 >> pskb_may_pull runs OK with DMAR (re)enabled and msi enabled. >> > Hmm... What a pity! It was such a useful debugging tool for > networking ;-) BTW, I'm not sure if "runs OK" means with or without > those DHCP drops& large packets you described. > > Thanks, > Jarek P. > As of now, no errors even when blasting traffic & forcing dhcp packets as before. I haven't tried putting mtu back to 9k yet. OK means that there are no obvious differences in behavior with or without DMAR all else being equal. There were some updates made to stable that could have fixed this - I'd guess intel_iommu fixes helped. If it helps, I'm still getting one error without DMAR enabled - at startup, there's a DMA sync oops - mismatch of 72 bytes coming from sky2. That oops was posted previously - with DMAR (re) enabled, there's no related oops. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-13 21:16 ` Michael Breuer @ 2010-01-13 21:34 ` Jarek Poplawski 2010-01-17 16:26 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-13 21:34 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Wed, Jan 13, 2010 at 04:16:36PM -0500, Michael Breuer wrote: > If it helps, I'm still getting one error without DMAR enabled - at > startup, there's a DMA sync oops - mismatch of 72 bytes coming from > sky2. That oops was posted previously - with DMAR (re) enabled, > there's no related oops. I hope re-posting this oops with this information should be helpful. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-13 21:16 ` Michael Breuer 2010-01-13 21:34 ` Jarek Poplawski @ 2010-01-17 16:26 ` Michael Breuer 2010-01-17 22:17 ` Jarek Poplawski 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-17 16:26 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 01/13/2010 04:16 PM, Michael Breuer wrote: > On 1/13/2010 4:09 PM, Jarek Poplawski wrote: >> On Wed, Jan 13, 2010 at 03:39:37PM -0500, Michael Breuer wrote: >>> Just an FYI - 2.6.32.3 with alt 3 af_packet patch& sky2 >>> pskb_may_pull runs OK with DMAR (re)enabled and msi enabled. >> Hmm... What a pity! It was such a useful debugging tool for >> networking ;-) BTW, I'm not sure if "runs OK" means with or without >> those DHCP drops& large packets you described. >> >> Thanks, >> Jarek P. > As of now, no errors even when blasting traffic & forcing dhcp packets > as before. I haven't tried putting mtu back to 9k yet. OK means that > there are no obvious differences in behavior with or without DMAR all > else being equal. > > There were some updates made to stable that could have fixed this - > I'd guess intel_iommu fixes helped. > > If it helps, I'm still getting one error without DMAR enabled - at > startup, there's a DMA sync oops - mismatch of 72 bytes coming from > sky2. That oops was posted previously - with DMAR (re) enabled, > there's no related oops. Update: after leaving the system up for a few days, I hit the DMAR error again. This happened during a scheduled backup from my win7 box. A reboot was required to re-enable eth0. After the error, eth0 was receiving, but was unable to transmit. For example, the log reported arp bogons; DHCPINFORM/ACK sequences (where the ACK that was logged was not transmitted), etc. The log was filled with sky2 eth0: tx timeout messages; as well as disable/enable of eth0. I attempted to get things up again without a reboot, but failed. Even rmmod & insmod did not fix whatever was broken on the TX side. Note that this is similar to the earlier sky2 errors I had under load with the variety of patches, and with or without DMAR enabled. Just took way longer this time. Note that eth1 remained functional. Unfortunately, with the latest set of patches installed, this is no longer reproducible at will. I'd guess therefore that the patches narrowed some hole, but didn't close it. Relevant log portions: Jan 17 05:29:49 mail dhcpd: DHCPREQUEST for 10.0.0.32 from 00:26:bb:aa:15:10 (mbitouch) via eth0 Jan 17 05:29:49 mail dhcpd: DHCPACK on 10.0.0.32 to 00:26:bb:aa:15:10 (mbitouch) via eth0 Jan 17 05:36:49 mail kernel: DRHD: handling fault status reg 2 Jan 17 05:36:49 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr ffe7957fe000 Jan 17 05:36:49 mail kernel: DMAR:[fault reason 06] PTE Read access is not set Jan 17 05:36:49 mail kernel: sky2 0000:06:00.0: error interrupt status=0xc0000000 Jan 17 05:36:49 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) Jan 17 05:36:49 mail smbd[14840]: [2010/01/17 05:36:49, 0] lib/util_sock.c:539(read_fd_with_timeout) Jan 17 05:36:49 mail smbd[14840]: [2010/01/17 05:36:49, 0] lib/util_sock.c:1491(get_peer_addr_internal) Jan 17 05:36:49 mail smbd[14840]: getpeername failed. Error was Transport endpoint is not connected Jan 17 05:36:49 mail smbd[14840]: read_fd_with_timeout: client 0.0.0.0 read error = Connection timed out. Jan 17 05:37:51 mail kernel: ------------[ cut here ]------------ Jan 17 05:37:51 mail kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164() Jan 17 05:37:51 mail kernel: Hardware name: System Product Name Jan 17 05:37:51 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out Jan 17 05:37:51 mail kernel: Modules linked in: nls_utf8 cifs cpufreq_stats ip6table_mangle ip6table_filter ip6_tables iptable_raw iptable_mangle ipt_MASQUERADE iptable_nat nf_nat appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp nf_conntrack_ipv6 xt_multiport xt_DSCP xt_dscp xt_MARK ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_ens1371 gameport snd_rawmidi snd_ac97_codec snd_hwdep ac97_bus firewire_ohci snd_seq firewire_core snd_seq_device gspca_spca505 gspca_main videodev i2c_i801 snd_pcm crc_itu_t v4l1_compat pcspkr v4l2_compat_ioctl32 asus_atk0110 hwmon iTCO_wdt iTCO_vendor_support snd_timer snd soundcore sky2 snd_page_alloc wmi fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core Jan 17 05:37:51 mail kernel: cfbimgblt cfbfillrect [last unloaded: microcode] Jan 17 05:37:51 mail kernel: Pid: 0, comm: swapper Tainted: G W 2.6.32WITHMMAPNODMARAF3SKY2TXRGCLNV4TX-00893-gb5d5baa-dirty #2 Jan 17 05:37:51 mail kernel: Call Trace: Jan 17 05:37:51 mail kernel: <IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94 Jan 17 05:37:51 mail kernel: [<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43 Jan 17 05:37:51 mail kernel: [<ffffffff813e2e57>] ? netif_tx_lock+0x44/0x6c Jan 17 05:37:51 mail kernel: [<ffffffff813e2fbf>] dev_watchdog+0xf3/0x164 Jan 17 05:37:51 mail kernel: [<ffffffff8106e8a4>] ? __queue_work+0x3a/0x42 Jan 17 05:37:51 mail kernel: [<ffffffff8106316b>] run_timer_softirq+0x1c8/0x270 Jan 17 05:37:51 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd Jan 17 05:37:51 mail kernel: [<ffffffff8107ef33>] ? tick_program_event+0x2a/0x2c Jan 17 05:37:51 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 Jan 17 05:37:51 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 Jan 17 05:37:51 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c Jan 17 05:37:51 mail kernel: [<ffffffff8146f8f2>] smp_apic_timer_interrupt+0x86/0x94 Jan 17 05:37:51 mail kernel: [<ffffffff810127e3>] apic_timer_interrupt+0x13/0x20 Jan 17 05:37:51 mail kernel: <EOI> [<ffffffff812c678a>] ? acpi_idle_enter_bm+0x256/0x28a Jan 17 05:37:51 mail kernel: [<ffffffff812c6783>] ? acpi_idle_enter_bm+0x24f/0x28a Jan 17 05:37:51 mail kernel: [<ffffffff813a5f50>] ? cpuidle_idle_call+0x9e/0xfa Jan 17 05:37:51 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 Jan 17 05:37:51 mail kernel: [<ffffffff81464ed2>] ? start_secondary+0x201/0x242 Jan 17 05:37:51 mail kernel: ---[ end trace 57f7151f6a5def07 ]--- Jan 17 05:37:51 mail kernel: sky2 eth0: tx timeout Jan 17 05:37:51 mail kernel: sky2 eth0: transmit ring 85 .. 45 report=85 done=85 Jan 17 05:37:51 mail kernel: sky2 eth0: disabling interface Jan 17 05:37:51 mail kernel: sky2 eth0: enabling interface <unrelated stuff> Jan 17 05:39:14 mail kernel: sky2 eth0: tx timeout Jan 17 05:39:14 mail kernel: sky2 eth0: transmit ring 2 .. 89 report=2 done=2 Jan 17 05:39:14 mail kernel: sky2 eth0: disabling interface Jan 17 05:39:14 mail kernel: sky2 eth0: enabling interface <time passes> Jan 17 05:40:22 mail kernel: sky2 eth0: tx timeout Jan 17 05:40:22 mail kernel: sky2 eth0: transmit ring 2 .. 89 report=2 done=2 Jan 17 05:40:22 mail kernel: sky2 eth0: disabling interface Jan 17 05:40:22 mail kernel: sky2 eth0: enabling interface Jan 17 05:40:22 mail NetworkManager: <info> (eth0): carrier now OFF (device state 1) Jan 17 05:40:25 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both <time passes> Jan 17 05:42:05 mail kernel: sky2 eth0: tx timeout Jan 17 05:42:05 mail kernel: sky2 eth0: transmit ring 2 .. 89 report=2 done=2 Jan 17 05:42:05 mail kernel: sky2 eth0: disabling interface Jan 17 05:42:05 mail kernel: sky2 eth0: enabling interface Jan 17 05:42:08 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both <time passes> Jan 17 05:44:13 mail kernel: sky2 eth0: tx timeout Jan 17 05:44:13 mail kernel: sky2 eth0: transmit ring 3 .. 90 report=3 done=3 Jan 17 05:44:13 mail kernel: sky2 eth0: disabling interface Jan 17 05:44:13 mail kernel: sky2 eth0: enabling interface Jan 17 05:44:16 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both <much of the same until I rebooted> ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-17 16:26 ` Michael Breuer @ 2010-01-17 22:17 ` Jarek Poplawski 2010-01-17 22:34 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-17 22:17 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Sun, Jan 17, 2010 at 11:26:46AM -0500, Michael Breuer wrote: > On 01/13/2010 04:16 PM, Michael Breuer wrote: > >On 1/13/2010 4:09 PM, Jarek Poplawski wrote: > >>On Wed, Jan 13, 2010 at 03:39:37PM -0500, Michael Breuer wrote: > >>>Just an FYI - 2.6.32.3 with alt 3 af_packet patch& sky2 > >>>pskb_may_pull runs OK with DMAR (re)enabled and msi enabled. > >>Hmm... What a pity! It was such a useful debugging tool for > >>networking ;-) BTW, I'm not sure if "runs OK" means with or without > >>those DHCP drops& large packets you described. > >> > >>Thanks, > >>Jarek P. > >As of now, no errors even when blasting traffic & forcing dhcp > >packets as before. I haven't tried putting mtu back to 9k yet. OK > >means that there are no obvious differences in behavior with or > >without DMAR all else being equal. > > > >There were some updates made to stable that could have fixed this > >- I'd guess intel_iommu fixes helped. > > > >If it helps, I'm still getting one error without DMAR enabled - at > >startup, there's a DMA sync oops - mismatch of 72 bytes coming > >from sky2. That oops was posted previously - with DMAR (re) > >enabled, there's no related oops. > Update: after leaving the system up for a few days, I hit the DMAR > error again. My proposal is to send some summary as a new thread, with dmar in the subject, and cc-ed dmar maintainers. > This happened during a scheduled backup from my win7 > box. A reboot was required to re-enable eth0. After the error, eth0 > was receiving, but was unable to transmit. For example, the log > reported arp bogons; DHCPINFORM/ACK sequences (where the ACK that > was logged was not transmitted), etc. The log was filled with sky2 > eth0: tx timeout messages; as well as disable/enable of eth0. > > I attempted to get things up again without a reboot, but failed. > Even rmmod & insmod did not fix whatever was broken on the TX side. > > Note that this is similar to the earlier sky2 errors I had under > load with the variety of patches, and with or without DMAR enabled. > Just took way longer this time. Note that eth1 remained functional. > > Unfortunately, with the latest set of patches installed, this is no > longer reproducible at will. I'd guess therefore that the patches > narrowed some hole, but didn't close it. It would be nice to name those patches each time. Anyway, try this again without DMAR. Thanks, Jarek P. > > Relevant log portions: > > Jan 17 05:29:49 mail dhcpd: DHCPREQUEST for 10.0.0.32 from > 00:26:bb:aa:15:10 (mbitouch) via eth0 > Jan 17 05:29:49 mail dhcpd: DHCPACK on 10.0.0.32 to > 00:26:bb:aa:15:10 (mbitouch) via eth0 > Jan 17 05:36:49 mail kernel: DRHD: handling fault status reg 2 > Jan 17 05:36:49 mail kernel: DMAR:[DMA Read] Request device > [06:00.0] fault addr ffe7957fe000 > Jan 17 05:36:49 mail kernel: DMAR:[fault reason 06] PTE Read access > is not set > Jan 17 05:36:49 mail kernel: sky2 0000:06:00.0: error interrupt > status=0xc0000000 > Jan 17 05:36:49 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) ... ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-17 22:17 ` Jarek Poplawski @ 2010-01-17 22:34 ` Michael Breuer 2010-01-17 23:05 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-17 22:34 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/17/2010 5:17 PM, Jarek Poplawski wrote: > On Sun, Jan 17, 2010 at 11:26:46AM -0500, Michael Breuer wrote: > >> On 01/13/2010 04:16 PM, Michael Breuer wrote: >> >>> On 1/13/2010 4:09 PM, Jarek Poplawski wrote: >>> >>>> On Wed, Jan 13, 2010 at 03:39:37PM -0500, Michael Breuer wrote: >>>> >>>> >> Update: after leaving the system up for a few days, I hit the DMAR >> error again. >> > My proposal is to send some summary as a new thread, with dmar in the > subject, and cc-ed dmar maintainers. > > Not sure I agree. The symptoms are identical to those I hit without DMAR earlier on. Also, as this issue only happens when there is high receive load, I'm thinking there's some sort of race between TX and RX within the sky2 driver, or hardware. I think that DMAR is correctly catching the error. >> This happened during a scheduled backup from my win7 >> box. A reboot was required to re-enable eth0. After the error, eth0 >> was receiving, but was unable to transmit. For example, the log >> reported arp bogons; DHCPINFORM/ACK sequences (where the ACK that >> was logged was not transmitted), etc. The log was filled with sky2 >> eth0: tx timeout messages; as well as disable/enable of eth0. >> >> I attempted to get things up again without a reboot, but failed. >> Even rmmod& insmod did not fix whatever was broken on the TX side. >> >> Note that this is similar to the earlier sky2 errors I had under >> load with the variety of patches, and with or without DMAR enabled. >> Just took way longer this time. Note that eth1 remained functional. >> >> Unfortunately, with the latest set of patches installed, this is no >> longer reproducible at will. I'd guess therefore that the patches >> narrowed some hole, but didn't close it. >> > It would be nice to name those patches each time. Anyway, try this > again without DMAR. > > Thanks, > Jarek P. > > My bad: was running with the af_packet.c version 3 patch; and Stephen's v4 patch from last week. Both on 2.6.32 from git (so 2.6.32.4). Can't move back to head as I've hit two unrelated issues. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-17 22:34 ` Michael Breuer @ 2010-01-17 23:05 ` Jarek Poplawski 2010-01-17 23:15 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-17 23:05 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Sun, Jan 17, 2010 at 05:34:19PM -0500, Michael Breuer wrote: > On 1/17/2010 5:17 PM, Jarek Poplawski wrote: > >On Sun, Jan 17, 2010 at 11:26:46AM -0500, Michael Breuer wrote: > >>On 01/13/2010 04:16 PM, Michael Breuer wrote: > >>>On 1/13/2010 4:09 PM, Jarek Poplawski wrote: > >>>>On Wed, Jan 13, 2010 at 03:39:37PM -0500, Michael Breuer wrote: > >>>> > >>Update: after leaving the system up for a few days, I hit the DMAR > >>error again. > >My proposal is to send some summary as a new thread, with dmar in the > >subject, and cc-ed dmar maintainers. > > > Not sure I agree. The symptoms are identical to those I hit without > DMAR earlier on. Also, as this issue only happens when there is high > receive load, I'm thinking there's some sort of race between TX and > RX within the sky2 driver, or hardware. I think that DMAR is > correctly catching the error. Hmm... OK, then let's wait with this report and go back to testing it "really really long" ;-) without DMAR, and maybe without the last Stephen's patch either? (So only the two things in the current linux-2.6.) Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-17 23:05 ` Jarek Poplawski @ 2010-01-17 23:15 ` Michael Breuer 2010-01-18 7:30 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-17 23:15 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/17/2010 6:05 PM, Jarek Poplawski wrote: > On Sun, Jan 17, 2010 at 05:34:19PM -0500, Michael Breuer wrote: > >> On 1/17/2010 5:17 PM, Jarek Poplawski wrote: >> >>> On Sun, Jan 17, 2010 at 11:26:46AM -0500, Michael Breuer wrote: >>> >>>> On 01/13/2010 04:16 PM, Michael Breuer wrote: >>>> >>>>> On 1/13/2010 4:09 PM, Jarek Poplawski wrote: >>>>> >>>>>> On Wed, Jan 13, 2010 at 03:39:37PM -0500, Michael Breuer wrote: >>>>>> >>>>>> >>>> Update: after leaving the system up for a few days, I hit the DMAR >>>> error again. >>>> >>> My proposal is to send some summary as a new thread, with dmar in the >>> subject, and cc-ed dmar maintainers. >>> >>> >> Not sure I agree. The symptoms are identical to those I hit without >> DMAR earlier on. Also, as this issue only happens when there is high >> receive load, I'm thinking there's some sort of race between TX and >> RX within the sky2 driver, or hardware. I think that DMAR is >> correctly catching the error. >> > Hmm... OK, then let's wait with this report and go back to testing > it "really really long" ;-) without DMAR, and maybe without the > last Stephen's patch either? (So only the two things in the current > linux-2.6.) > > Jarek P. > Ok - but absent the last patch, I think I still need the pskb_may_pull patch... so it'd be pskb_may_pull and afpacket v3 and no DMAR. Also - not sure if related, but there's still the odd tx side behavior when RX is under load. That I CAN reproduce at will (yesterday's report - no crash, but I confirmed that DHCPOFFER packets are being dropped somewhere after wireshark sees them and before hitting the wire. I am also wondering whether or not that testing I did yesterday set up today's hang - perhaps those lost TX packets are corrupting something that manifests worse later. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-17 23:15 ` Michael Breuer @ 2010-01-18 7:30 ` Jarek Poplawski 2010-01-18 16:29 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-18 7:30 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Sun, Jan 17, 2010 at 06:15:22PM -0500, Michael Breuer wrote: > On 1/17/2010 6:05 PM, Jarek Poplawski wrote: >> On Sun, Jan 17, 2010 at 05:34:19PM -0500, Michael Breuer wrote: >> >>> On 1/17/2010 5:17 PM, Jarek Poplawski wrote: >>> >>>> On Sun, Jan 17, 2010 at 11:26:46AM -0500, Michael Breuer wrote: >>>> >>>>> On 01/13/2010 04:16 PM, Michael Breuer wrote: >>>>> >>>>>> On 1/13/2010 4:09 PM, Jarek Poplawski wrote: >>>>>> >>>>>>> On Wed, Jan 13, 2010 at 03:39:37PM -0500, Michael Breuer wrote: >>>>>>> >>>>>>> >>>>> Update: after leaving the system up for a few days, I hit the DMAR >>>>> error again. >>>>> >>>> My proposal is to send some summary as a new thread, with dmar in the >>>> subject, and cc-ed dmar maintainers. >>>> >>>> >>> Not sure I agree. The symptoms are identical to those I hit without >>> DMAR earlier on. Also, as this issue only happens when there is high >>> receive load, I'm thinking there's some sort of race between TX and >>> RX within the sky2 driver, or hardware. I think that DMAR is >>> correctly catching the error. >>> >> Hmm... OK, then let's wait with this report and go back to testing >> it "really really long" ;-) without DMAR, and maybe without the >> last Stephen's patch either? (So only the two things in the current >> linux-2.6.) >> >> Jarek P. >> > Ok - but absent the last patch, I think I still need the pskb_may_pull > patch... so it'd be pskb_may_pull and afpacket v3 and no DMAR. Exactly. Or if it's working for you already, the mainline (2.6.33-rc4) with the pskb_may_pull patch. And check for warnings from the latter. > > Also - not sure if related, but there's still the odd tx side behavior > when RX is under load. That I CAN reproduce at will (yesterday's report > - no crash, but I confirmed that DHCPOFFER packets are being dropped > somewhere after wireshark sees them and before hitting the wire. I'm not sure either, but until there is no crash it might be some minor bug or/and missing stat. Btw, you could probably try alternative test with ping from this overloaded box to the router and win7. > > I am also wondering whether or not that testing I did yesterday set up > today's hang - perhaps those lost TX packets are corrupting something > that manifests worse later. Maybe, but you wrote earlier they had to fix something around this DMAR in the meantime, because it triggered much faster during your previous tests. So, I don't know why you assume this DMAR has to be correct this time. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 7:30 ` Jarek Poplawski @ 2010-01-18 16:29 ` Michael Breuer 2010-01-18 20:46 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-18 16:29 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 01/18/2010 02:30 AM, Jarek Poplawski wrote: > On Sun, Jan 17, 2010 at 06:15:22PM -0500, Michael Breuer wrote: > >> On 1/17/2010 6:05 PM, Jarek Poplawski wrote: >> >>> On Sun, Jan 17, 2010 at 05:34:19PM -0500, Michael Breuer wrote: >>> >>> >>>> On 1/17/2010 5:17 PM, Jarek Poplawski wrote: >>>> >>>> >>>>> On Sun, Jan 17, 2010 at 11:26:46AM -0500, Michael Breuer wrote: >>>>> >>>>> >>>>>> On 01/13/2010 04:16 PM, Michael Breuer wrote: >>>>>> >>>>>> >>>>>>> On 1/13/2010 4:09 PM, Jarek Poplawski wrote: >>>>>>> >>>>>>> >>>>>>>> On Wed, Jan 13, 2010 at 03:39:37PM -0500, Michael Breuer wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> Update: after leaving the system up for a few days, I hit the DMAR >>>>>> error again. >>>>>> >>>>>> >>>>> My proposal is to send some summary as a new thread, with dmar in the >>>>> subject, and cc-ed dmar maintainers. >>>>> >>>>> >>>>> >>>> Not sure I agree. The symptoms are identical to those I hit without >>>> DMAR earlier on. Also, as this issue only happens when there is high >>>> receive load, I'm thinking there's some sort of race between TX and >>>> RX within the sky2 driver, or hardware. I think that DMAR is >>>> correctly catching the error. >>>> >>>> >>> Hmm... OK, then let's wait with this report and go back to testing >>> it "really really long" ;-) without DMAR, and maybe without the >>> last Stephen's patch either? (So only the two things in the current >>> linux-2.6.) >>> >>> Jarek P. >>> >>> >> Ok - but absent the last patch, I think I still need the pskb_may_pull >> patch... so it'd be pskb_may_pull and afpacket v3 and no DMAR. >> > Exactly. Or if it's working for you already, the mainline (2.6.33-rc4) > with the pskb_may_pull patch. And check for warnings from the latter. > > >> Also - not sure if related, but there's still the odd tx side behavior >> when RX is under load. That I CAN reproduce at will (yesterday's report >> - no crash, but I confirmed that DHCPOFFER packets are being dropped >> somewhere after wireshark sees them and before hitting the wire. >> > I'm not sure either, but until there is no crash it might be some > minor bug or/and missing stat. Btw, you could probably try alternative > test with ping from this overloaded box to the router and win7. > > >> I am also wondering whether or not that testing I did yesterday set up >> today's hang - perhaps those lost TX packets are corrupting something >> that manifests worse later. >> > Maybe, but you wrote earlier they had to fix something around this > DMAR in the meantime, because it triggered much faster during your > previous tests. So, I don't know why you assume this DMAR has to be > correct this time. > > Jarek P. > Ok - up on the two patches, no DMAR. Some early observations: 1. There's an early on MMAP oops (see below). This happens once, at the completion of the transition to runlevel 5 (I've seen it entering runlevel 3 as well). This does not recur when runlevels are subsequently changed. I do not see this when running with DMAR enabled. 2. The dropped tx packet (DHCP) is a bit harder to recreate, but it still happens. Interestingly, I initially saw no dropped packets with ping - but after I went the DCHP route and eventually reconnected, I could then cause dropped tx packets with ping. To clarify: a) start throughput b) ping device - no packet loss - this was true for the entire test run. c) start throughput again d) ping - no loss. e) drop wifi on the device & restart - first attempt worked. Repeat attempt yielded the dropped DHCPOFFER packets. After about 6 tries, the device reconnected to wifi. f) ping again (after the reconnection) - packet loss rate about 80%. g) simultaneously ping the wifi router - no loss. h) After a while, packets are no longer dropped during ping. If I manage to cause the dhcp drop again, and then ping after the device finally reconnects, packet loss is significant for a while (maybe 30 sec to a minute). Then things return to normal. Note that the packet loss continues even if the reported throughput drops to nil. i) I can't cause the initial packet loss at RX rates below about 30,000KBPS (as reported by nethogs). At rates over 40 I can reproduce this on this set of patches & config about 60% of the time. The initial sky2 oops: Jan 18 10:42:43 mail kernel: ------------[ cut here ]------------ Jan 18 10:42:43 mail kernel: WARNING: at lib/dma-debug.c:898 check_sync+0xbd/0x426() Jan 18 10:42:43 mail kernel: Hardware name: System Product Name Jan 18 10:42:43 mail kernel: sky2 0000:06:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000003249b4022] [size=98 bytes] Jan 18 10:42:43 mail kernel: Modules linked in: microcode(+) ip6table_mangle ip6table_filter ip6_tables iptable_raw iptable_mangle ipt_MASQUERADE iptable_nat nf_nat appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp nf_conntrack_ipv6 xt_multiport xt_DSCP xt_dscp xt_MARK ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_hda_intel snd_rawmidi snd_hda_codec snd_ac97_codec gspca_spca505 ac97_bus snd_hwdep snd_seq gspca_main snd_seq_device firewire_ohci videodev firewire_core v4l1_compat snd_pcm i2c_i801 pcspkr v4l2_compat_ioctl32 crc_itu_t asus_atk0110 hwmon iTCO_wdt iTCO_vendor_support wmi snd_timer snd sky2 soundcore snd_page_alloc fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbf Jan 18 10:42:43 mail kernel: illrect [last unloaded: ip6_tables] Jan 18 10:42:43 mail kernel: Pid: 0, comm: swapper Not tainted 2.6.32NOMMAPNODMARAF3SKY2PSKBMAYPULL-00893-gb5d5baa-dirty #3 Jan 18 10:42:43 mail kernel: Call Trace: Jan 18 10:42:43 mail kernel: <IRQ> [<ffffffff81053676>] warn_slowpath_common+0x7c/0x94 Jan 18 10:42:43 mail kernel: [<ffffffff810536e5>] warn_slowpath_fmt+0x41/0x43 Jan 18 10:42:43 mail kernel: [<ffffffff8127ae7d>] check_sync+0xbd/0x426 Jan 18 10:42:43 mail kernel: [<ffffffff813c5b4c>] ? __netdev_alloc_skb+0x34/0x50 Jan 18 10:42:43 mail kernel: [<ffffffff8127b539>] debug_dma_sync_single_for_cpu+0x42/0x44 Jan 18 10:42:43 mail kernel: [<ffffffff812788d7>] ? swiotlb_sync_single+0x2a/0xb6 Jan 18 10:42:43 mail kernel: [<ffffffff81278a33>] ? swiotlb_sync_single_for_cpu+0xc/0xe Jan 18 10:42:43 mail kernel: [<ffffffffa015eed6>] sky2_poll+0x4c6/0xae1 [sky2] Jan 18 10:42:43 mail kernel: [<ffffffff814673f2>] ? _spin_unlock_irqrestore+0x29/0x41 Jan 18 10:42:43 mail kernel: [<ffffffff813cc7ea>] net_rx_action+0xb5/0x1f3 Jan 18 10:42:43 mail kernel: [<ffffffff8105ae57>] __do_softirq+0xf8/0x1cd Jan 18 10:42:43 mail kernel: [<ffffffff810a2e0e>] ? handle_IRQ_event+0x119/0x12b Jan 18 10:42:43 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 Jan 18 10:42:43 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 Jan 18 10:42:43 mail kernel: [<ffffffff8105aa37>] irq_exit+0x4a/0x8c Jan 18 10:42:43 mail kernel: [<ffffffff8146b445>] do_IRQ+0xa5/0xbc Jan 18 10:42:43 mail kernel: [<ffffffff81012613>] ret_from_intr+0x0/0x16 Jan 18 10:42:43 mail kernel: <EOI> [<ffffffff812c251e>] ? acpi_idle_enter_bm+0x256/0x28a Jan 18 10:42:43 mail kernel: [<ffffffff812c2517>] ? acpi_idle_enter_bm+0x24f/0x28a Jan 18 10:42:43 mail kernel: [<ffffffff813a1b78>] ? cpuidle_idle_call+0x9e/0xfa Jan 18 10:42:43 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 Jan 18 10:42:43 mail kernel: [<ffffffff81460acf>] ? start_secondary+0x201/0x242 Jan 18 10:42:43 mail kernel: ---[ end trace 188c0cdbace3665e ]--- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 16:29 ` Michael Breuer @ 2010-01-18 20:46 ` Jarek Poplawski 2010-01-18 20:56 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-18 20:46 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Mon, Jan 18, 2010 at 11:29:31AM -0500, Michael Breuer wrote: > Ok - up on the two patches, no DMAR. Some early observations: > > 1. There's an early on MMAP oops (see below). This happens once, at > the completion of the transition to runlevel 5 (I've seen it > entering runlevel 3 as well). This does not recur when runlevels are > subsequently changed. I do not see this when running with DMAR > enabled. OK, you mentioned this oops (actually a warning only) happened during previous tests too. > > 2. The dropped tx packet (DHCP) is a bit harder to recreate, but it > still happens. Btw, I guess you improved the test because you didn't mention it here, even after my explicit question?: http://permalink.gmane.org/gmane.linux.network/149171 > Interestingly, I initially saw no dropped packets > with ping - but after I went the DCHP route and eventually > reconnected, I could then cause dropped tx packets with ping. To > clarify: > > a) start throughput > b) ping device - no packet loss - this was true for the entire test run. > c) start throughput again > d) ping - no loss. > e) drop wifi on the device & restart - first attempt worked. Repeat > attempt yielded the dropped DHCPOFFER packets. After about 6 tries, > the device reconnected to wifi. > f) ping again (after the reconnection) - packet loss rate about 80%. > g) simultaneously ping the wifi router - no loss. > h) After a while, packets are no longer dropped during ping. If I > manage to cause the dhcp drop again, and then ping after the device > finally reconnects, packet loss is significant for a while (maybe 30 > sec to a minute). Then things return to normal. Note that the packet > loss continues even if the reported throughput drops to nil. > i) I can't cause the initial packet loss at RX rates below about > 30,000KBPS (as reported by nethogs). At rates over 40 I can > reproduce this on this set of patches & config about 60% of the > time. I forgot to mention, but did you try to check if these lost ping packets are "being dropped somewhere after wireshark sees them and before hitting the wire" like DHCPOFFER? Aren't there any sky2 warnings/resets while this happens? Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 20:46 ` Jarek Poplawski @ 2010-01-18 20:56 ` Michael Breuer 2010-01-18 21:00 ` Stephen Hemminger 2010-01-18 21:25 ` Jarek Poplawski 0 siblings, 2 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-18 20:56 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/18/2010 3:46 PM, Jarek Poplawski wrote: > On Mon, Jan 18, 2010 at 11:29:31AM -0500, Michael Breuer wrote: > >> Ok - up on the two patches, no DMAR. Some early observations: >> >> 1. There's an early on MMAP oops (see below). This happens once, at >> the completion of the transition to runlevel 5 (I've seen it >> entering runlevel 3 as well). This does not recur when runlevels are >> subsequently changed. I do not see this when running with DMAR >> enabled. >> > OK, you mentioned this oops (actually a warning only) happened during > previous tests too. > Yes - dk if it's significant or not. Only obvious difference between DMAR and not. > >> 2. The dropped tx packet (DHCP) is a bit harder to recreate, but it >> still happens. >> > Btw, I guess you improved the test because you didn't mention it here, > even after my explicit question?: > http://permalink.gmane.org/gmane.linux.network/149171 > I had been focusing on the hangs - dhcp causing the initial crash from December. After things stabilized with the af patch & skb may pull I started noticing the dropped tx packets. I reported the TX loss on the 16th of January after confirming the issue. > >> Interestingly, I initially saw no dropped packets >> with ping - but after I went the DCHP route and eventually >> reconnected, I could then cause dropped tx packets with ping. To >> clarify: >> >> a) start throughput >> b) ping device - no packet loss - this was true for the entire test run. >> c) start throughput again >> d) ping - no loss. >> e) drop wifi on the device& restart - first attempt worked. Repeat >> attempt yielded the dropped DHCPOFFER packets. After about 6 tries, >> the device reconnected to wifi. >> f) ping again (after the reconnection) - packet loss rate about 80%. >> g) simultaneously ping the wifi router - no loss. >> h) After a while, packets are no longer dropped during ping. If I >> manage to cause the dhcp drop again, and then ping after the device >> finally reconnects, packet loss is significant for a while (maybe 30 >> sec to a minute). Then things return to normal. Note that the packet >> loss continues even if the reported throughput drops to nil. >> i) I can't cause the initial packet loss at RX rates below about >> 30,000KBPS (as reported by nethogs). At rates over 40 I can >> reproduce this on this set of patches& config about 60% of the >> time. >> > I forgot to mention, but did you try to check if these lost ping > packets are "being dropped somewhere after wireshark sees them and > before hitting the wire" like DHCPOFFER? Aren't there any sky2 > warnings/resets while this happens? > > Jarek P. > Yes. There are no errors, and no statistics anywhere that I know to look reflect the loss. Nothing in netstat; ethtool -S; etc. The only loss reported is RX. The recent TX warnings/resets happened while the machine was up for several days and while unattended and under high RX load. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 20:56 ` Michael Breuer @ 2010-01-18 21:00 ` Stephen Hemminger 2010-01-18 21:06 ` Jarek Poplawski 2010-01-18 21:24 ` Michael Breuer 2010-01-18 21:25 ` Jarek Poplawski 1 sibling, 2 replies; 145+ messages in thread From: Stephen Hemminger @ 2010-01-18 21:00 UTC (permalink / raw) To: Michael Breuer Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On Mon, 18 Jan 2010 15:56:45 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > >> 2. The dropped tx packet (DHCP) is a bit harder to recreate, but it > >> still happens. > >> You might want to use tc filter rule to set priority of DHCP packets higher. This would cause them to be in a separate queue and eliminate the problem. -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 21:00 ` Stephen Hemminger @ 2010-01-18 21:06 ` Jarek Poplawski 2010-01-18 21:24 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-18 21:06 UTC (permalink / raw) To: Stephen Hemminger Cc: Michael Breuer, David Miller, akpm, flyboy, linux-kernel, netdev On Mon, Jan 18, 2010 at 01:00:38PM -0800, Stephen Hemminger wrote: > On Mon, 18 Jan 2010 15:56:45 -0500 > Michael Breuer <mbreuer@majjas.com> wrote: > > > >> 2. The dropped tx packet (DHCP) is a bit harder to recreate, but it > > >> still happens. > > >> > > You might want to use tc filter rule to set priority of DHCP packets > higher. This would cause them to be in a separate queue and eliminate > the problem. > But it looks like they are dropped in TX after passing a qdisc (and wireshark). Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 21:00 ` Stephen Hemminger 2010-01-18 21:06 ` Jarek Poplawski @ 2010-01-18 21:24 ` Michael Breuer 2010-01-18 21:50 ` Jarek Poplawski 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-18 21:24 UTC (permalink / raw) To: Stephen Hemminger Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On 1/18/2010 4:00 PM, Stephen Hemminger wrote: > On Mon, 18 Jan 2010 15:56:45 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >>>> 2. The dropped tx packet (DHCP) is a bit harder to recreate, but it >>>> still happens. >>>> >>>> > You might want to use tc filter rule to set priority of DHCP packets > higher. This would cause them to be in a separate queue and eliminate > the problem. > > Ok - for fun, tried that - no change. Not sure I see why this might be a factor. The packet loss happens when TX load is low and RX high. Also, packets only being dropped if traversing a router vs.to the router itself. Keep in mind that pings to the router did not lose packets, pings through the router lost packets. The router was not under load (traffic is being generated from a device connected via the 1Gb switch, not the wifi router), and tcpdump on the router input port shows the pings to the router, but not the ones through the router. One added note, when I just tried this, the test data ended while the packet loss was occurring. The DHCPOFFER packet loss did not clear until about a minute after the throughput abated. I really think something is getting hosed, and I'd but some weird interaction with the arp logic high on the list of suspects. Not sure what else would be a factor when looking at the extra hop on the same subnet. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 21:24 ` Michael Breuer @ 2010-01-18 21:50 ` Jarek Poplawski 0 siblings, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-18 21:50 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Mon, Jan 18, 2010 at 04:24:52PM -0500, Michael Breuer wrote: > On 1/18/2010 4:00 PM, Stephen Hemminger wrote: > >On Mon, 18 Jan 2010 15:56:45 -0500 > >Michael Breuer<mbreuer@majjas.com> wrote: > > > >>>>2. The dropped tx packet (DHCP) is a bit harder to recreate, but it > >>>>still happens. > >>>> > >You might want to use tc filter rule to set priority of DHCP packets > >higher. This would cause them to be in a separate queue and eliminate > >the problem. > > > Ok - for fun, tried that - no change. Not sure I see why this might > be a factor. The packet loss happens when TX load is low and RX > high. > Also, packets only being dropped if traversing a router vs.to the > router itself. Keep in mind that pings to the router did not lose > packets, pings through the router lost packets. The router was not > under load (traffic is being generated from a device connected via > the 1Gb switch, not the wifi router), and tcpdump on the router > input port shows the pings to the router, but not the ones through > the router. > > One added note, when I just tried this, the test data ended while > the packet loss was occurring. The DHCPOFFER packet loss did not > clear until about a minute after the throughput abated. I really > think something is getting hosed, and I'd but some weird interaction > with the arp logic high on the list of suspects. Not sure what else > would be a factor when looking at the extra hop on the same subnet. Good point! Actually, IIRC, your setup might be a problem: you seem to have two switches on the path (I guess the router is a bridge for these wireless), so I wonder if it's not something between them. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 20:56 ` Michael Breuer 2010-01-18 21:00 ` Stephen Hemminger @ 2010-01-18 21:25 ` Jarek Poplawski 2010-01-18 21:39 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-18 21:25 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Mon, Jan 18, 2010 at 03:56:45PM -0500, Michael Breuer wrote: > On 1/18/2010 3:46 PM, Jarek Poplawski wrote: > >On Mon, Jan 18, 2010 at 11:29:31AM -0500, Michael Breuer wrote: > >>Ok - up on the two patches, no DMAR. Some early observations: > >> > >>1. There's an early on MMAP oops (see below). This happens once, at > >>the completion of the transition to runlevel 5 (I've seen it > >>entering runlevel 3 as well). This does not recur when runlevels are > >>subsequently changed. I do not see this when running with DMAR > >>enabled. > >OK, you mentioned this oops (actually a warning only) happened during > >previous tests too. > Yes - dk if it's significant or not. Only obvious difference between > DMAR and not. OK, let's try (as long as possible) if it can break so hard as with DMAR. > >>2. The dropped tx packet (DHCP) is a bit harder to recreate, but it > >>still happens. > >Btw, I guess you improved the test because you didn't mention it here, > >even after my explicit question?: > >http://permalink.gmane.org/gmane.linux.network/149171 > I had been focusing on the hangs - dhcp causing the initial crash > from December. After things stabilized with the af patch & skb may > pull I started noticing the dropped tx packets. I reported the TX > loss on the 16th of January after confirming the issue. OK, but we need to establish some status quo after these patches before any new things (including DMAR), so I'd suggest trying this config really longer and harder. > >>Interestingly, I initially saw no dropped packets > >>with ping - but after I went the DCHP route and eventually > >>reconnected, I could then cause dropped tx packets with ping. To > >>clarify: > >> > >>a) start throughput > >>b) ping device - no packet loss - this was true for the entire test run. > >>c) start throughput again > >>d) ping - no loss. > >>e) drop wifi on the device& restart - first attempt worked. Repeat > >>attempt yielded the dropped DHCPOFFER packets. After about 6 tries, > >>the device reconnected to wifi. > >>f) ping again (after the reconnection) - packet loss rate about 80%. > >>g) simultaneously ping the wifi router - no loss. > >>h) After a while, packets are no longer dropped during ping. If I > >>manage to cause the dhcp drop again, and then ping after the device > >>finally reconnects, packet loss is significant for a while (maybe 30 > >>sec to a minute). Then things return to normal. Note that the packet > >>loss continues even if the reported throughput drops to nil. > >>i) I can't cause the initial packet loss at RX rates below about > >>30,000KBPS (as reported by nethogs). At rates over 40 I can > >>reproduce this on this set of patches& config about 60% of the > >>time. > >I forgot to mention, but did you try to check if these lost ping > >packets are "being dropped somewhere after wireshark sees them and > >before hitting the wire" like DHCPOFFER? Aren't there any sky2 > >warnings/resets while this happens? > > > >Jarek P. > Yes. There are no errors, and no statistics anywhere that I know to > look reflect the loss. Nothing in netstat; ethtool -S; etc. The only > loss reported is RX. The recent TX warnings/resets happened while > the machine was up for several days and while unattended and under > high RX load. Please check "tc -s qdisc" each time as well. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 21:25 ` Jarek Poplawski @ 2010-01-18 21:39 ` Michael Breuer 2010-01-18 22:08 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-18 21:39 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/18/2010 4:25 PM, Jarek Poplawski wrote: > On Mon, Jan 18, 2010 at 03:56:45PM -0500, Michael Breuer wrote: > >> On 1/18/2010 3:46 PM, Jarek Poplawski wrote: >> >>> On Mon, Jan 18, 2010 at 11:29:31AM -0500, Michael Breuer wrote: >>> >>>> Ok - up on the two patches, no DMAR. Some early observations: >>>> >>>> 1. There's an early on MMAP oops (see below). This happens once, at >>>> the completion of the transition to runlevel 5 (I've seen it >>>> entering runlevel 3 as well). This does not recur when runlevels are >>>> subsequently changed. I do not see this when running with DMAR >>>> enabled. >>>> >>> OK, you mentioned this oops (actually a warning only) happened during >>> previous tests too. >>> >> Yes - dk if it's significant or not. Only obvious difference between >> DMAR and not. >> > OK, let's try (as long as possible) if it can break so hard as with > DMAR. > > >>>> 2. The dropped tx packet (DHCP) is a bit harder to recreate, but it >>>> still happens. >>>> >>> Btw, I guess you improved the test because you didn't mention it here, >>> even after my explicit question?: >>> http://permalink.gmane.org/gmane.linux.network/149171 >>> >> I had been focusing on the hangs - dhcp causing the initial crash >> from December. After things stabilized with the af patch& skb may >> pull I started noticing the dropped tx packets. I reported the TX >> loss on the 16th of January after confirming the issue. >> > OK, but we need to establish some status quo after these patches > before any new things (including DMAR), so I'd suggest trying this > config really longer and harder. > > >>>> Interestingly, I initially saw no dropped packets >>>> with ping - but after I went the DCHP route and eventually >>>> reconnected, I could then cause dropped tx packets with ping. To >>>> clarify: >>>> >>>> a) start throughput >>>> b) ping device - no packet loss - this was true for the entire test run. >>>> c) start throughput again >>>> d) ping - no loss. >>>> e) drop wifi on the device& restart - first attempt worked. Repeat >>>> attempt yielded the dropped DHCPOFFER packets. After about 6 tries, >>>> the device reconnected to wifi. >>>> f) ping again (after the reconnection) - packet loss rate about 80%. >>>> g) simultaneously ping the wifi router - no loss. >>>> h) After a while, packets are no longer dropped during ping. If I >>>> manage to cause the dhcp drop again, and then ping after the device >>>> finally reconnects, packet loss is significant for a while (maybe 30 >>>> sec to a minute). Then things return to normal. Note that the packet >>>> loss continues even if the reported throughput drops to nil. >>>> i) I can't cause the initial packet loss at RX rates below about >>>> 30,000KBPS (as reported by nethogs). At rates over 40 I can >>>> reproduce this on this set of patches& config about 60% of the >>>> time. >>>> >>> I forgot to mention, but did you try to check if these lost ping >>> packets are "being dropped somewhere after wireshark sees them and >>> before hitting the wire" like DHCPOFFER? Aren't there any sky2 >>> warnings/resets while this happens? >>> >>> Jarek P. >>> >> Yes. There are no errors, and no statistics anywhere that I know to >> look reflect the loss. Nothing in netstat; ethtool -S; etc. The only >> loss reported is RX. The recent TX warnings/resets happened while >> the machine was up for several days and while unattended and under >> high RX load. >> > Please check "tc -s qdisc" each time as well. > > Jarek P > Some output from tc -s qdisc: Before test: qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 35279532 bytes 291080 pkt (dropped 0, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 377308 bytes 3107 pkt (dropped 0, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 During test (after initial observed packet loss): qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 123389424 bytes 1781403 pkt (dropped 0, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 400862 bytes 3250 pkt (dropped 0, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 During test - while packet loss occuring: qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 150518974 bytes 2138312 pkt (dropped 0, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 422003 bytes 3432 pkt (dropped 0, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 After the conclusion of the test: qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 244900497 bytes 3416350 pkt (dropped 0, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 564380 bytes 4708 pkt (dropped 0, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 During the test, 8.9GB received; 232.9MB sent). I also connected a second device through the wifi router. I was able to ping that device w/o loss while DHCP packets were being dropped to the other connected device. Last note: just moved to 2.6.32.4 from .3 for this test (from git). ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 21:39 ` Michael Breuer @ 2010-01-18 22:08 ` Jarek Poplawski 2010-01-18 22:17 ` Jarek Poplawski 2010-01-18 22:25 ` Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-18 22:08 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Mon, Jan 18, 2010 at 04:39:24PM -0500, Michael Breuer wrote: > Some output from tc -s qdisc: ... > After the conclusion of the test: > qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 > 1 2 0 0 1 1 1 1 1 1 1 1 > Sent 244900497 bytes 3416350 pkt (dropped 0, overlimits 0 requeues 0) > rate 0bit 0pps backlog 0b 0p requeues 0 > qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 > 1 2 0 0 1 1 1 1 1 1 1 1 > Sent 564380 bytes 4708 pkt (dropped 0, overlimits 0 requeues 0) > rate 0bit 0pps backlog 0b 0p requeues 0 Great! > > > During the test, 8.9GB received; 232.9MB sent). > > I also connected a second device through the wifi router. I was able > to ping that device w/o loss while DHCP packets were being dropped > to the other connected device. Could you remind us if the problem is always with this first device? Btw, I wonder if you could test it skipping the (HP?) switch? Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 22:08 ` Jarek Poplawski @ 2010-01-18 22:17 ` Jarek Poplawski 2010-01-18 22:47 ` Michael Breuer 2010-01-18 22:25 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-18 22:17 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Mon, Jan 18, 2010 at 11:08:14PM +0100, Jarek Poplawski wrote: > Btw, I wonder if you could test it skipping the (HP?) switch? If so, then of course don't forget to try tcpdump on the router. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 22:17 ` Jarek Poplawski @ 2010-01-18 22:47 ` Michael Breuer 2010-01-19 5:46 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-18 22:47 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/18/2010 5:17 PM, Jarek Poplawski wrote: > On Mon, Jan 18, 2010 at 11:08:14PM +0100, Jarek Poplawski wrote: > >> Btw, I wonder if you could test it skipping the (HP?) switch? >> > If so, then of course don't forget to try tcpdump on the router. > > Jarek P. > Well - no.... but I'm not sure that would show anything. Setup diagram: Server->gb switch-> (100mb) wifi router -> devices | Win7 PC (gb) The problem does not occur (at least I haven't been able to recreate it) at 100mb, and the wifi router doesn't do 1Gb. I drive the traffic from the win7 PC to the server. I've seen the loss when the only traffic going through the wifi router was ping & dhcp. I've also never seen any loss on a device directly attached to the 1GB switch. I can drive load through the wifi router while driving load from the Win7 box, but don't see TX packet loss at all when not doing DHCP RELEASE/RENEW. As there is no packet loss to devices not involved in the DHCP sequence through the same path, I'm not really sure that the GB switch is implicated. As I don't have a standalone sniffer, I'm thinking that it might be easier to instrument places where the TX packet could be dropped and see at least whether it's getting to the card. Given the circumstances of the TX drop, and that it was DHCP traffic while under load that caused the oops rectified with the two patches, I'm thinking that the packet loss is the current manifestation of whatever the underlying problem is. Given the extra hop required to break things, and given that a dhcp release/renew seems to trigger things, I keep coming back to arp logic as being somehow implicated. If arp is somehow involved, then I'd expect to see manifestations under similar circumstances with other drivers. As the pskb_may_pull patch stopped the crash, perhaps other drivers do suffer packet loss and it's just not been widely noticed or attributed to the kernel - especially if the network topology is a factor. I do know people at large enterprises who have been complaining of what *could* be this same issue, however they're currently blaming their switch vendors. As most traffic is TCP, this is really only noticed by those few places deeply concerned with latency. It's likely something altogether different, but then again, maybe not. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 22:47 ` Michael Breuer @ 2010-01-19 5:46 ` Michael Breuer 2010-01-19 8:41 ` Jarek Poplawski 2010-01-19 10:47 ` Jarek Poplawski 0 siblings, 2 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-19 5:46 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/18/2010 5:47 PM, Michael Breuer wrote: > On 1/18/2010 5:17 PM, Jarek Poplawski wrote: >> On Mon, Jan 18, 2010 at 11:08:14PM +0100, Jarek Poplawski wrote: >>> Btw, I wonder if you could test it skipping the (HP?) switch? >> If so, then of course don't forget to try tcpdump on the router. >> >> Jarek P. > Well - no.... but I'm not sure that would show anything. > > Setup diagram: > > Server->gb switch-> (100mb) wifi router -> devices > | > Win7 PC (gb) > > The problem does not occur (at least I haven't been able to recreate > it) at 100mb, and the wifi router doesn't do 1Gb. I drive the traffic > from the win7 PC to the server. I've seen the loss when the only > traffic going through the wifi router was ping & dhcp. I've also never > seen any loss on a device directly attached to the 1GB switch. I can > drive load through the wifi router while driving load from the Win7 > box, but don't see TX packet loss at all when not doing DHCP > RELEASE/RENEW. > > As there is no packet loss to devices not involved in the DHCP > sequence through the same path, I'm not really sure that the GB switch > is implicated. > > As I don't have a standalone sniffer, I'm thinking that it might be > easier to instrument places where the TX packet could be dropped and > see at least whether it's getting to the card. > > Given the circumstances of the TX drop, and that it was DHCP traffic > while under load that caused the oops rectified with the two patches, > I'm thinking that the packet loss is the current manifestation of > whatever the underlying problem is. Given the extra hop required to > break things, and given that a dhcp release/renew seems to trigger > things, I keep coming back to arp logic as being somehow implicated. > > If arp is somehow involved, then I'd expect to see manifestations > under similar circumstances with other drivers. As the pskb_may_pull > patch stopped the crash, perhaps other drivers do suffer packet loss > and it's just not been widely noticed or attributed to the kernel - > especially if the network topology is a factor. I do know people at > large enterprises who have been complaining of what *could* be this > same issue, however they're currently blaming their switch vendors. As > most traffic is TCP, this is really only noticed by those few places > deeply concerned with latency. It's likely something altogether > different, but then again, maybe not. Ok - one last update for a while ...not sure what's next... I put some printk's into sky2.c xmit logic - the packets are being sent to the card, and the i/o's are completing successfully. So it would seem either the switch is dropping the packets, or else the wifi router is. As tcpdump doesn't show the packets arriving on the wifi router, I'm leaning towards the switch. I ran wireshark on the win7 box to see what is coming off the switch. I did notice one thing that's visible to the win7 box but is not showing up on the linux wireshark - before every successful dhcpoffer, there's an XID message broadcast from the device. I'm wondering why I don't see this on the linux side: The packet is from the mac of the device, dst ff:ff:ff:ff:ff:ff; protocol eth:llc... hex packet: ffffffffffff001cccf39ff600060001af810100. Now I guess I've got some reading to do... I've got no idea what the correct application of llc messages would be given my topology :(. I do suspect that the llc stuff (or lack thereof under some conditions) is causing the switch to fail to forward the dhcpoffer message. As the dhcpoffer message is not broadcast, but directed to the remote mac address and as that address is not connected directly to the switch, I'm guessing that under some conditions whatever tells the switch how to find the mac is missing. I'd guess that the wifi router should be letting the switch know around the time it forwards the first arp and/or DHCP broadcast message from the client... or maybe the linux box should be doing something before the offer. So net-net, as far as my TX packet loss issue, sky2 is in the clear. If something on the linux side should be informing the switch about something then there may still be an issue. If the wifi router should be doing something differently, then it's unfortunately likely a 2.4.37 kernel issue (That's what dd-wrt is using). ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 5:46 ` Michael Breuer @ 2010-01-19 8:41 ` Jarek Poplawski 2010-01-19 15:28 ` Michael Breuer 2010-01-21 19:48 ` Michael Breuer 2010-01-19 10:47 ` Jarek Poplawski 1 sibling, 2 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-19 8:41 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Tue, Jan 19, 2010 at 12:46:24AM -0500, Michael Breuer wrote: > So net-net, as far as my TX packet loss issue, sky2 is in the clear. If > something on the linux side should be informing the switch about > something then there may still be an issue. If the wifi router should be > doing something differently, then it's unfortunately likely a 2.4.37 > kernel issue (That's what dd-wrt is using). IMHO until there is no proof from a sniffer or some regs dumps the switch and the router are more suspicious than your NIC or linux box. Then debugging these other things isn't so much interesting from my POV ;-) Anyway, if you only want to get it working (instead of debugging), it seems you might try moving the dhcp server to the router or maybe even using two separate servers with their pools - unless I missed something in your config. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 8:41 ` Jarek Poplawski @ 2010-01-19 15:28 ` Michael Breuer 2010-01-21 19:48 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-19 15:28 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/19/2010 3:41 AM, Jarek Poplawski wrote: > On Tue, Jan 19, 2010 at 12:46:24AM -0500, Michael Breuer wrote: > >> So net-net, as far as my TX packet loss issue, sky2 is in the clear. If >> something on the linux side should be informing the switch about >> something then there may still be an issue. If the wifi router should be >> doing something differently, then it's unfortunately likely a 2.4.37 >> kernel issue (That's what dd-wrt is using). >> > IMHO until there is no proof from a sniffer or some regs dumps the > switch and the router are more suspicious than your NIC or linux box. > Then debugging these other things isn't so much interesting from my > POV ;-) > > Anyway, if you only want to get it working (instead of debugging), > it seems you might try moving the dhcp server to the router or maybe > even using two separate servers with their pools - unless I missed > something in your config. > > Jarek P. > I'm going to take a stab at debugging before creating a new subnet. Thanks. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 8:41 ` Jarek Poplawski 2010-01-19 15:28 ` Michael Breuer @ 2010-01-21 19:48 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-21 19:48 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/19/2010 3:41 AM, Jarek Poplawski wrote: > On Tue, Jan 19, 2010 at 12:46:24AM -0500, Michael Breuer wrote: > >> So net-net, as far as my TX packet loss issue, sky2 is in the clear. If >> something on the linux side should be informing the switch about >> something then there may still be an issue. If the wifi router should be >> doing something differently, then it's unfortunately likely a 2.4.37 >> kernel issue (That's what dd-wrt is using). >> > IMHO until there is no proof from a sniffer or some regs dumps the > switch and the router are more suspicious than your NIC or linux box. > Then debugging these other things isn't so much interesting from my > POV ;-) > > Anyway, if you only want to get it working (instead of debugging), > it seems you might try moving the dhcp server to the router or maybe > even using two separate servers with their pools - unless I missed > something in your config. > > Jarek P. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > Postscript: HP has confirmed that the 1400-8g switch doesn't work, "well," in a mixed 100Mb/1000Mb environment, and is dropping DHCP packets. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 5:46 ` Michael Breuer 2010-01-19 8:41 ` Jarek Poplawski @ 2010-01-19 10:47 ` Jarek Poplawski 2010-01-19 15:47 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-19 10:47 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Tue, Jan 19, 2010 at 12:46:24AM -0500, Michael Breuer wrote: > Ok - one last update for a while ...not sure what's next... I put some > printk's into sky2.c xmit logic - the packets are being sent to the ... Btw, could you try if this patch can makes difference in triggering the "lib/dma-debug.c:898" warning? Jarek P. --- drivers/net/sky2.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c index 7650f73..e02e9e9 100644 --- a/drivers/net/sky2.c +++ b/drivers/net/sky2.c @@ -2479,6 +2479,9 @@ static int sky2_status_intr(struct sky2_hw *hw, int to_do, u16 idx) port = le->css & CSS_LINK_BIT; dev = hw->dev[port]; + if (!netif_running(dev)) + continue; + sky2 = netdev_priv(dev); length = le16_to_cpu(le->length); status = le32_to_cpu(le->status); ^ permalink raw reply related [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 10:47 ` Jarek Poplawski @ 2010-01-19 15:47 ` Michael Breuer 2010-01-19 19:59 ` Jarek Poplawski 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-19 15:47 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/19/2010 5:47 AM, Jarek Poplawski wrote: > On Tue, Jan 19, 2010 at 12:46:24AM -0500, Michael Breuer wrote: > >> Ok - one last update for a while ...not sure what's next... I put some >> printk's into sky2.c xmit logic - the packets are being sent to the >> > ... > > Btw, could you try if this patch can makes difference in triggering > the "lib/dma-debug.c:898" warning? > > Jarek P. > --- > > drivers/net/sky2.c | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c > index 7650f73..e02e9e9 100644 > --- a/drivers/net/sky2.c > +++ b/drivers/net/sky2.c > @@ -2479,6 +2479,9 @@ static int sky2_status_intr(struct sky2_hw *hw, int to_do, u16 idx) > > port = le->css& CSS_LINK_BIT; > dev = hw->dev[port]; > + if (!netif_running(dev)) > + continue; > + > sky2 = netdev_priv(dev); > length = le16_to_cpu(le->length); > status = le32_to_cpu(le->status); > Still get the warning... but now 60 bytes. Jan 19 10:43:50 mail kernel: ------------[ cut here ]------------ Jan 19 10:43:50 mail kernel: WARNING: at lib/dma-debug.c:902 check_sync+0xc1/0x43f() Jan 19 10:43:50 mail kernel: Hardware name: System Product Name Jan 19 10:43:50 mail kernel: sky2 0000:04:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x0000000320a0b022] [size=60 bytes] Jan 19 10:43:50 mail kernel: Modules linked in: microcode(+) ip6table_filter ip6table_mangle ip6_tables iptable_raw iptable_mangle ipt_MASQUERADE iptable_nat nf_nat appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec snd_hda_intel ac97_bus snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm gspca_spca505 snd_timer gspca_main firewire_ohci snd soundcore videodev i2c_i801 firewire_core v4l1_compat snd_page_alloc v4l2_compat_ioctl32 pcspkr wmi crc_itu_t iTCO_wdt iTCO_vendor_support asus_atk0110 sky2 hwmon fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbf Jan 19 10:43:50 mail kernel: illrect [last unloaded: ip6_tables] Jan 19 10:43:50 mail kernel: Pid: 0, comm: swapper Not tainted 2.6.32.4MMAPNODMARAF3SKY2PSKBMAYPULL-00912-g914160d-dirty #3 Jan 19 10:43:50 mail kernel: Call Trace: Jan 19 10:43:50 mail kernel: <IRQ> [<ffffffff810536ee>] warn_slowpath_common+0x7c/0x94 Jan 19 10:43:50 mail kernel: [<ffffffff8105375d>] warn_slowpath_fmt+0x41/0x43 Jan 19 10:43:50 mail kernel: [<ffffffff8127b871>] check_sync+0xc1/0x43f Jan 19 10:43:50 mail kernel: [<ffffffff8107a683>] ? timekeeping_get_ns+0x1b/0x3d Jan 19 10:43:50 mail kernel: [<ffffffff813c6770>] ? __netdev_alloc_skb+0x34/0x50 Jan 19 10:43:50 mail kernel: [<ffffffff8127bf42>] debug_dma_sync_single_for_cpu+0x42/0x44 Jan 19 10:43:50 mail kernel: [<ffffffff812792c7>] ? swiotlb_sync_single+0x2a/0xb6 Jan 19 10:43:50 mail kernel: [<ffffffff81279423>] ? swiotlb_sync_single_for_cpu+0xc/0xe Jan 19 10:43:50 mail kernel: [<ffffffffa0155ee0>] sky2_poll+0x4d0/0xaeb [sky2] Jan 19 10:43:50 mail kernel: [<ffffffff813cd40e>] net_rx_action+0xb5/0x1f3 Jan 19 10:43:50 mail kernel: [<ffffffff8105af0f>] __do_softirq+0xf8/0x1cd Jan 19 10:43:50 mail kernel: [<ffffffff810a3006>] ? handle_IRQ_event+0x119/0x12b Jan 19 10:43:50 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30 Jan 19 10:43:50 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6 Jan 19 10:43:50 mail kernel: [<ffffffff8105aaef>] irq_exit+0x4a/0x8c Jan 19 10:43:50 mail kernel: [<ffffffff8146c095>] do_IRQ+0xa5/0xbc Jan 19 10:43:50 mail kernel: [<ffffffff81012613>] ret_from_intr+0x0/0x16 Jan 19 10:43:50 mail kernel: <EOI> [<ffffffff812c2f6e>] ? acpi_idle_enter_bm+0x256/0x28a Jan 19 10:43:50 mail kernel: [<ffffffff812c2f67>] ? acpi_idle_enter_bm+0x24f/0x28a Jan 19 10:43:50 mail kernel: [<ffffffff813a279c>] ? cpuidle_idle_call+0x9e/0xfa Jan 19 10:43:50 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6 Jan 19 10:43:50 mail kernel: [<ffffffff814616d2>] ? start_secondary+0x201/0x242 Jan 19 10:43:50 mail kernel: ---[ end trace 310dbc8966610916 ]--- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 15:47 ` Michael Breuer @ 2010-01-19 19:59 ` Jarek Poplawski 2010-01-19 20:06 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-19 19:59 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Tue, Jan 19, 2010 at 10:47:27AM -0500, Michael Breuer wrote: ... > Still get the warning... but now 60 bytes. > Jan 19 10:43:50 mail kernel: ------------[ cut here ]------------ > Jan 19 10:43:50 mail kernel: WARNING: at lib/dma-debug.c:902 Thanks for trying this. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 19:59 ` Jarek Poplawski @ 2010-01-19 20:06 ` Michael Breuer 2010-01-19 20:29 ` Jarek Poplawski 2010-01-19 22:45 ` Jarek Poplawski 0 siblings, 2 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-19 20:06 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/19/2010 2:59 PM, Jarek Poplawski wrote: > On Tue, Jan 19, 2010 at 10:47:27AM -0500, Michael Breuer wrote: > ... > >> Still get the warning... but now 60 bytes. >> Jan 19 10:43:50 mail kernel: ------------[ cut here ]------------ >> Jan 19 10:43:50 mail kernel: WARNING: at lib/dma-debug.c:902 >> > Thanks for trying this. > Jarek P. > NP. Let me know what else I can do to help with this one. Note that it seems innocuous. I'll continue with the older set and no dmar for now and see if I get another crash. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 20:06 ` Michael Breuer @ 2010-01-19 20:29 ` Jarek Poplawski 2010-01-19 22:45 ` Jarek Poplawski 1 sibling, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-19 20:29 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Tue, Jan 19, 2010 at 03:06:01PM -0500, Michael Breuer wrote: > On 1/19/2010 2:59 PM, Jarek Poplawski wrote: > >On Tue, Jan 19, 2010 at 10:47:27AM -0500, Michael Breuer wrote: > >... > >>Still get the warning... but now 60 bytes. > >>Jan 19 10:43:50 mail kernel: ------------[ cut here ]------------ > >>Jan 19 10:43:50 mail kernel: WARNING: at lib/dma-debug.c:902 > >Thanks for trying this. > >Jarek P. > NP. Let me know what else I can do to help with this one. Note that > it seems innocuous. I'll continue with the older set and no dmar for > now and see if I get another crash. I'll try to look at it more. Since, as you predicted, this might be connected with dmar problems, it needs some ideas for debugging/fixing and until this time you could give it a rest from crashing. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 20:06 ` Michael Breuer 2010-01-19 20:29 ` Jarek Poplawski @ 2010-01-19 22:45 ` Jarek Poplawski 2010-01-20 1:01 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2010-01-19 22:45 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Tue, Jan 19, 2010 at 03:06:01PM -0500, Michael Breuer wrote: > On 1/19/2010 2:59 PM, Jarek Poplawski wrote: > >On Tue, Jan 19, 2010 at 10:47:27AM -0500, Michael Breuer wrote: > >... > >>Still get the warning... but now 60 bytes. > >>Jan 19 10:43:50 mail kernel: ------------[ cut here ]------------ > >>Jan 19 10:43:50 mail kernel: WARNING: at lib/dma-debug.c:902 > >Thanks for trying this. > >Jarek P. > NP. Let me know what else I can do to help with this one. Note that > it seems innocuous. I'll continue with the older set and no dmar for > now and see if I get another crash. Maybe one more idea to try (not compiled). Jarek P. --- drivers/net/sky2.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c index 7650f73..cdebdd3 100644 --- a/drivers/net/sky2.c +++ b/drivers/net/sky2.c @@ -2252,12 +2252,14 @@ static struct sk_buff *receive_copy(struct sky2_port *sky2, skb = netdev_alloc_skb_ip_align(sky2->netdev, length); if (likely(skb)) { pci_dma_sync_single_for_cpu(sky2->hw->pdev, re->data_addr, - length, PCI_DMA_FROMDEVICE); + pci_unmap_len(re, data_size), + PCI_DMA_FROMDEVICE); skb_copy_from_linear_data(re->skb, skb->data, length); skb->ip_summed = re->skb->ip_summed; skb->csum = re->skb->csum; pci_dma_sync_single_for_device(sky2->hw->pdev, re->data_addr, - length, PCI_DMA_FROMDEVICE); + pci_unmap_len(re, data_size), + PCI_DMA_FROMDEVICE); re->skb->ip_summed = CHECKSUM_NONE; skb_put(skb, length); } ^ permalink raw reply related [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-19 22:45 ` Jarek Poplawski @ 2010-01-20 1:01 ` Michael Breuer 2010-01-20 1:10 ` Stephen Hemminger 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-20 1:01 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/19/2010 5:45 PM, Jarek Poplawski wrote: > On Tue, Jan 19, 2010 at 03:06:01PM -0500, Michael Breuer wrote: > >> On 1/19/2010 2:59 PM, Jarek Poplawski wrote: >> >>> On Tue, Jan 19, 2010 at 10:47:27AM -0500, Michael Breuer wrote: >>> ... >>> >>>> Still get the warning... but now 60 bytes. >>>> Jan 19 10:43:50 mail kernel: ------------[ cut here ]------------ >>>> Jan 19 10:43:50 mail kernel: WARNING: at lib/dma-debug.c:902 >>>> >>> Thanks for trying this. >>> Jarek P. >>> >> NP. Let me know what else I can do to help with this one. Note that >> it seems innocuous. I'll continue with the older set and no dmar for >> now and see if I get another crash. >> > Maybe one more idea to try (not compiled). > > Jarek P. > --- > > drivers/net/sky2.c | 6 ++++-- > 1 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c > index 7650f73..cdebdd3 100644 > --- a/drivers/net/sky2.c > +++ b/drivers/net/sky2.c > @@ -2252,12 +2252,14 @@ static struct sk_buff *receive_copy(struct sky2_port *sky2, > skb = netdev_alloc_skb_ip_align(sky2->netdev, length); > if (likely(skb)) { > pci_dma_sync_single_for_cpu(sky2->hw->pdev, re->data_addr, > - length, PCI_DMA_FROMDEVICE); > + pci_unmap_len(re, data_size), > + PCI_DMA_FROMDEVICE); > skb_copy_from_linear_data(re->skb, skb->data, length); > skb->ip_summed = re->skb->ip_summed; > skb->csum = re->skb->csum; > pci_dma_sync_single_for_device(sky2->hw->pdev, re->data_addr, > - length, PCI_DMA_FROMDEVICE); > + pci_unmap_len(re, data_size), > + PCI_DMA_FROMDEVICE); > re->skb->ip_summed = CHECKSUM_NONE; > skb_put(skb, length); > } > That not only compiled, but it cleared the error as well. Additionally, I used to see a bit of a delay receiving the login prompt when first connecting to the box by ssh. That delay is gone with this patch. I'd guess that the warning wasn't quite as innocuous as I thought. Note: tested on 2.6.32.4. I'll leave this up for a bit before attempting to move back to head. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-20 1:01 ` Michael Breuer @ 2010-01-20 1:10 ` Stephen Hemminger 2010-01-21 16:14 ` Stefan Richter 0 siblings, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2010-01-20 1:10 UTC (permalink / raw) To: Michael Breuer Cc: Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev On Tue, 19 Jan 2010 20:01:10 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > On 1/19/2010 5:45 PM, Jarek Poplawski wrote: > > On Tue, Jan 19, 2010 at 03:06:01PM -0500, Michael Breuer wrote: > > > >> On 1/19/2010 2:59 PM, Jarek Poplawski wrote: > >> > >>> On Tue, Jan 19, 2010 at 10:47:27AM -0500, Michael Breuer wrote: > >>> ... > >>> > >>>> Still get the warning... but now 60 bytes. > >>>> Jan 19 10:43:50 mail kernel: ------------[ cut here ]------------ > >>>> Jan 19 10:43:50 mail kernel: WARNING: at lib/dma-debug.c:902 > >>>> > >>> Thanks for trying this. > >>> Jarek P. > >>> > >> NP. Let me know what else I can do to help with this one. Note that > >> it seems innocuous. I'll continue with the older set and no dmar for > >> now and see if I get another crash. > >> > > Maybe one more idea to try (not compiled). > > > > Jarek P. > > --- > > > > drivers/net/sky2.c | 6 ++++-- > > 1 files changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c > > index 7650f73..cdebdd3 100644 > > --- a/drivers/net/sky2.c > > +++ b/drivers/net/sky2.c > > @@ -2252,12 +2252,14 @@ static struct sk_buff *receive_copy(struct sky2_port *sky2, > > skb = netdev_alloc_skb_ip_align(sky2->netdev, length); > > if (likely(skb)) { > > pci_dma_sync_single_for_cpu(sky2->hw->pdev, re->data_addr, > > - length, PCI_DMA_FROMDEVICE); > > + pci_unmap_len(re, data_size), > > + PCI_DMA_FROMDEVICE); > > skb_copy_from_linear_data(re->skb, skb->data, length); > > skb->ip_summed = re->skb->ip_summed; > > skb->csum = re->skb->csum; > > pci_dma_sync_single_for_device(sky2->hw->pdev, re->data_addr, > > - length, PCI_DMA_FROMDEVICE); > > + pci_unmap_len(re, data_size), > > + PCI_DMA_FROMDEVICE); > > re->skb->ip_summed = CHECKSUM_NONE; > > skb_put(skb, length); > > } > > > That not only compiled, but it cleared the error as well. Additionally, > I used to see a bit of a delay receiving the login prompt when first > connecting to the box by ssh. That delay is gone with this patch. I'd > guess that the warning wasn't quite as innocuous as I thought. > Note: tested on 2.6.32.4. I'll leave this up for a bit before > attempting to move back to head. Seems like an underlying bug in the DMA api. Maybe it just can't handle operations on partial mapping. Other drivers with same problem: bnx2, cassini, pcnet32, r8169, rrunner, skge, sungem, tg3, -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-20 1:10 ` Stephen Hemminger @ 2010-01-21 16:14 ` Stefan Richter 2010-01-21 16:50 ` Stefan Richter 0 siblings, 1 reply; 145+ messages in thread From: Stefan Richter @ 2010-01-21 16:14 UTC (permalink / raw) To: Stephen Hemminger Cc: Michael Breuer, Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev Stephen Hemminger wrote: > On Tue, 19 Jan 2010 20:01:10 -0500 > Michael Breuer <mbreuer@majjas.com> wrote: >> On 1/19/2010 5:45 PM, Jarek Poplawski wrote: [...] >> > --- a/drivers/net/sky2.c >> > +++ b/drivers/net/sky2.c >> > @@ -2252,12 +2252,14 @@ static struct sk_buff *receive_copy(struct sky2_port *sky2, >> > skb = netdev_alloc_skb_ip_align(sky2->netdev, length); >> > if (likely(skb)) { >> > pci_dma_sync_single_for_cpu(sky2->hw->pdev, re->data_addr, >> > - length, PCI_DMA_FROMDEVICE); >> > + pci_unmap_len(re, data_size), >> > + PCI_DMA_FROMDEVICE); >> > skb_copy_from_linear_data(re->skb, skb->data, length); >> > skb->ip_summed = re->skb->ip_summed; >> > skb->csum = re->skb->csum; >> > pci_dma_sync_single_for_device(sky2->hw->pdev, re->data_addr, >> > - length, PCI_DMA_FROMDEVICE); >> > + pci_unmap_len(re, data_size), >> > + PCI_DMA_FROMDEVICE); >> > re->skb->ip_summed = CHECKSUM_NONE; >> > skb_put(skb, length); >> > } >> > >> That not only compiled, but it cleared the error as well. Additionally, >> I used to see a bit of a delay receiving the login prompt when first >> connecting to the box by ssh. That delay is gone with this patch. I'd >> guess that the warning wasn't quite as innocuous as I thought. >> Note: tested on 2.6.32.4. I'll leave this up for a bit before >> attempting to move back to head. > > Seems like an underlying bug in the DMA api. Maybe it just can't > handle operations on partial mapping. > > Other drivers with same problem: > bnx2, cassini, pcnet32, r8169, rrunner, skge, sungem, tg3, I didn't read the start of this thread and may misunderstand the issue, but I comment anyway...: Documentation/DMA-API.txt says the following about (pci_)dma_sync_single and friends: "All the parameters must be the same as those passed into the single mapping API." http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/DMA-API.txt;h=5aceb88b3f8b622c8c23ed288f37f42de705c904;hb=HEAD#l367 So, it might be considered an API bug --- but it is a documented one. :-) -- Stefan Richter -=====-==-=- ---= =-=-= http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-21 16:14 ` Stefan Richter @ 2010-01-21 16:50 ` Stefan Richter 0 siblings, 0 replies; 145+ messages in thread From: Stefan Richter @ 2010-01-21 16:50 UTC (permalink / raw) To: Stephen Hemminger Cc: Michael Breuer, Jarek Poplawski, David Miller, akpm, flyboy, linux-kernel, netdev Stefan Richter wrote: > I didn't read the start of this thread and may misunderstand the issue, > but I comment anyway...: > > Documentation/DMA-API.txt says the following about (pci_)dma_sync_single > and friends: "All the parameters must be the same as those passed into > the single mapping API." > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/DMA-API.txt;h=5aceb88b3f8b622c8c23ed288f37f42de705c904;hb=HEAD#l367 > > So, it might be considered an API bug --- but it is a documented one. :-) Disregard my comment. I just noticed that there is more qualified discussion of this in the thread "[PATCH] sky2: Fix WARNING: at lib/dma-debug.c:902 check_sync". -- Stefan Richter -=====-==-=- ---= =-=-= http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 22:08 ` Jarek Poplawski 2010-01-18 22:17 ` Jarek Poplawski @ 2010-01-18 22:25 ` Michael Breuer 2010-01-18 22:40 ` Jarek Poplawski 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-18 22:25 UTC (permalink / raw) To: Jarek Poplawski Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On 1/18/2010 5:08 PM, Jarek Poplawski wrote: > On Mon, Jan 18, 2010 at 04:39:24PM -0500, Michael Breuer wrote: > >> Some output from tc -s qdisc: >> > ... > >> After the conclusion of the test: >> qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 >> 1 2 0 0 1 1 1 1 1 1 1 1 >> Sent 244900497 bytes 3416350 pkt (dropped 0, overlimits 0 requeues 0) >> rate 0bit 0pps backlog 0b 0p requeues 0 >> qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 >> 1 2 0 0 1 1 1 1 1 1 1 1 >> Sent 564380 bytes 4708 pkt (dropped 0, overlimits 0 requeues 0) >> rate 0bit 0pps backlog 0b 0p requeues 0 >> > Great! > > >> >> During the test, 8.9GB received; 232.9MB sent). >> >> I also connected a second device through the wifi router. I was able >> to ping that device w/o loss while DHCP packets were being dropped >> to the other connected device. >> > Could you remind us if the problem is always with this first device? > Btw, I wonder if you could test it skipping the (HP?) switch? > > Jarek P. > No - can be any device connected via the wifi router - wired or wireless. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() 2010-01-18 22:25 ` Michael Breuer @ 2010-01-18 22:40 ` Jarek Poplawski 0 siblings, 0 replies; 145+ messages in thread From: Jarek Poplawski @ 2010-01-18 22:40 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, David Miller, akpm, flyboy, linux-kernel, netdev On Mon, Jan 18, 2010 at 05:25:43PM -0500, Michael Breuer wrote: > On 1/18/2010 5:08 PM, Jarek Poplawski wrote: > >Could you remind us if the problem is always with this first device? > >Btw, I wonder if you could test it skipping the (HP?) switch? > > > >Jarek P. > No - can be any device connected via the wifi router - wired or wireless. Anyway, if it can't be repeated with this Win7 box or even the router getting dhcp itself, then this router's interaction seems most suspected. Jarek P. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-26 20:37 ` Michael Breuer 2009-12-26 22:05 ` [PATCH] sky2: make sure ethernet header is in transmit skb Stephen Hemminger @ 2009-12-27 17:03 ` Michael Breuer 2009-12-27 18:22 ` Stephen Hemminger 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-27 17:03 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On 12/26/2009 3:37 PM, Michael Breuer wrote: > On 12/26/2009 12:57 PM, Stephen Hemminger wrote: >> On Fri, 25 Dec 2009 22:23:51 -0500 >> Michael Breuer<mbreuer@majjas.com> wrote: >> >>> On 12/25/2009 6:22 PM, Stephen Hemminger wrote: >>>> On Fri, 25 Dec 2009 11:28:55 -0500 >>>> Michael Breuer<mbreuer@majjas.com> wrote: >>>> >>>> >>>>> More data points - I'm able to reliably recreate this now. >>>>> While I thought it was coincidence, each and every time I hit this >>>>> issue >>>>> there is a DHCP renew event immediately before the first error. >>>>> The crash occurs while under load - in my case seems that the >>>>> traffic is >>>>> actually IPV6 (hadn't noticed that before). >>>>> I ran nethogs on a remote display - the reported rx rate on the >>>>> IPV6 smb >>>>> connection at the time of the lockup was 33889.688 KB/sec on a 1gbit >>>>> nic. I've got two events like this - don't recall if the earlier >>>>> one was >>>>> the exact same # - but it was in the ballpark. >>>>> >>>>> On 12/24/2009 2:01 AM, Andrew Morton wrote: >>>>> >>>>>> cc's added again. >>>>>> >>>>>> On Wed, 23 Dec 2009 17:54:27 -0500 Michael >>>>>> Breuer<mbreuer@majjas.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Ok - not the firmware. Ran another Windows backup and sky2 went >>>>>>> down. >>>>>>> >>>>>>> Nothing in dmesg.old - have oops in syslog. System became >>>>>>> unresponsive >>>>>>> and watchdog kicked in after a minute. >>>>>>> >>>>>>> Also note that I have a similar oops with VT-D disabled (posted >>>>>>> here on >>>>>>> 12/5). I'm attaching the oops from that below this oops for >>>>>>> comparison. >>>>>>> That also happened under similar load. >>>>>>> >>>>>>> On the assumption that I can recreate this (although it takes a >>>>>>> while) >>>>>>> please let me know how I can help. >>>>>>> >>>>>>> What's in my log (starting with an smbd error about 2 min before >>>>>>> the >>>>>>> oops (note: the dchpd is not the system doing the backup). >>>>>>> >>>>>>> >>>>>> This (nastily wordwrapped) oops appers to be quite different from >>>>>> Berck's one. >>>>>> >>>>>> >>>>>> >>>> What is the MTU? >>>> >>> 1500 >>>>> >> It looks like the problem only shows up for packets generated by DHCP, >> and these come through AF_PACKET. The problem maybe related to how this >> packets are fragmented into header and page, in a different way than >> other >> packets confusing the driver or DMA engine. >> >> Does this help? >> ----- >> >> --- a/drivers/net/sky2.c 2009-12-26 09:50:20.869565022 -0800 >> +++ b/drivers/net/sky2.c 2009-12-26 09:55:54.620645355 -0800 >> @@ -1616,6 +1616,13 @@ static netdev_tx_t sky2_xmit_frame(struc >> if (unlikely(tx_avail(sky2)< tx_le_req(skb))) >> return NETDEV_TX_BUSY; >> >> + if (!pskb_may_pull(skb, ETH_HLEN)) { >> + if (net_ratelimit()) >> + pr_info(PFX "%s: packet missing ether header (%d)?", >> + dev->name, skb->len); >> + goto drop; >> + } >> + >> len = skb_headlen(skb); >> mapping = pci_map_single(hw->pdev, skb->data, len, >> PCI_DMA_TODEVICE); >> >> @@ -1761,6 +1768,7 @@ mapping_unwind: >> mapping_error: >> if (net_ratelimit()) >> dev_warn(&hw->pdev->dev, "%s: tx mapping error\n", dev->name); >> +drop: >> dev_kfree_skb(skb); >> return NETDEV_TX_OK; >> } >> >> >> >> > That seems to have done the trick! > > Still one odd message sequence, but no hangs or crashes. > > The first time I forced a DHCP renew while running at high throughput, > I got the same SMB errors I saw in my original error log (pre-crash). > This only happened once: > Dec 26 15:24:18 mail dhcpd: DHCPACK on 10.0.0.56 to 00:1c:cc:f3:9f:f6 > (BLACKBERRY-9542) via eth0 > Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] > lib/util_sock.c:1564(matchname) > Dec 26 15:24:25 mail smbd[8937]: matchname: host name/address > mismatch: ::ffff:10.0.0.11 != potter.majjas.com > Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] > lib/util_sock.c:1685(get_peer_name) > Dec 26 15:24:25 mail smbd[8937]: Matchname failed on > potter.majjas.com ::ffff:10.0.0.11 > Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] > smbd/nttrans.c:2076(call_nt_transact_ioctl) > Dec 26 15:24:25 mail smbd[8937]: call_nt_transact_ioctl(0x900eb): > Currently not implemented. > > I would discount this, but the same sequence was present in the logs > pre-crash as well. I do not see this at all absent the preceding DHCP > renew sequence. I also don't see this unless the adapter is under load. > > > Just an FYI - I left stuff running overnight and saw lots of this this AM... ans one from last night: Dec 26 21:39:54 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 27 05:09:42 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 27 05:09:44 mail kernel: __ratelimit: 7 callbacks suppressed ... repeats for a couple of minutes. Never saw this before, but then again, the system didn't stay up before. I've also got this in dmesg (command, not /var/log/dmesg): sky2 eth0: rx error, status 0x10660010 length 1518 sky2 eth0: rx error, status 0x10c60010 length 1518 sky2 eth0: rx error, status 0x14860010 length 1518 sky2 eth0: rx error, status 0x19e60010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x10660010 length 1518 sky2 eth0: rx error, status 0x10a60010 length 1518 sky2 eth0: rx error, status 0x10a60010 length 1518 sky2 eth0: rx error, status 0x11060010 length 1518 sky2 eth0: rx error, status 0x11060010 length 1518 DMA-API: debugging out of memory - disabling ... then later ... sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 sky2 eth0: rx error, status 0x23360010 length 1518 DRHD: handling fault status reg 2 DMAR:[DMA Write] Request device [06:00.0] fault addr fff82da28000 DMAR:[fault reason 05] PTE Write access is not set DRHD: handling fault status reg 102 DMAR:[DMA Write] Request device [06:00.0] fault addr fff82da03000 DMAR:[fault reason 05] PTE Write access is not set DRHD: handling fault status reg 202 DMAR:[DMA Write] Request device [06:00.0] fault addr fff82d9ec000 DMAR:[fault reason 05] PTE Write access is not set DRHD: handling fault status reg 302 DMAR:[DMA Write] Request device [06:00.0] fault addr fff82d9a0000 DMAR:[fault reason 05] PTE Write access is not set DRHD: handling fault status reg 402 DMAR:[DMA Write] Request device [06:00.0] fault addr fff82d968000 DMAR:[fault reason 05] PTE Write access is not set ------------[ cut here ]------------ WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164() Hardware name: System Product Name NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out Modules linked in: ip6table_mangle ip6table_filter ip6_tables iptable_raw iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp appletalk psnap llc fbcon tileblit font bitblit softcursor nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc hwmon_vid coretemp acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp nf_conntrack_ipv6 xt_multiport xt_DSCP xt_dscp xt_MARK ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi gspca_spca505 snd_hda_intel snd_ac97_codec gspca_main snd_hda_codec videodev v4l1_compat snd_hwdep ac97_bus snd_seq v4l2_compat_ioctl32 snd_seq_device snd_pcm firewire_ohci pcspkr firewire_core snd_timer snd soundcore snd_page_alloc wmi sky2 asus_atk0110 i2c_i801 crc_itu_t hwmon iTCO_wdt iTCO_vendor_support raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: ip6_tables] Pid: 25, comm: ksoftirqd/7 Tainted: G W 2.6.32-00830-gca45ac4-dirty #3 Call Trace: <IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94 [<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43 [<ffffffff813e2dd7>] ? netif_tx_lock+0x44/0x6c [<ffffffff813e2f3f>] dev_watchdog+0xf3/0x164 [<ffffffff8146b76a>] ? _spin_unlock_irqrestore+0x29/0x41 [<ffffffff8103c62e>] ? check_preempt_curr_idle+0x15/0x17 [<ffffffff8104357d>] ? task_rq_unlock+0x11/0x13 [<ffffffff8104e050>] ? try_to_wake_up+0x2a2/0x2b4 [<ffffffff8106316b>] run_timer_softirq+0x1c8/0x270 [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd [<ffffffff81012e1c>] call_softirq+0x1c/0x30 <EOI> [<ffffffff810143a3>] do_softirq+0x4b/0xa6 [<ffffffff8105a730>] ksoftirqd+0x65/0xdc [<ffffffff8105a6cb>] ? ksoftirqd+0x0/0xdc [<ffffffff8107275c>] kthread+0x7f/0x87 [<ffffffff81012d1a>] child_rip+0xa/0x20 [<ffffffff8101267d>] ? restore_args+0x0/0x30 [<ffffffff810726dd>] ? kthread+0x0/0x87 [<ffffffff81012d10>] ? child_rip+0x0/0x20 ---[ end trace 57f7151f6a5def07 ]--- sky2 eth0: tx timeout sky2 eth0: transmit ring 27 .. 114 report=27 done=27 sky2 eth0: disabling interface sky2 eth0: enabling interface sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x40000008 sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x40000008 sky2 0000:06:00.0: error interrupt status=0x40000008 sky2 0000:06:00.0: error interrupt status=0x40000008 sky2 0000:06:00.0: error interrupt status=0x40000008 sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x40000008 sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x8 iint_free: readcount: 1 iint_free: opencount: 1 iint_free: readcount: 1 iint_free: opencount: 1 sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x40000008 sky2 0000:06:00.0: error interrupt status=0x40000008 sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x8 sky2 0000:06:00.0: error interrupt status=0x8 ... this repeats for a while I'm guessing that the dmesg and kernel messages correspond. System stayed up and everything seems ok now. There were connections open at the time of these events and they appear to have remained up. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-27 17:03 ` sky2 panic in 2.6.32.1 under load (new oops) Michael Breuer @ 2009-12-27 18:22 ` Stephen Hemminger 2009-12-27 19:39 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2009-12-27 18:22 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On Sun, 27 Dec 2009 12:03:52 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > Just an FYI - I left stuff running overnight and saw lots of this this > AM... ans one from last night: > Dec 26 21:39:54 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 You have hardware/firmware problem, that is beyond the driver. That status indicates: Parity Error/MAC - "This interrupt is intended to indicate a panic event (hardware fault)" What is your hardware configuration (lspci, kernel config, memory, cpu)? ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-27 18:22 ` Stephen Hemminger @ 2009-12-27 19:39 ` Michael Breuer 2009-12-29 17:30 ` Stephen Hemminger 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-27 19:39 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev [-- Attachment #1: Type: text/plain, Size: 995 bytes --] On 12/27/2009 01:22 PM, Stephen Hemminger wrote: > On Sun, 27 Dec 2009 12:03:52 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> Just an FYI - I left stuff running overnight and saw lots of this this >> AM... ans one from last night: >> Dec 26 21:39:54 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> > You have hardware/firmware problem, that is beyond the driver. > That status indicates: > Parity Error/MAC - "This interrupt is intended to indicate a panic event (hardware fault)" > > What is your hardware configuration (lspci, kernel config, memory, cpu)? > > Asus p6t Deluxe V2 (X58 chipset) -VT-D enabled; 12Gb; Intel i7 920, ht enabled. I've attached the lspci and kernel config. Also, for what it's worth, a windows backup kicked just before this happened (scheduled for 5AM) - so consistent with what was crashing the system before the patch. [-- Attachment #2: info.tar.bz2 --] [-- Type: application/x-bzip2, Size: 20114 bytes --] ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-27 19:39 ` Michael Breuer @ 2009-12-29 17:30 ` Stephen Hemminger 2009-12-29 17:39 ` Michael Breuer ` (2 more replies) 0 siblings, 3 replies; 145+ messages in thread From: Stephen Hemminger @ 2009-12-29 17:30 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev There could be DMA problems, either because chip isn't remembering upper address, or because of lack of wires from chip to memory controller; there have been problems on motherboards where I/O devices couldn't access all of memory. Try this, it forces each transmit to have full 64 bit address. It drops an optimization that was done to reduce the number of transmit descriptors used. --- a/drivers/net/sky2.c 2009-12-29 09:23:37.114074275 -0800 +++ b/drivers/net/sky2.c 2009-12-29 09:26:38.699912035 -0800 @@ -1038,10 +1038,11 @@ static void tx_init(struct sky2_port *sk sky2->tx_tcpsum = 0; sky2->tx_last_mss = 0; - le = get_tx_le(sky2, &sky2->tx_prod); - le->addr = 0; - le->opcode = OP_ADDR64 | HW_OWNER; - sky2->tx_last_upper = 0; + if (sizeof(dma_addr_t) == sizeof(u32)) { + le = get_tx_le(sky2, &sky2->tx_prod); + le->addr = 0; + le->opcode = OP_ADDR64 | HW_OWNER; + } } /* Update chip's next pointer */ @@ -1669,10 +1670,9 @@ static netdev_tx_t sky2_xmit_frame(struc /* Send high bits if needed */ upper = upper_32_bits(mapping); - if (upper != sky2->tx_last_upper) { + if (sizeof(mapping) > sizeof(u32)) { le = get_tx_le(sky2, &slot); le->addr = cpu_to_le32(upper); - sky2->tx_last_upper = upper; le->opcode = OP_ADDR64 | HW_OWNER; } @@ -1762,10 +1762,9 @@ static netdev_tx_t sky2_xmit_frame(struc goto mapping_unwind; upper = upper_32_bits(mapping); - if (upper != sky2->tx_last_upper) { + if (sizeof(mapping) > sizeof(u32)) { le = get_tx_le(sky2, &slot); le->addr = cpu_to_le32(upper); - sky2->tx_last_upper = upper; le->opcode = OP_ADDR64 | HW_OWNER; } --- a/drivers/net/sky2.h 2009-12-29 09:23:37.164072886 -0800 +++ b/drivers/net/sky2.h 2009-12-29 09:25:15.302197993 -0800 @@ -2191,7 +2191,6 @@ struct sky2_port { u16 tx_pending; u16 tx_last_mss; - u32 tx_last_upper; u32 tx_tcpsum; struct rx_ring_info *rx_ring ____cacheline_aligned_in_smp; ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-29 17:30 ` Stephen Hemminger @ 2009-12-29 17:39 ` Michael Breuer 2009-12-29 18:38 ` Michael Breuer 2009-12-29 18:54 ` Michael Breuer 2 siblings, 0 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-29 17:39 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev Ok - I can also try with VT-D disabled for another data point. On 12/29/2009 12:30 PM, Stephen Hemminger wrote: > There could be DMA problems, either because chip isn't remembering > upper address, or because of lack of wires from chip to memory controller; > there have been problems on motherboards where I/O devices couldn't access > all of memory. > > Try this, it forces each transmit to have full 64 bit address. It drops an > optimization that was done to reduce the number of transmit descriptors > used. > > --- a/drivers/net/sky2.c 2009-12-29 09:23:37.114074275 -0800 > +++ b/drivers/net/sky2.c 2009-12-29 09:26:38.699912035 -0800 > @@ -1038,10 +1038,11 @@ static void tx_init(struct sky2_port *sk > sky2->tx_tcpsum = 0; > sky2->tx_last_mss = 0; > > - le = get_tx_le(sky2,&sky2->tx_prod); > - le->addr = 0; > - le->opcode = OP_ADDR64 | HW_OWNER; > - sky2->tx_last_upper = 0; > + if (sizeof(dma_addr_t) == sizeof(u32)) { > + le = get_tx_le(sky2,&sky2->tx_prod); > + le->addr = 0; > + le->opcode = OP_ADDR64 | HW_OWNER; > + } > } > > /* Update chip's next pointer */ > @@ -1669,10 +1670,9 @@ static netdev_tx_t sky2_xmit_frame(struc > > /* Send high bits if needed */ > upper = upper_32_bits(mapping); > - if (upper != sky2->tx_last_upper) { > + if (sizeof(mapping)> sizeof(u32)) { > le = get_tx_le(sky2,&slot); > le->addr = cpu_to_le32(upper); > - sky2->tx_last_upper = upper; > le->opcode = OP_ADDR64 | HW_OWNER; > } > > @@ -1762,10 +1762,9 @@ static netdev_tx_t sky2_xmit_frame(struc > goto mapping_unwind; > > upper = upper_32_bits(mapping); > - if (upper != sky2->tx_last_upper) { > + if (sizeof(mapping)> sizeof(u32)) { > le = get_tx_le(sky2,&slot); > le->addr = cpu_to_le32(upper); > - sky2->tx_last_upper = upper; > le->opcode = OP_ADDR64 | HW_OWNER; > } > > --- a/drivers/net/sky2.h 2009-12-29 09:23:37.164072886 -0800 > +++ b/drivers/net/sky2.h 2009-12-29 09:25:15.302197993 -0800 > @@ -2191,7 +2191,6 @@ struct sky2_port { > > u16 tx_pending; > u16 tx_last_mss; > - u32 tx_last_upper; > u32 tx_tcpsum; > > struct rx_ring_info *rx_ring ____cacheline_aligned_in_smp; > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-29 17:30 ` Stephen Hemminger 2009-12-29 17:39 ` Michael Breuer @ 2009-12-29 18:38 ` Michael Breuer 2009-12-29 18:54 ` Michael Breuer 2 siblings, 0 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-29 18:38 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev The frequency of the errors is reduced, but in place I'm seeing a large number of DHCPINFORM/DHCPACK to/from the system from which I'm generating load. It appears that it's now taking a higher throughput to cause the interrupt status errors than without this patch, but they're still occurring. Just for fun, I'll try without this patch and with VT-D disabled later today. On 12/29/2009 12:30 PM, Stephen Hemminger wrote: > There could be DMA problems, either because chip isn't remembering > upper address, or because of lack of wires from chip to memory controller; > there have been problems on motherboards where I/O devices couldn't access > all of memory. > > Try this, it forces each transmit to have full 64 bit address. It drops an > optimization that was done to reduce the number of transmit descriptors > used. > > --- a/drivers/net/sky2.c 2009-12-29 09:23:37.114074275 -0800 > +++ b/drivers/net/sky2.c 2009-12-29 09:26:38.699912035 -0800 > @@ -1038,10 +1038,11 @@ static void tx_init(struct sky2_port *sk > sky2->tx_tcpsum = 0; > sky2->tx_last_mss = 0; > > - le = get_tx_le(sky2,&sky2->tx_prod); > - le->addr = 0; > - le->opcode = OP_ADDR64 | HW_OWNER; > - sky2->tx_last_upper = 0; > + if (sizeof(dma_addr_t) == sizeof(u32)) { > + le = get_tx_le(sky2,&sky2->tx_prod); > + le->addr = 0; > + le->opcode = OP_ADDR64 | HW_OWNER; > + } > } > > /* Update chip's next pointer */ > @@ -1669,10 +1670,9 @@ static netdev_tx_t sky2_xmit_frame(struc > > /* Send high bits if needed */ > upper = upper_32_bits(mapping); > - if (upper != sky2->tx_last_upper) { > + if (sizeof(mapping)> sizeof(u32)) { > le = get_tx_le(sky2,&slot); > le->addr = cpu_to_le32(upper); > - sky2->tx_last_upper = upper; > le->opcode = OP_ADDR64 | HW_OWNER; > } > > @@ -1762,10 +1762,9 @@ static netdev_tx_t sky2_xmit_frame(struc > goto mapping_unwind; > > upper = upper_32_bits(mapping); > - if (upper != sky2->tx_last_upper) { > + if (sizeof(mapping)> sizeof(u32)) { > le = get_tx_le(sky2,&slot); > le->addr = cpu_to_le32(upper); > - sky2->tx_last_upper = upper; > le->opcode = OP_ADDR64 | HW_OWNER; > } > > --- a/drivers/net/sky2.h 2009-12-29 09:23:37.164072886 -0800 > +++ b/drivers/net/sky2.h 2009-12-29 09:25:15.302197993 -0800 > @@ -2191,7 +2191,6 @@ struct sky2_port { > > u16 tx_pending; > u16 tx_last_mss; > - u32 tx_last_upper; > u32 tx_tcpsum; > > struct rx_ring_info *rx_ring ____cacheline_aligned_in_smp; > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-29 17:30 ` Stephen Hemminger 2009-12-29 17:39 ` Michael Breuer 2009-12-29 18:38 ` Michael Breuer @ 2009-12-29 18:54 ` Michael Breuer 2009-12-29 19:49 ` Stephen Hemminger 2 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-29 18:54 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev One more data point - it appears that the errors are occurring when the interface is under load AND there is ipv6 dns activity. As this only happens when under load I can't prove it, but it seems there is a correlation. > There could be DMA problems, either because chip isn't remembering > upper address, or because of lack of wires from chip to memory controller; > there have been problems on motherboards where I/O devices couldn't access > all of memory. > > Try this, it forces each transmit to have full 64 bit address. It drops an > optimization that was done to reduce the number of transmit descriptors > used. > > --- a/drivers/net/sky2.c 2009-12-29 09:23:37.114074275 -0800 > +++ b/drivers/net/sky2.c 2009-12-29 09:26:38.699912035 -0800 > @@ -1038,10 +1038,11 @@ static void tx_init(struct sky2_port *sk > sky2->tx_tcpsum = 0; > sky2->tx_last_mss = 0; > > - le = get_tx_le(sky2,&sky2->tx_prod); > - le->addr = 0; > - le->opcode = OP_ADDR64 | HW_OWNER; > - sky2->tx_last_upper = 0; > + if (sizeof(dma_addr_t) == sizeof(u32)) { > + le = get_tx_le(sky2,&sky2->tx_prod); > + le->addr = 0; > + le->opcode = OP_ADDR64 | HW_OWNER; > + } > } > > /* Update chip's next pointer */ > @@ -1669,10 +1670,9 @@ static netdev_tx_t sky2_xmit_frame(struc > > /* Send high bits if needed */ > upper = upper_32_bits(mapping); > - if (upper != sky2->tx_last_upper) { > + if (sizeof(mapping)> sizeof(u32)) { > le = get_tx_le(sky2,&slot); > le->addr = cpu_to_le32(upper); > - sky2->tx_last_upper = upper; > le->opcode = OP_ADDR64 | HW_OWNER; > } > > @@ -1762,10 +1762,9 @@ static netdev_tx_t sky2_xmit_frame(struc > goto mapping_unwind; > > upper = upper_32_bits(mapping); > - if (upper != sky2->tx_last_upper) { > + if (sizeof(mapping)> sizeof(u32)) { > le = get_tx_le(sky2,&slot); > le->addr = cpu_to_le32(upper); > - sky2->tx_last_upper = upper; > le->opcode = OP_ADDR64 | HW_OWNER; > } > > --- a/drivers/net/sky2.h 2009-12-29 09:23:37.164072886 -0800 > +++ b/drivers/net/sky2.h 2009-12-29 09:25:15.302197993 -0800 > @@ -2191,7 +2191,6 @@ struct sky2_port { > > u16 tx_pending; > u16 tx_last_mss; > - u32 tx_last_upper; > u32 tx_tcpsum; > > struct rx_ring_info *rx_ring ____cacheline_aligned_in_smp; > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-29 18:54 ` Michael Breuer @ 2009-12-29 19:49 ` Stephen Hemminger 2009-12-29 20:41 ` Michael Breuer 2009-12-30 7:23 ` Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Stephen Hemminger @ 2009-12-29 19:49 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev This patch will dump more info in case of hardware error. --- a/drivers/net/sky2.c 2009-12-29 11:30:06.441137685 -0800 +++ b/drivers/net/sky2.c 2009-12-29 11:46:37.389262922 -0800 @@ -2629,12 +2625,71 @@ static void sky2_hw_error(struct sky2_hw } } +static void dump_txring(struct sky2_hw *hw, unsigned port) +{ + struct sky2_port *sky2 = netdev_priv(hw->dev[port]); + int sop = 1; + unsigned idx; + + printk(KERN_INFO PFX "Tx ring pending=%u...%u report=%d done=%d\n", + sky2->tx_cons, sky2->tx_prod, + sky2_read16(hw, port == 0 ? STAT_TXA1_RIDX : STAT_TXA2_RIDX), + sky2_read16(hw, Q_ADDR(txqaddr[port], Q_DONE))); + + /* Dump contents of tx ring */ + for (idx = sky2->tx_next; + idx != sky2->tx_prod && idx < sky2->tx_ring_size; + idx = RING_NEXT(idx, sky2->tx_ring_size)) { + const struct sky2_tx_le *le = sky2->tx_le + idx; + u32 a = le32_to_cpu(le->addr); + + if (sop) + printk(KERN_DEBUG "%u:", idx); + sop = 0; + + switch(le->opcode & ~HW_OWNER) { + case OP_ADDR64: + printk(" %#x:", a); + break; + case OP_LRGLEN: + printk(" mtu=%d", a); + break; + case OP_VLAN: + printk(" vlan=%d", be16_to_cpu(le->length)); + break; + case OP_TCPLISW: + printk(" csum=%#x", a); + break; + case OP_LARGESEND: + printk(" tso=%#x(%d)", a, le16_to_cpu(le->length)); + break; + case OP_PACKET: + printk(" %#x(%d)", a, le16_to_cpu(le->length)); + break; + case OP_BUFFER: + printk(" frag=%#x(%d)", a, le16_to_cpu(le->length)); + break; + default: + printk(" op=%#x,%#x(%d)", le->opcode, a, + le16_to_cpu(le->length)); + } + + if (le->ctrl & EOP) { + printk("\n"); + sop = 1; + } + } +} + static void sky2_hw_intr(struct sky2_hw *hw) { struct pci_dev *pdev = hw->pdev; u32 status = sky2_read32(hw, B0_HWE_ISRC); u32 hwmsk = sky2_read32(hw, B0_HWE_IMSK); + printk(KERN_DEBUG PFX "hardware interrupt status %#x\n", status); + dump_txring(hw, 0); + status &= hwmsk; if (status & Y2_IS_TIST_OV) -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-29 19:49 ` Stephen Hemminger @ 2009-12-29 20:41 ` Michael Breuer 2009-12-30 7:23 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-29 20:41 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev The error is coming from sky2_err_intr, not sky2_hw_intr: sky2 0000:06:00.0: error interrupt status=0x40000008 sky2 0000:06:00.0: error interrupt status=0x8 Also, fwiw, when running with vt-d disabled, I don't get the dhcpinform/ack messages, but continue to get the interrupt status messages. That said, I am not certain the the dhcpinform/ack is related at to the sky2 issue. On 12/29/2009 2:49 PM, Stephen Hemminger wrote: > This patch will dump more info in case of hardware error. > > --- a/drivers/net/sky2.c 2009-12-29 11:30:06.441137685 -0800 > +++ b/drivers/net/sky2.c 2009-12-29 11:46:37.389262922 -0800 > @@ -2629,12 +2625,71 @@ static void sky2_hw_error(struct sky2_hw > } > } > > +static void dump_txring(struct sky2_hw *hw, unsigned port) > +{ > + struct sky2_port *sky2 = netdev_priv(hw->dev[port]); > + int sop = 1; > + unsigned idx; > + > + printk(KERN_INFO PFX "Tx ring pending=%u...%u report=%d done=%d\n", > + sky2->tx_cons, sky2->tx_prod, > + sky2_read16(hw, port == 0 ? STAT_TXA1_RIDX : STAT_TXA2_RIDX), > + sky2_read16(hw, Q_ADDR(txqaddr[port], Q_DONE))); > + > + /* Dump contents of tx ring */ > + for (idx = sky2->tx_next; > + idx != sky2->tx_prod&& idx< sky2->tx_ring_size; > + idx = RING_NEXT(idx, sky2->tx_ring_size)) { > + const struct sky2_tx_le *le = sky2->tx_le + idx; > + u32 a = le32_to_cpu(le->addr); > + > + if (sop) > + printk(KERN_DEBUG "%u:", idx); > + sop = 0; > + > + switch(le->opcode& ~HW_OWNER) { > + case OP_ADDR64: > + printk(" %#x:", a); > + break; > + case OP_LRGLEN: > + printk(" mtu=%d", a); > + break; > + case OP_VLAN: > + printk(" vlan=%d", be16_to_cpu(le->length)); > + break; > + case OP_TCPLISW: > + printk(" csum=%#x", a); > + break; > + case OP_LARGESEND: > + printk(" tso=%#x(%d)", a, le16_to_cpu(le->length)); > + break; > + case OP_PACKET: > + printk(" %#x(%d)", a, le16_to_cpu(le->length)); > + break; > + case OP_BUFFER: > + printk(" frag=%#x(%d)", a, le16_to_cpu(le->length)); > + break; > + default: > + printk(" op=%#x,%#x(%d)", le->opcode, a, > + le16_to_cpu(le->length)); > + } > + > + if (le->ctrl& EOP) { > + printk("\n"); > + sop = 1; > + } > + } > +} > + > static void sky2_hw_intr(struct sky2_hw *hw) > { > struct pci_dev *pdev = hw->pdev; > u32 status = sky2_read32(hw, B0_HWE_ISRC); > u32 hwmsk = sky2_read32(hw, B0_HWE_IMSK); > > + printk(KERN_DEBUG PFX "hardware interrupt status %#x\n", status); > + dump_txring(hw, 0); > + > status&= hwmsk; > > if (status& Y2_IS_TIST_OV) > > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-29 19:49 ` Stephen Hemminger 2009-12-29 20:41 ` Michael Breuer @ 2009-12-30 7:23 ` Michael Breuer 2009-12-30 7:58 ` Stephen Hemminger 2009-12-30 7:59 ` Stephen Hemminger 1 sibling, 2 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-30 7:23 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev Ok - I called dump_txring from sky2_net_intr: --- a/drivers/net/sky2.c +++ b/drivers/net/sky2.c @@ -2725,8 +2791,10 @@ static void sky2_watchdog(unsigned long arg) /* Hardware/software error handling */ static void sky2_err_intr(struct sky2_hw *hw, u32 status) { - if (net_ratelimit()) + if (net_ratelimit()) { dev_warn(&hw->pdev->dev, "error interrupt status=%#x\n", status); + dump_txring(hw, 0); + } if (status & Y2_IS_HW_ERR) sky2_hw_intr(hw); And got this: Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 On 12/29/2009 02:49 PM, Stephen Hemminger wrote: > This patch will dump more info in case of hardware error. > > --- a/drivers/net/sky2.c 2009-12-29 11:30:06.441137685 -0800 > +++ b/drivers/net/sky2.c 2009-12-29 11:46:37.389262922 -0800 > @@ -2629,12 +2625,71 @@ static void sky2_hw_error(struct sky2_hw > } > } > > +static void dump_txring(struct sky2_hw *hw, unsigned port) > +{ > + struct sky2_port *sky2 = netdev_priv(hw->dev[port]); > + int sop = 1; > + unsigned idx; > + > + printk(KERN_INFO PFX "Tx ring pending=%u...%u report=%d done=%d\n", > + sky2->tx_cons, sky2->tx_prod, > + sky2_read16(hw, port == 0 ? STAT_TXA1_RIDX : STAT_TXA2_RIDX), > + sky2_read16(hw, Q_ADDR(txqaddr[port], Q_DONE))); > + > + /* Dump contents of tx ring */ > + for (idx = sky2->tx_next; > + idx != sky2->tx_prod&& idx< sky2->tx_ring_size; > + idx = RING_NEXT(idx, sky2->tx_ring_size)) { > + const struct sky2_tx_le *le = sky2->tx_le + idx; > + u32 a = le32_to_cpu(le->addr); > + > + if (sop) > + printk(KERN_DEBUG "%u:", idx); > + sop = 0; > + > + switch(le->opcode& ~HW_OWNER) { > + case OP_ADDR64: > + printk(" %#x:", a); > + break; > + case OP_LRGLEN: > + printk(" mtu=%d", a); > + break; > + case OP_VLAN: > + printk(" vlan=%d", be16_to_cpu(le->length)); > + break; > + case OP_TCPLISW: > + printk(" csum=%#x", a); > + break; > + case OP_LARGESEND: > + printk(" tso=%#x(%d)", a, le16_to_cpu(le->length)); > + break; > + case OP_PACKET: > + printk(" %#x(%d)", a, le16_to_cpu(le->length)); > + break; > + case OP_BUFFER: > + printk(" frag=%#x(%d)", a, le16_to_cpu(le->length)); > + break; > + default: > + printk(" op=%#x,%#x(%d)", le->opcode, a, > + le16_to_cpu(le->length)); > + } > + > + if (le->ctrl& EOP) { > + printk("\n"); > + sop = 1; > + } > + } > +} > + > static void sky2_hw_intr(struct sky2_hw *hw) > { > struct pci_dev *pdev = hw->pdev; > u32 status = sky2_read32(hw, B0_HWE_ISRC); > u32 hwmsk = sky2_read32(hw, B0_HWE_IMSK); > > + printk(KERN_DEBUG PFX "hardware interrupt status %#x\n", status); > + dump_txring(hw, 0); > + > status&= hwmsk; > > if (status& Y2_IS_TIST_OV) > > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 7:23 ` Michael Breuer @ 2009-12-30 7:58 ` Stephen Hemminger 2009-12-30 17:49 ` Michael Breuer 2009-12-30 7:59 ` Stephen Hemminger 1 sibling, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2009-12-30 7:58 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On Wed, 30 Dec 2009 02:23:20 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > Ok - I called dump_txring from sky2_net_intr: > --- a/drivers/net/sky2.c > +++ b/drivers/net/sky2.c > @@ -2725,8 +2791,10 @@ static void sky2_watchdog(unsigned long arg) > /* Hardware/software error handling */ > static void sky2_err_intr(struct sky2_hw *hw, u32 status) > { > - if (net_ratelimit()) > + if (net_ratelimit()) { > dev_warn(&hw->pdev->dev, "error interrupt > status=%#x\n", status); > + dump_txring(hw, 0); > + } > > if (status & Y2_IS_HW_ERR) > sky2_hw_intr(hw); > > And got this: > Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 > Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 > Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 > Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 > Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 > Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 > I notice that you have NOUVEAU Nvidia drivers loaded? The one difference in HW between your board and mine is that I have ATI video card. -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 7:58 ` Stephen Hemminger @ 2009-12-30 17:49 ` Michael Breuer 2009-12-30 19:15 ` audit.c skb - tty race condition - was " Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-30 17:49 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On 12/30/2009 2:58 AM, Stephen Hemminger wrote: > On Wed, 30 Dec 2009 02:23:20 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> Ok - I called dump_txring from sky2_net_intr: >> --- a/drivers/net/sky2.c >> +++ b/drivers/net/sky2.c >> @@ -2725,8 +2791,10 @@ static void sky2_watchdog(unsigned long arg) >> /* Hardware/software error handling */ >> static void sky2_err_intr(struct sky2_hw *hw, u32 status) >> { >> - if (net_ratelimit()) >> + if (net_ratelimit()) { >> dev_warn(&hw->pdev->dev, "error interrupt >> status=%#x\n", status); >> + dump_txring(hw, 0); >> + } >> >> if (status& Y2_IS_HW_ERR) >> sky2_hw_intr(hw); >> >> And got this: >> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 >> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 >> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 >> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 >> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 >> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 >> >> > I notice that you have NOUVEAU Nvidia drivers loaded? The one difference in HW > between your board and mine is that I have ATI video card. > > Seems the problem is linked to auditd and X11 (but not nouveau). Today, I ran a bunch of scenarios. I first determined that the problem only manifest in runlevel 5. Next, this occurred with or without KMS and with or without nouveau. This happened whether or not I was logged in (local or remote), and regardless of window manager (xdm, gdm, kdm). I then checked to see what else was different between runlevel 3 and 5 - only thing was auditd. I disabled auditd and reran - no errors. Now for the odd stuff: The errors only manifest if the high throughput data transfer is initiated when the system is in runlevel 5 and auditd was started by init when transitioning from runlevel 3 to 5. For example, the following scenarios do not cause the errors to manifest: runlevel3; start auditd runlevel 5; start transfer runlevel3; chkconfig auditd off; runlevel5; start auditd; start transfer runlevel3; start transfer (note: errors do not occur if I transition to runlevel 5 after the high bandwidth transfer has started) runlevel3; startx; start transfer The only way I get the problem to manifest is transition to runlevel 5 with chkconfig auditd on (level 5 only) and then initate the windows backup. I'm guessing that there is some sort of race condition happening between X (xdm/gdm/kdm/greeter?) and auditd that is somehow corrupting something. I'd hazard a more or less obvious guess that whatever's being corrupted differs when there is already a high throughput transfer under way. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: audit.c skb - tty race condition - was sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 17:49 ` Michael Breuer @ 2009-12-30 19:15 ` Michael Breuer 2009-12-30 20:44 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-30 19:15 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev And now looking at audit.c it seems reasonable that there is a race condition when auditd is started at roughly the same time as X. I'm guessing that the kaudit thread is fired up; the tty connected; and at the same time X grabs the tty. Somewhere in there an skb gets hosed and is then reused by whatever comes along - in my case sky2 as that's where the subsequent demand is. If the demand happens first, the contaminated skb (dk in what way yet) is probably waiting to manifest as some other bug that's been frustrating people. On 12/30/2009 12:49 PM, Michael Breuer wrote: > On 12/30/2009 2:58 AM, Stephen Hemminger wrote: >> On Wed, 30 Dec 2009 02:23:20 -0500 >> Michael Breuer<mbreuer@majjas.com> wrote: >> >>> Ok - I called dump_txring from sky2_net_intr: >>> --- a/drivers/net/sky2.c >>> +++ b/drivers/net/sky2.c >>> @@ -2725,8 +2791,10 @@ static void sky2_watchdog(unsigned long arg) >>> /* Hardware/software error handling */ >>> static void sky2_err_intr(struct sky2_hw *hw, u32 status) >>> { >>> - if (net_ratelimit()) >>> + if (net_ratelimit()) { >>> dev_warn(&hw->pdev->dev, "error interrupt >>> status=%#x\n", status); >>> + dump_txring(hw, 0); >>> + } >>> >>> if (status& Y2_IS_HW_ERR) >>> sky2_hw_intr(hw); >>> >>> And got this: >>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x40000008 >>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x40000008 >>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 >>> done=29 >>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 >>> done=29 >>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x8 >>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x8 >>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 >>> done=31 >>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 >>> done=31 >>> >> I notice that you have NOUVEAU Nvidia drivers loaded? The one >> difference in HW >> between your board and mine is that I have ATI video card. >> > Seems the problem is linked to auditd and X11 (but not nouveau). > > Today, I ran a bunch of scenarios. I first determined that the problem > only manifest in runlevel 5. Next, this occurred with or without KMS > and with or without nouveau. This happened whether or not I was logged > in (local or remote), and regardless of window manager (xdm, gdm, > kdm). I then checked to see what else was different between runlevel 3 > and 5 - only thing was auditd. I disabled auditd and reran - no errors. > > Now for the odd stuff: > > The errors only manifest if the high throughput data transfer is > initiated when the system is in runlevel 5 and auditd was started by > init when transitioning from runlevel 3 to 5. For example, the > following scenarios do not cause the errors to manifest: > > runlevel3; start auditd runlevel 5; start transfer > runlevel3; chkconfig auditd off; runlevel5; start auditd; start transfer > runlevel3; start transfer (note: errors do not occur if I transition > to runlevel 5 after the high bandwidth transfer has started) > runlevel3; startx; start transfer > > The only way I get the problem to manifest is transition to runlevel 5 > with chkconfig auditd on (level 5 only) and then initate the windows > backup. > > I'm guessing that there is some sort of race condition happening > between X (xdm/gdm/kdm/greeter?) and auditd that is somehow corrupting > something. I'd hazard a more or less obvious guess that whatever's > being corrupted differs when there is already a high throughput > transfer under way. > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: audit.c skb - tty race condition - was sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 19:15 ` audit.c skb - tty race condition - was " Michael Breuer @ 2009-12-30 20:44 ` Michael Breuer 2009-12-30 21:15 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-30 20:44 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev A couple more observations: 1) enabling auditd for runlevel 3 mitigates the issue 2) starting a remote x session (XDMCP) while under load and while auditd is already running also triggers the sky2 interrupt status messages - so maybe not tty1 - but some sort of X & auditd interaction. Even in this case, the frequency of the error messages is much less than when auditd is started in runlevel 5 for the first time. On 12/30/2009 2:15 PM, Michael Breuer wrote: > And now looking at audit.c it seems reasonable that there is a race > condition when auditd is started at roughly the same time as X. I'm > guessing that the kaudit thread is fired up; the tty connected; and at > the same time X grabs the tty. Somewhere in there an skb gets hosed > and is then reused by whatever comes along - in my case sky2 as that's > where the subsequent demand is. If the demand happens first, the > contaminated skb (dk in what way yet) is probably waiting to manifest > as some other bug that's been frustrating people. > On 12/30/2009 12:49 PM, Michael Breuer wrote: >> On 12/30/2009 2:58 AM, Stephen Hemminger wrote: >>> On Wed, 30 Dec 2009 02:23:20 -0500 >>> Michael Breuer<mbreuer@majjas.com> wrote: >>> >>>> Ok - I called dump_txring from sky2_net_intr: >>>> --- a/drivers/net/sky2.c >>>> +++ b/drivers/net/sky2.c >>>> @@ -2725,8 +2791,10 @@ static void sky2_watchdog(unsigned long arg) >>>> /* Hardware/software error handling */ >>>> static void sky2_err_intr(struct sky2_hw *hw, u32 status) >>>> { >>>> - if (net_ratelimit()) >>>> + if (net_ratelimit()) { >>>> dev_warn(&hw->pdev->dev, "error interrupt >>>> status=%#x\n", status); >>>> + dump_txring(hw, 0); >>>> + } >>>> >>>> if (status& Y2_IS_HW_ERR) >>>> sky2_hw_intr(hw); >>>> >>>> And got this: >>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x40000008 >>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x40000008 >>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 >>>> done=29 >>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 >>>> done=29 >>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x8 >>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x8 >>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 >>>> done=31 >>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 >>>> done=31 >>>> >>> I notice that you have NOUVEAU Nvidia drivers loaded? The one >>> difference in HW >>> between your board and mine is that I have ATI video card. >>> >> Seems the problem is linked to auditd and X11 (but not nouveau). >> >> Today, I ran a bunch of scenarios. I first determined that the >> problem only manifest in runlevel 5. Next, this occurred with or >> without KMS and with or without nouveau. This happened whether or not >> I was logged in (local or remote), and regardless of window manager >> (xdm, gdm, kdm). I then checked to see what else was different >> between runlevel 3 and 5 - only thing was auditd. I disabled auditd >> and reran - no errors. >> >> Now for the odd stuff: >> >> The errors only manifest if the high throughput data transfer is >> initiated when the system is in runlevel 5 and auditd was started by >> init when transitioning from runlevel 3 to 5. For example, the >> following scenarios do not cause the errors to manifest: >> >> runlevel3; start auditd runlevel 5; start transfer >> runlevel3; chkconfig auditd off; runlevel5; start auditd; start transfer >> runlevel3; start transfer (note: errors do not occur if I transition >> to runlevel 5 after the high bandwidth transfer has started) >> runlevel3; startx; start transfer >> >> The only way I get the problem to manifest is transition to runlevel >> 5 with chkconfig auditd on (level 5 only) and then initate the >> windows backup. >> >> I'm guessing that there is some sort of race condition happening >> between X (xdm/gdm/kdm/greeter?) and auditd that is somehow >> corrupting something. I'd hazard a more or less obvious guess that >> whatever's being corrupted differs when there is already a high >> throughput transfer under way. >> >> -- >> To unsubscribe from this list: send the line "unsubscribe >> linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: audit.c skb - tty race condition - was sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 20:44 ` Michael Breuer @ 2009-12-30 21:15 ` Michael Breuer 2009-12-30 21:21 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-30 21:15 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On 12/30/2009 3:44 PM, Michael Breuer wrote: > A couple more observations: > > 1) enabling auditd for runlevel 3 mitigates the issue > 2) starting a remote x session (XDMCP) while under load and while > auditd is already running also triggers the sky2 interrupt status > messages - so maybe not tty1 - but some sort of X & auditd > interaction. Even in this case, the frequency of the error messages is > much less than when auditd is started in runlevel 5 for the first time. And just for some certainty - running with audit=0, the errors do not manifest. > > On 12/30/2009 2:15 PM, Michael Breuer wrote: >> And now looking at audit.c it seems reasonable that there is a race >> condition when auditd is started at roughly the same time as X. I'm >> guessing that the kaudit thread is fired up; the tty connected; and >> at the same time X grabs the tty. Somewhere in there an skb gets >> hosed and is then reused by whatever comes along - in my case sky2 as >> that's where the subsequent demand is. If the demand happens first, >> the contaminated skb (dk in what way yet) is probably waiting to >> manifest as some other bug that's been frustrating people. >> On 12/30/2009 12:49 PM, Michael Breuer wrote: >>> On 12/30/2009 2:58 AM, Stephen Hemminger wrote: >>>> On Wed, 30 Dec 2009 02:23:20 -0500 >>>> Michael Breuer<mbreuer@majjas.com> wrote: >>>> >>>>> Ok - I called dump_txring from sky2_net_intr: >>>>> --- a/drivers/net/sky2.c >>>>> +++ b/drivers/net/sky2.c >>>>> @@ -2725,8 +2791,10 @@ static void sky2_watchdog(unsigned long arg) >>>>> /* Hardware/software error handling */ >>>>> static void sky2_err_intr(struct sky2_hw *hw, u32 status) >>>>> { >>>>> - if (net_ratelimit()) >>>>> + if (net_ratelimit()) { >>>>> dev_warn(&hw->pdev->dev, "error interrupt >>>>> status=%#x\n", status); >>>>> + dump_txring(hw, 0); >>>>> + } >>>>> >>>>> if (status& Y2_IS_HW_ERR) >>>>> sky2_hw_intr(hw); >>>>> >>>>> And got this: >>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>>> status=0x40000008 >>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>>> status=0x40000008 >>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 >>>>> report=29 done=29 >>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 >>>>> report=29 done=29 >>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>>> status=0x8 >>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>>> status=0x8 >>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 >>>>> report=30 done=31 >>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 >>>>> report=30 done=31 >>>>> >>>> I notice that you have NOUVEAU Nvidia drivers loaded? The one >>>> difference in HW >>>> between your board and mine is that I have ATI video card. >>>> >>> Seems the problem is linked to auditd and X11 (but not nouveau). >>> >>> Today, I ran a bunch of scenarios. I first determined that the >>> problem only manifest in runlevel 5. Next, this occurred with or >>> without KMS and with or without nouveau. This happened whether or >>> not I was logged in (local or remote), and regardless of window >>> manager (xdm, gdm, kdm). I then checked to see what else was >>> different between runlevel 3 and 5 - only thing was auditd. I >>> disabled auditd and reran - no errors. >>> >>> Now for the odd stuff: >>> >>> The errors only manifest if the high throughput data transfer is >>> initiated when the system is in runlevel 5 and auditd was started by >>> init when transitioning from runlevel 3 to 5. For example, the >>> following scenarios do not cause the errors to manifest: >>> >>> runlevel3; start auditd runlevel 5; start transfer >>> runlevel3; chkconfig auditd off; runlevel5; start auditd; start >>> transfer >>> runlevel3; start transfer (note: errors do not occur if I transition >>> to runlevel 5 after the high bandwidth transfer has started) >>> runlevel3; startx; start transfer >>> >>> The only way I get the problem to manifest is transition to runlevel >>> 5 with chkconfig auditd on (level 5 only) and then initate the >>> windows backup. >>> >>> I'm guessing that there is some sort of race condition happening >>> between X (xdm/gdm/kdm/greeter?) and auditd that is somehow >>> corrupting something. I'd hazard a more or less obvious guess that >>> whatever's being corrupted differs when there is already a high >>> throughput transfer under way. >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-kernel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> Please read the FAQ at http://www.tux.org/lkml/ >> >> -- >> To unsubscribe from this list: send the line "unsubscribe >> linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: audit.c skb - tty race condition - was sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 21:15 ` Michael Breuer @ 2009-12-30 21:21 ` Michael Breuer 0 siblings, 0 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-30 21:21 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On 12/30/2009 4:15 PM, Michael Breuer wrote: > On 12/30/2009 3:44 PM, Michael Breuer wrote: >> A couple more observations: >> >> 1) enabling auditd for runlevel 3 mitigates the issue >> 2) starting a remote x session (XDMCP) while under load and while >> auditd is already running also triggers the sky2 interrupt status >> messages - so maybe not tty1 - but some sort of X & auditd >> interaction. Even in this case, the frequency of the error messages >> is much less than when auditd is started in runlevel 5 for the first >> time. > And just for some certainty - running with audit=0, the errors do not > manifest. Disregard - took way longer, but they're back: Dec 30 16:16:42 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:16:42 mail kernel: sky2 Tx ring pending=72...74 report=73 done=73 Dec 30 16:16:42 mail kernel: 72: 0x0: 0xb18a00be(8490) Dec 30 16:16:42 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:16:42 mail kernel: sky2 Tx ring pending=86...88 report=87 done=88 Dec 30 16:16:42 mail kernel: 86: 0x0: 0xb1a140be(9014) Dec 30 16:16:58 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:16:58 mail kernel: sky2 Tx ring pending=50...52 report=51 done=52 Dec 30 16:16:58 mail kernel: 50: 0x0: 0xac27c0be(8490) Dec 30 16:17:15 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Dec 30 16:17:15 mail kernel: sky2 Tx ring pending=16...18 report=16 done=17 Dec 30 16:17:15 mail kernel: 16: 0x0: 0xa67ec0be(8490) Dec 30 16:17:15 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:17:15 mail kernel: sky2 Tx ring pending=122...124 report=123 done=124 Dec 30 16:17:15 mail kernel: 122: 0x0: 0xa68980be(8490) Dec 30 16:17:16 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:17:16 mail kernel: sky2 Tx ring pending=66...68 report=67 done=67 Dec 30 16:17:16 mail kernel: 66: 0x0: 0xa67200be(8490) Dec 30 16:17:16 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:17:16 mail kernel: sky2 Tx ring pending=74...76 report=75 done=76 Dec 30 16:17:16 mail kernel: 74: 0x0: 0xa659c0be(8490) Dec 30 16:17:21 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0 Dec 30 16:17:21 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via eth0 Dec 30 16:18:01 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:18:01 mail kernel: sky2 Tx ring pending=0...2 report=1 done=1 Dec 30 16:18:01 mail kernel: 0: 0x0: 0x926580be(8490) Dec 30 16:18:04 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:18:04 mail kernel: sky2 Tx ring pending=116...118 report=117 done=117 Dec 30 16:18:04 mail kernel: 116: 0x0: 0x91f380be(8490) Dec 30 16:18:04 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Dec 30 16:18:04 mail kernel: sky2 Tx ring pending=102...104 report=102 done=103 Dec 30 16:18:04 mail kernel: 102: 0x0: 0x91ed80be(8490) Dec 30 16:18:05 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:18:05 mail kernel: sky2 Tx ring pending=64...68 report=65 done=68 Dec 30 16:18:05 mail kernel: 64: 0x0: 0x91d340be(8490) Dec 30 16:18:05 mail kernel: 66: 0x0: 0x920040ca(105) Dec 30 16:18:06 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:18:06 mail kernel: sky2 Tx ring pending=82...86 report=83 done=83 Dec 30 16:18:06 mail kernel: 82: 0x0: 0x91bf80be(9014) Dec 30 16:18:06 mail kernel: 84: 0x0: 0x91bec0be(7518) Dec 30 16:18:06 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:18:06 mail kernel: sky2 Tx ring pending=83...86 report=85 done=86 Dec 30 16:18:06 mail kernel: 82: 0x0: 0x91bf80be(9014) Dec 30 16:18:06 mail kernel: 84: 0x0: 0x91bec0be(7518) Dec 30 16:18:06 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 16:18:06 mail kernel: sky2 Tx ring pending=85...88 report=87 done=87 Dec 30 16:18:06 mail kernel: 84: 0x0: 0x91bec0be(7518) Dec 30 16:18:06 mail kernel: 86: 0x0: 0x91b080be(8398) Dec 30 16:18:08 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Dec 30 16:18:08 mail kernel: sky2 Tx ring pending=28...30 report=28 done=29 Dec 30 16:18:08 mail kernel: 28: 0x0: 0x919500be(8506) ... continues ... >> >> On 12/30/2009 2:15 PM, Michael Breuer wrote: >>> And now looking at audit.c it seems reasonable that there is a race >>> condition when auditd is started at roughly the same time as X. I'm >>> guessing that the kaudit thread is fired up; the tty connected; and >>> at the same time X grabs the tty. Somewhere in there an skb gets >>> hosed and is then reused by whatever comes along - in my case sky2 >>> as that's where the subsequent demand is. If the demand happens >>> first, the contaminated skb (dk in what way yet) is probably waiting >>> to manifest as some other bug that's been frustrating people. >>> On 12/30/2009 12:49 PM, Michael Breuer wrote: >>>> On 12/30/2009 2:58 AM, Stephen Hemminger wrote: >>>>> On Wed, 30 Dec 2009 02:23:20 -0500 >>>>> Michael Breuer<mbreuer@majjas.com> wrote: >>>>> >>>>>> Ok - I called dump_txring from sky2_net_intr: >>>>>> --- a/drivers/net/sky2.c >>>>>> +++ b/drivers/net/sky2.c >>>>>> @@ -2725,8 +2791,10 @@ static void sky2_watchdog(unsigned long arg) >>>>>> /* Hardware/software error handling */ >>>>>> static void sky2_err_intr(struct sky2_hw *hw, u32 status) >>>>>> { >>>>>> - if (net_ratelimit()) >>>>>> + if (net_ratelimit()) { >>>>>> dev_warn(&hw->pdev->dev, "error interrupt >>>>>> status=%#x\n", status); >>>>>> + dump_txring(hw, 0); >>>>>> + } >>>>>> >>>>>> if (status& Y2_IS_HW_ERR) >>>>>> sky2_hw_intr(hw); >>>>>> >>>>>> And got this: >>>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>>>> status=0x40000008 >>>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>>>> status=0x40000008 >>>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 >>>>>> report=29 done=29 >>>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 >>>>>> report=29 done=29 >>>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>>>> status=0x8 >>>>>> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >>>>>> status=0x8 >>>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 >>>>>> report=30 done=31 >>>>>> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 >>>>>> report=30 done=31 >>>>>> >>>>> I notice that you have NOUVEAU Nvidia drivers loaded? The one >>>>> difference in HW >>>>> between your board and mine is that I have ATI video card. >>>>> >>>> Seems the problem is linked to auditd and X11 (but not nouveau). >>>> >>>> Today, I ran a bunch of scenarios. I first determined that the >>>> problem only manifest in runlevel 5. Next, this occurred with or >>>> without KMS and with or without nouveau. This happened whether or >>>> not I was logged in (local or remote), and regardless of window >>>> manager (xdm, gdm, kdm). I then checked to see what else was >>>> different between runlevel 3 and 5 - only thing was auditd. I >>>> disabled auditd and reran - no errors. >>>> >>>> Now for the odd stuff: >>>> >>>> The errors only manifest if the high throughput data transfer is >>>> initiated when the system is in runlevel 5 and auditd was started >>>> by init when transitioning from runlevel 3 to 5. For example, the >>>> following scenarios do not cause the errors to manifest: >>>> >>>> runlevel3; start auditd runlevel 5; start transfer >>>> runlevel3; chkconfig auditd off; runlevel5; start auditd; start >>>> transfer >>>> runlevel3; start transfer (note: errors do not occur if I >>>> transition to runlevel 5 after the high bandwidth transfer has >>>> started) >>>> runlevel3; startx; start transfer >>>> >>>> The only way I get the problem to manifest is transition to >>>> runlevel 5 with chkconfig auditd on (level 5 only) and then initate >>>> the windows backup. >>>> >>>> I'm guessing that there is some sort of race condition happening >>>> between X (xdm/gdm/kdm/greeter?) and auditd that is somehow >>>> corrupting something. I'd hazard a more or less obvious guess that >>>> whatever's being corrupted differs when there is already a high >>>> throughput transfer under way. >>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe >>>> linux-kernel" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> Please read the FAQ at http://www.tux.org/lkml/ >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-kernel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> Please read the FAQ at http://www.tux.org/lkml/ >> >> -- >> To unsubscribe from this list: send the line "unsubscribe >> linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 7:23 ` Michael Breuer 2009-12-30 7:58 ` Stephen Hemminger @ 2009-12-30 7:59 ` Stephen Hemminger 2009-12-30 15:40 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2009-12-30 7:59 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On Wed, 30 Dec 2009 02:23:20 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > And got this: > Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 > Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 > Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 > Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 > Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 > Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 The interesting bits are only output at DEBUG level. So either change kernel log level or syslog. -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 7:59 ` Stephen Hemminger @ 2009-12-30 15:40 ` Michael Breuer 2009-12-30 18:10 ` Stephen Hemminger 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-30 15:40 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev Ok - reran: Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=31...40 report=35 done=35 Dec 30 10:37:14 mail kernel: 30: 0x0: 0xeb50c0ca(9014) Dec 30 10:37:14 mail kernel: 32: 0x0: 0xeb5080ca(9014) Dec 30 10:37:14 mail kernel: 34: 0x0: 0xeb5040ca(9014) Dec 30 10:37:14 mail kernel: 36: 0x0: 0xeb5000ca(9014) Dec 30 10:37:14 mail kernel: 38: 0x0: 0xeb4fc0ca(9014) Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=87...92 report=89 done=91 Dec 30 10:37:14 mail kernel: 86: 0x0: 0xeafd80ca(9014) Dec 30 10:37:14 mail kernel: 88: 0x0: 0xeafd40ca(9014) Dec 30 10:37:14 mail kernel: 90: 0x0: 0xeafd00ca(9014) Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=39...42 report=39 done=41 Dec 30 10:37:15 mail kernel: 38: 0x0: 0xe5fec0ca(9014) Dec 30 10:37:15 mail kernel: 40: 0x0: 0xe5fe80ca(9014) Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008 Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=117...124 report=119 done=119 Dec 30 10:37:15 mail kernel: 116: 0x0: 0xe5d6c0ca(9014) Dec 30 10:37:15 mail kernel: 118: 0x0: 0xe5d680ca(9014) Dec 30 10:37:15 mail kernel: 120: 0x0: 0xe5d660ca(5938) Dec 30 10:37:15 mail kernel: 122: 0x0: 0xe5d640ca(4213) continues on like this - but this is the start. On 12/30/2009 02:59 AM, Stephen Hemminger wrote: > On Wed, 30 Dec 2009 02:23:20 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> And got this: >> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 >> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=28...30 report=29 done=29 >> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 >> Dec 30 02:17:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 >> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 >> Dec 30 02:17:23 mail kernel: sky2 Tx ring pending=30...32 report=30 done=31 >> > The interesting bits are only output at DEBUG level. So either change kernel > log level or syslog. > > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 15:40 ` Michael Breuer @ 2009-12-30 18:10 ` Stephen Hemminger 2009-12-30 18:37 ` Michael Breuer 2009-12-31 18:09 ` Michael Breuer 0 siblings, 2 replies; 145+ messages in thread From: Stephen Hemminger @ 2009-12-30 18:10 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On Wed, 30 Dec 2009 10:40:56 -0500 Michael Breuer <mbreuer@majjas.com> wrote: > Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=31...40 report=35 done=35 > Dec 30 10:37:14 mail kernel: 30: 0x0: 0xeb50c0ca(9014) > Dec 30 10:37:14 mail kernel: 32: 0x0: 0xeb5080ca(9014) > Dec 30 10:37:14 mail kernel: 34: 0x0: 0xeb5040ca(9014) > Dec 30 10:37:14 mail kernel: 36: 0x0: 0xeb5000ca(9014) > Dec 30 10:37:14 mail kernel: 38: 0x0: 0xeb4fc0ca(9014) You must be using 9K MTU. Do you see the problem with smaller MTU? > Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=87...92 report=89 done=91 > Dec 30 10:37:14 mail kernel: 86: 0x0: 0xeafd80ca(9014) > Dec 30 10:37:14 mail kernel: 88: 0x0: 0xeafd40ca(9014) > Dec 30 10:37:14 mail kernel: 90: 0x0: 0xeafd00ca(9014) > Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 > Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=39...42 report=39 done=41 > Dec 30 10:37:15 mail kernel: 38: 0x0: 0xe5fec0ca(9014) > Dec 30 10:37:15 mail kernel: 40: 0x0: 0xe5fe80ca(9014) > Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x40000008 > Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=117...124 report=119 > done=119 > Dec 30 10:37:15 mail kernel: 116: 0x0: 0xe5d6c0ca(9014) > Dec 30 10:37:15 mail kernel: 118: 0x0: 0xe5d680ca(9014) > Dec 30 10:37:15 mail kernel: 120: 0x0: 0xe5d660ca(5938) > Dec 30 10:37:15 mail kernel: 122: 0x0: 0xe5d640ca(4213) -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 18:10 ` Stephen Hemminger @ 2009-12-30 18:37 ` Michael Breuer 2009-12-31 18:09 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-30 18:37 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On 12/30/2009 1:10 PM, Stephen Hemminger wrote: > On Wed, 30 Dec 2009 10:40:56 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=31...40 report=35 done=35 >> Dec 30 10:37:14 mail kernel: 30: 0x0: 0xeb50c0ca(9014) >> Dec 30 10:37:14 mail kernel: 32: 0x0: 0xeb5080ca(9014) >> Dec 30 10:37:14 mail kernel: 34: 0x0: 0xeb5040ca(9014) >> Dec 30 10:37:14 mail kernel: 36: 0x0: 0xeb5000ca(9014) >> Dec 30 10:37:14 mail kernel: 38: 0x0: 0xeb4fc0ca(9014) >> > You must be using 9K MTU. Do you see the problem with smaller > MTU? > Yes - decided to try 9k in the past day or so. Can redo with smaller again & debug if it would be helpful. > >> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=87...92 report=89 done=91 >> Dec 30 10:37:14 mail kernel: 86: 0x0: 0xeafd80ca(9014) >> Dec 30 10:37:14 mail kernel: 88: 0x0: 0xeafd40ca(9014) >> Dec 30 10:37:14 mail kernel: 90: 0x0: 0xeafd00ca(9014) >> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 >> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=39...42 report=39 done=41 >> Dec 30 10:37:15 mail kernel: 38: 0x0: 0xe5fec0ca(9014) >> Dec 30 10:37:15 mail kernel: 40: 0x0: 0xe5fe80ca(9014) >> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=117...124 report=119 >> done=119 >> Dec 30 10:37:15 mail kernel: 116: 0x0: 0xe5d6c0ca(9014) >> Dec 30 10:37:15 mail kernel: 118: 0x0: 0xe5d680ca(9014) >> Dec 30 10:37:15 mail kernel: 120: 0x0: 0xe5d660ca(5938) >> Dec 30 10:37:15 mail kernel: 122: 0x0: 0xe5d640ca(4213) >> > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 18:10 ` Stephen Hemminger 2009-12-30 18:37 ` Michael Breuer @ 2009-12-31 18:09 ` Michael Breuer 2009-12-31 18:24 ` Stephen Hemminger 2010-01-01 17:42 ` Michael Breuer 1 sibling, 2 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-31 18:09 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev Did some more digging today... Haven't nailed it, but there's something going on with X and tty... Among other things, when I telinit 3 && telinit 5 the tty keeps switching between 7 and 8 (nothing else running on either tty). It would appear that somehow the tty deallocation isn't complete when X restarts. Also, X grabbing a tty seems to be a requisite step in recreating the sky2 issue. On 12/30/2009 1:10 PM, Stephen Hemminger wrote: > On Wed, 30 Dec 2009 10:40:56 -0500 > Michael Breuer<mbreuer@majjas.com> wrote: > > >> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=31...40 report=35 done=35 >> Dec 30 10:37:14 mail kernel: 30: 0x0: 0xeb50c0ca(9014) >> Dec 30 10:37:14 mail kernel: 32: 0x0: 0xeb5080ca(9014) >> Dec 30 10:37:14 mail kernel: 34: 0x0: 0xeb5040ca(9014) >> Dec 30 10:37:14 mail kernel: 36: 0x0: 0xeb5000ca(9014) >> Dec 30 10:37:14 mail kernel: 38: 0x0: 0xeb4fc0ca(9014) >> > You must be using 9K MTU. Do you see the problem with smaller > MTU? > > >> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=87...92 report=89 done=91 >> Dec 30 10:37:14 mail kernel: 86: 0x0: 0xeafd80ca(9014) >> Dec 30 10:37:14 mail kernel: 88: 0x0: 0xeafd40ca(9014) >> Dec 30 10:37:14 mail kernel: 90: 0x0: 0xeafd00ca(9014) >> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 >> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=39...42 report=39 done=41 >> Dec 30 10:37:15 mail kernel: 38: 0x0: 0xe5fec0ca(9014) >> Dec 30 10:37:15 mail kernel: 40: 0x0: 0xe5fe80ca(9014) >> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x40000008 >> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=117...124 report=119 >> done=119 >> Dec 30 10:37:15 mail kernel: 116: 0x0: 0xe5d6c0ca(9014) >> Dec 30 10:37:15 mail kernel: 118: 0x0: 0xe5d680ca(9014) >> Dec 30 10:37:15 mail kernel: 120: 0x0: 0xe5d660ca(5938) >> Dec 30 10:37:15 mail kernel: 122: 0x0: 0xe5d640ca(4213) >> > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-31 18:09 ` Michael Breuer @ 2009-12-31 18:24 ` Stephen Hemminger 2010-01-01 17:42 ` Michael Breuer 1 sibling, 0 replies; 145+ messages in thread From: Stephen Hemminger @ 2009-12-31 18:24 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev I have similar motherboards and ACPI is having issues with parsing the tables, probably not related but see: http://bugzilla.kernel.org/show_bug.cgi?id=14954 -- ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-31 18:09 ` Michael Breuer 2009-12-31 18:24 ` Stephen Hemminger @ 2010-01-01 17:42 ` Michael Breuer 2010-01-01 19:26 ` sky2 panic in 2.6.32.1 under load (tty NULL write) Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-01 17:42 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev More results... Seems that this is related to the tty (or maybe vt or maybe console) driver. I had disabled mingetty on tty1 as a workaround to a Fedora/KDE issue. When I reenabled it, I was then able to recreate this problem (sky2) in runlevel 3. Retesting, it seems that the following conditions must be met in order for the interrupt errors to happen: 1) mingetty running on tty1 before the start of the test run 2) High sustained incoming load - 25MB/sec or more (as reported by nethogs) 3) I've only been able to recreate this using samba & a Windows 7 backup - but there could be other triggers As it happens, much of my other data all came down to mingetty starting or not starting on tty1 at specific points of test preparation. At this point I can recreate 100% of the time if I ensure that there is a mingetty running on tty1, regardless of runlevel; and I can not recreate 100% of the time when there is no mingetty running on tty1 at the start of the test. The results do not change if I start or stop mingetty on tty1 during the test. It's possible that X running on tty1 may also trigger the issue, but I haven't tested that. Note that running mingetty on the other tty's does not affect the outcome of the test. The test is also unaffected if I chvt to or away from tty1 before the test. I do not have to be logged in at the console (on any vt) to trigger the issue. In a nutshell, to recreate: make sure there is a mingetty running on tty1, blast traffic into the system on eth0 (sky2). On 12/31/2009 1:09 PM, Michael Breuer wrote: > Did some more digging today... Haven't nailed it, but there's > something going on with X and tty... > > Among other things, when I telinit 3 && telinit 5 the tty keeps > switching between 7 and 8 (nothing else running on either tty). It > would appear that somehow the tty deallocation isn't complete when X > restarts. Also, X grabbing a tty seems to be a requisite step in > recreating the sky2 issue. > > On 12/30/2009 1:10 PM, Stephen Hemminger wrote: >> On Wed, 30 Dec 2009 10:40:56 -0500 >> Michael Breuer<mbreuer@majjas.com> wrote: >> >>> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x40000008 >>> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=31...40 report=35 >>> done=35 >>> Dec 30 10:37:14 mail kernel: 30: 0x0: 0xeb50c0ca(9014) >>> Dec 30 10:37:14 mail kernel: 32: 0x0: 0xeb5080ca(9014) >>> Dec 30 10:37:14 mail kernel: 34: 0x0: 0xeb5040ca(9014) >>> Dec 30 10:37:14 mail kernel: 36: 0x0: 0xeb5000ca(9014) >>> Dec 30 10:37:14 mail kernel: 38: 0x0: 0xeb4fc0ca(9014) >> You must be using 9K MTU. Do you see the problem with smaller >> MTU? >> >>> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x40000008 >>> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=87...92 report=89 >>> done=91 >>> Dec 30 10:37:14 mail kernel: 86: 0x0: 0xeafd80ca(9014) >>> Dec 30 10:37:14 mail kernel: 88: 0x0: 0xeafd40ca(9014) >>> Dec 30 10:37:14 mail kernel: 90: 0x0: 0xeafd00ca(9014) >>> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x8 >>> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=39...42 report=39 >>> done=41 >>> Dec 30 10:37:15 mail kernel: 38: 0x0: 0xe5fec0ca(9014) >>> Dec 30 10:37:15 mail kernel: 40: 0x0: 0xe5fe80ca(9014) >>> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x40000008 >>> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=117...124 report=119 >>> done=119 >>> Dec 30 10:37:15 mail kernel: 116: 0x0: 0xe5d6c0ca(9014) >>> Dec 30 10:37:15 mail kernel: 118: 0x0: 0xe5d680ca(9014) >>> Dec 30 10:37:15 mail kernel: 120: 0x0: 0xe5d660ca(5938) >>> Dec 30 10:37:15 mail kernel: 122: 0x0: 0xe5d640ca(4213) > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (tty NULL write) 2010-01-01 17:42 ` Michael Breuer @ 2010-01-01 19:26 ` Michael Breuer 2010-01-01 20:34 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-01 19:26 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev Running with TTY_DEBUG_HANGUP: On boot - lots of stuff opening and closing tty1. Once up and stable (runlevel 3), I did initctl stop tty1, and then initctl start tty1. This is what was logged: Jan 1 14:03:21 mail root: close mingetty1 Jan 1 14:03:21 mail kernel: tty_release_dev of tty1 (tty count=6)... Jan 1 14:03:21 mail kernel: tty_release_dev of tty1 (tty count=5)... Jan 1 14:03:21 mail kernel: tty1 vhangup... Jan 1 14:03:21 mail init: tty1 main process (6031) killed by TERM signal Jan 1 14:03:36 mail root: start mingetty1 Jan 1 14:03:36 mail kernel: error attempted to write to tty [0x(null)] = NULL Jan 1 14:03:36 mail kernel: opening tty1... Jan 1 14:03:36 mail kernel: tty1 vhangup... Jan 1 14:03:36 mail kernel: tty_release_dev of tty1 (tty count=5)... Start/stop of mingetty on other ttys does not get the NULL write error. I am always getting the NULL error when starting mingetty on tty1. For example: Jan 1 14:04:11 mail root: restart mingetty2 Jan 1 14:04:11 mail init: tty2 main process (6029) killed by TERM signal Jan 1 14:04:11 mail kernel: tty_release_dev of tty7 (tty count=2)... Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... Jan 1 14:04:11 mail kernel: freeing tty structure... Jan 1 14:04:11 mail kernel: opening tty2... Jan 1 14:04:11 mail kernel: tty2 vhangup... Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... Jan 1 14:04:11 mail kernel: freeing tty structure... While starting/stopping tty2 and X was coming up, I also got this: Jan 1 14:04:11 mail root: restart mingetty2 Jan 1 14:04:11 mail init: tty2 main process (6029) killed by TERM signal Jan 1 14:04:11 mail kernel: tty_release_dev of tty7 (tty count=2)... Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... Jan 1 14:04:11 mail kernel: freeing tty structure... Jan 1 14:04:11 mail kernel: opening tty2... Jan 1 14:04:11 mail kernel: tty2 vhangup... Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... Jan 1 14:04:11 mail kernel: freeing tty structure... Jan 1 14:04:17 mail gnome-session[6832]: WARNING: Could not parse desktop file /home/mbreuer/.config/autostart/xfce4-tips-autostart.desktop: Key file does not have key 'Name' Jan 1 14:04:17 mail gnome-session[6832]: WARNING: could not read /home/mbreuer/.config/autostart/xfce4-tips-autostart.desktop Jan 1 14:04:19 mail pulseaudio[7002]: pid.c: Stale PID file, overwriting. Jan 1 14:04:20 mail kernel: opening tty2... Jan 1 14:04:20 mail kernel: opening pts2... Jan 1 14:04:20 mail kernel: opening pts2... Jan 1 14:04:20 mail kernel: tty_release_dev of pts2 (tty count=3)... Jan 1 14:04:20 mail kernel: opening pts2... Jan 1 14:04:20 mail kernel: tty_release_dev of pts2 (tty count=3)... Jan 1 14:04:20 mail kernel: opening pts3... Jan 1 14:04:20 mail kernel: opening pts3... Jan 1 14:04:20 mail kernel: tty_release_dev of pts3 (tty count=3)... Jan 1 14:04:20 mail kernel: opening pts3... Jan 1 14:04:20 mail kernel: tty_release_dev of pts3 (tty count=3)... Jan 1 14:04:20 mail kernel: opening pts4... Jan 1 14:04:20 mail kernel: opening pts4... Jan 1 14:04:20 mail kernel: tty_release_dev of pts4 (tty count=3)... Jan 1 14:04:20 mail kernel: opening pts4... Jan 1 14:04:21 mail kernel: tty_release_dev of pts4 (tty count=3)... Jan 1 14:04:21 mail kernel: DMA-API: debugging out of memory - disabling Jan 1 14:04:23 mail pulseaudio[7417]: pid.c: Daemon already running. Jan 1 14:04:24 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Jan 1 14:04:24 mail kernel: sky2 Tx ring pending=45...52 report=45 done=47 Jan 1 14:04:24 mail kernel: 44: 0x0: 0xf26a40be(9014) Jan 1 14:04:24 mail kernel: 46: 0x0: 0xf26a00be(9014) Jan 1 14:04:24 mail kernel: 48: 0x0: 0xf269c0be(9014) Jan 1 14:04:24 mail kernel: 50: 0x0: 0xf26980be(9014) Jan 1 14:04:24 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8 Jan 1 14:04:24 mail kernel: sky2 Tx ring pending=47...52 report=47 done=49 Jan 1 14:04:24 mail kernel: 46: 0x0: 0xf26a00be(9014) Jan 1 14:04:24 mail kernel: 48: 0x0: 0xf269c0be(9014) Jan 1 14:04:24 mail kernel: 50: 0x0: 0xf26980be(9014) Note the sky2 errors (this time not under load). It looks like there is something bad going on with the tty driver. On 01/01/2010 12:42 PM, Michael Breuer wrote: > More results... Seems that this is related to the tty (or maybe vt or > maybe console) driver. I had disabled mingetty on tty1 as a workaround > to a Fedora/KDE issue. When I reenabled it, I was then able to > recreate this problem (sky2) in runlevel 3. > > Retesting, it seems that the following conditions must be met in order > for the interrupt errors to happen: > > 1) mingetty running on tty1 before the start of the test run > 2) High sustained incoming load - 25MB/sec or more (as reported by > nethogs) > 3) I've only been able to recreate this using samba & a Windows 7 > backup - but there could be other triggers > > As it happens, much of my other data all came down to mingetty > starting or not starting on tty1 at specific points of test > preparation. At this point I can recreate 100% of the time if I ensure > that there is a mingetty running on tty1, regardless of runlevel; and > I can not recreate 100% of the time when there is no mingetty running > on tty1 at the start of the test. The results do not change if I start > or stop mingetty on tty1 during the test. It's possible that X running > on tty1 may also trigger the issue, but I haven't tested that. Note > that running mingetty on the other tty's does not affect the outcome > of the test. The test is also unaffected if I chvt to or away from > tty1 before the test. I do not have to be logged in at the console (on > any vt) to trigger the issue. > > In a nutshell, to recreate: make sure there is a mingetty running on > tty1, blast traffic into the system on eth0 (sky2). > > > > > On 12/31/2009 1:09 PM, Michael Breuer wrote: >> Did some more digging today... Haven't nailed it, but there's >> something going on with X and tty... >> >> Among other things, when I telinit 3 && telinit 5 the tty keeps >> switching between 7 and 8 (nothing else running on either tty). It >> would appear that somehow the tty deallocation isn't complete when X >> restarts. Also, X grabbing a tty seems to be a requisite step in >> recreating the sky2 issue. >> >> On 12/30/2009 1:10 PM, Stephen Hemminger wrote: >>> On Wed, 30 Dec 2009 10:40:56 -0500 >>> Michael Breuer<mbreuer@majjas.com> wrote: >>> >>>> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x40000008 >>>> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=31...40 report=35 >>>> done=35 >>>> Dec 30 10:37:14 mail kernel: 30: 0x0: 0xeb50c0ca(9014) >>>> Dec 30 10:37:14 mail kernel: 32: 0x0: 0xeb5080ca(9014) >>>> Dec 30 10:37:14 mail kernel: 34: 0x0: 0xeb5040ca(9014) >>>> Dec 30 10:37:14 mail kernel: 36: 0x0: 0xeb5000ca(9014) >>>> Dec 30 10:37:14 mail kernel: 38: 0x0: 0xeb4fc0ca(9014) >>> You must be using 9K MTU. Do you see the problem with smaller >>> MTU? >>> >>>> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x40000008 >>>> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=87...92 report=89 >>>> done=91 >>>> Dec 30 10:37:14 mail kernel: 86: 0x0: 0xeafd80ca(9014) >>>> Dec 30 10:37:14 mail kernel: 88: 0x0: 0xeafd40ca(9014) >>>> Dec 30 10:37:14 mail kernel: 90: 0x0: 0xeafd00ca(9014) >>>> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x8 >>>> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=39...42 report=39 >>>> done=41 >>>> Dec 30 10:37:15 mail kernel: 38: 0x0: 0xe5fec0ca(9014) >>>> Dec 30 10:37:15 mail kernel: 40: 0x0: 0xe5fe80ca(9014) >>>> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >>>> status=0x40000008 >>>> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=117...124 report=119 >>>> done=119 >>>> Dec 30 10:37:15 mail kernel: 116: 0x0: 0xe5d6c0ca(9014) >>>> Dec 30 10:37:15 mail kernel: 118: 0x0: 0xe5d680ca(9014) >>>> Dec 30 10:37:15 mail kernel: 120: 0x0: 0xe5d660ca(5938) >>>> Dec 30 10:37:15 mail kernel: 122: 0x0: 0xe5d640ca(4213) >> > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (tty NULL write) 2010-01-01 19:26 ` sky2 panic in 2.6.32.1 under load (tty NULL write) Michael Breuer @ 2010-01-01 20:34 ` Michael Breuer 2010-01-02 21:42 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2010-01-01 20:34 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev In single user mode: Looks like the first attempt to run mingetty on tty1 triggers a vhangup of more than tty1. Also, the initial console (shell, etc) were open on tty1 when I started mingetty on tty1. Looks like the subsequent vhangup is what broke things. Also, the vhangup issued from mingetty seems to have resulted in tty2 activity as well, although I don't see why it should have. Scenario: Started in single user mode. did initctl start tty2 and then switched to tty2 and logged in: Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty2 vhangup... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=1)... Jan 1 15:11:42 mail kernel: freeing tty structure... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty2 vhangup... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... Then (and note that time should have advanced, but didn't): initctl start tty1 (previously, the initial shell from single user mode was up): Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty1 vhangup... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=3)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=3)... Jan 1 15:11:42 mail kernel: tty1 vhangup... Jan 1 15:11:42 mail kernel: error attempted to write to tty [0x(null)] = NULL Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=4)... While I was doing this, I was watching dmesg, not /var/log/message. Didn't notice time was stuck... Then, I did telinit 5: Jan 1 15:11:42 mail kernel: sky2 eth1: enabling interface Jan 1 15:11:42 mail kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready Jan 1 15:11:42 mail kernel: sky2 eth1: Link is up at 100 Mbps, full duplex, flow control rx Jan 1 15:11:42 mail kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=4)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: IPv6 over IPv4 tunneling driver Jan 1 15:11:42 mail kernel: sit0: Disabled Privacy Extensions Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: SELinux: initialized (dev cgroup, type cgroup), uses genfs_contexts Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: audit rule for LSM 'dhclient_t' is invalid Jan 1 15:11:42 mail kernel: audit rule for LSM 'mcstransd_t' is invalid Jan 1 15:11:42 mail kernel: audit rule for LSM 'samba_t' is invalid Jan 1 15:11:42 mail kernel: audit rule for LSM 'dhclient_t' is invalid Jan 1 15:11:42 mail kernel: audit rule for LSM 'mcstransd_t' is invalid Jan 1 15:11:42 mail kernel: audit rule for LSM 'samba_t' is invalid Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty2... Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=4)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail cpuspeed: Enabling ondemand cpu frequency scaling governor Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:42 mail kdump: kexec: loaded kdump kernel Jan 1 15:11:42 mail kdump: started up Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... Jan 1 15:11:42 mail kernel: opening tty1... Jan 1 15:11:43 mail kernel: tty_release_dev of tty1 (tty count=2)... This is when time increments again ... time jumped ahead and was OK by the time ntp started. Looks like system time was OK, but syslog was backlogged. On 01/01/2010 02:26 PM, Michael Breuer wrote: > Running with TTY_DEBUG_HANGUP: > > On boot - lots of stuff opening and closing tty1. Once up and stable > (runlevel 3), I did initctl stop tty1, and then initctl start tty1. > This is what was logged: > > Jan 1 14:03:21 mail root: close mingetty1 > Jan 1 14:03:21 mail kernel: tty_release_dev of tty1 (tty count=6)... > Jan 1 14:03:21 mail kernel: tty_release_dev of tty1 (tty count=5)... > Jan 1 14:03:21 mail kernel: tty1 vhangup... > Jan 1 14:03:21 mail init: tty1 main process (6031) killed by TERM signal > Jan 1 14:03:36 mail root: start mingetty1 > Jan 1 14:03:36 mail kernel: error attempted to write to tty > [0x(null)] = NULL > Jan 1 14:03:36 mail kernel: opening tty1... > Jan 1 14:03:36 mail kernel: tty1 vhangup... > Jan 1 14:03:36 mail kernel: tty_release_dev of tty1 (tty count=5)... > > Start/stop of mingetty on other ttys does not get the NULL write > error. I am always getting the NULL error when starting mingetty on tty1. > > For example: > Jan 1 14:04:11 mail root: restart mingetty2 > Jan 1 14:04:11 mail init: tty2 main process (6029) killed by TERM signal > Jan 1 14:04:11 mail kernel: tty_release_dev of tty7 (tty count=2)... > Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... > Jan 1 14:04:11 mail kernel: freeing tty structure... > Jan 1 14:04:11 mail kernel: opening tty2... > Jan 1 14:04:11 mail kernel: tty2 vhangup... > Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... > Jan 1 14:04:11 mail kernel: freeing tty structure... > > While starting/stopping tty2 and X was coming up, I also got this: > > Jan 1 14:04:11 mail root: restart mingetty2 > Jan 1 14:04:11 mail init: tty2 main process (6029) killed by TERM signal > Jan 1 14:04:11 mail kernel: tty_release_dev of tty7 (tty count=2)... > Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... > Jan 1 14:04:11 mail kernel: freeing tty structure... > Jan 1 14:04:11 mail kernel: opening tty2... > Jan 1 14:04:11 mail kernel: tty2 vhangup... > Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... > Jan 1 14:04:11 mail kernel: freeing tty structure... > Jan 1 14:04:17 mail gnome-session[6832]: WARNING: Could not parse > desktop file > /home/mbreuer/.config/autostart/xfce4-tips-autostart.desktop: Key file > does not have key 'Name' > Jan 1 14:04:17 mail gnome-session[6832]: WARNING: could not read > /home/mbreuer/.config/autostart/xfce4-tips-autostart.desktop > Jan 1 14:04:19 mail pulseaudio[7002]: pid.c: Stale PID file, > overwriting. > Jan 1 14:04:20 mail kernel: opening tty2... > Jan 1 14:04:20 mail kernel: opening pts2... > Jan 1 14:04:20 mail kernel: opening pts2... > Jan 1 14:04:20 mail kernel: tty_release_dev of pts2 (tty count=3)... > Jan 1 14:04:20 mail kernel: opening pts2... > Jan 1 14:04:20 mail kernel: tty_release_dev of pts2 (tty count=3)... > Jan 1 14:04:20 mail kernel: opening pts3... > Jan 1 14:04:20 mail kernel: opening pts3... > Jan 1 14:04:20 mail kernel: tty_release_dev of pts3 (tty count=3)... > Jan 1 14:04:20 mail kernel: opening pts3... > Jan 1 14:04:20 mail kernel: tty_release_dev of pts3 (tty count=3)... > Jan 1 14:04:20 mail kernel: opening pts4... > Jan 1 14:04:20 mail kernel: opening pts4... > Jan 1 14:04:20 mail kernel: tty_release_dev of pts4 (tty count=3)... > Jan 1 14:04:20 mail kernel: opening pts4... > Jan 1 14:04:21 mail kernel: tty_release_dev of pts4 (tty count=3)... > Jan 1 14:04:21 mail kernel: DMA-API: debugging out of memory - disabling > Jan 1 14:04:23 mail pulseaudio[7417]: pid.c: Daemon already running. > Jan 1 14:04:24 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x8 > Jan 1 14:04:24 mail kernel: sky2 Tx ring pending=45...52 report=45 > done=47 > Jan 1 14:04:24 mail kernel: 44: 0x0: 0xf26a40be(9014) > Jan 1 14:04:24 mail kernel: 46: 0x0: 0xf26a00be(9014) > Jan 1 14:04:24 mail kernel: 48: 0x0: 0xf269c0be(9014) > Jan 1 14:04:24 mail kernel: 50: 0x0: 0xf26980be(9014) > Jan 1 14:04:24 mail kernel: sky2 0000:06:00.0: error interrupt > status=0x8 > Jan 1 14:04:24 mail kernel: sky2 Tx ring pending=47...52 report=47 > done=49 > Jan 1 14:04:24 mail kernel: 46: 0x0: 0xf26a00be(9014) > Jan 1 14:04:24 mail kernel: 48: 0x0: 0xf269c0be(9014) > Jan 1 14:04:24 mail kernel: 50: 0x0: 0xf26980be(9014) > > Note the sky2 errors (this time not under load). > > It looks like there is something bad going on with the tty driver. > > > On 01/01/2010 12:42 PM, Michael Breuer wrote: >> More results... Seems that this is related to the tty (or maybe vt or >> maybe console) driver. I had disabled mingetty on tty1 as a >> workaround to a Fedora/KDE issue. When I reenabled it, I was then >> able to recreate this problem (sky2) in runlevel 3. >> >> Retesting, it seems that the following conditions must be met in >> order for the interrupt errors to happen: >> >> 1) mingetty running on tty1 before the start of the test run >> 2) High sustained incoming load - 25MB/sec or more (as reported by >> nethogs) >> 3) I've only been able to recreate this using samba & a Windows 7 >> backup - but there could be other triggers >> >> As it happens, much of my other data all came down to mingetty >> starting or not starting on tty1 at specific points of test >> preparation. At this point I can recreate 100% of the time if I >> ensure that there is a mingetty running on tty1, regardless of >> runlevel; and I can not recreate 100% of the time when there is no >> mingetty running on tty1 at the start of the test. The results do not >> change if I start or stop mingetty on tty1 during the test. It's >> possible that X running on tty1 may also trigger the issue, but I >> haven't tested that. Note that running mingetty on the other tty's >> does not affect the outcome of the test. The test is also unaffected >> if I chvt to or away from tty1 before the test. I do not have to be >> logged in at the console (on any vt) to trigger the issue. >> >> In a nutshell, to recreate: make sure there is a mingetty running on >> tty1, blast traffic into the system on eth0 (sky2). >> >> >> >> >> On 12/31/2009 1:09 PM, Michael Breuer wrote: >>> Did some more digging today... Haven't nailed it, but there's >>> something going on with X and tty... >>> >>> Among other things, when I telinit 3 && telinit 5 the tty keeps >>> switching between 7 and 8 (nothing else running on either tty). It >>> would appear that somehow the tty deallocation isn't complete when X >>> restarts. Also, X grabbing a tty seems to be a requisite step in >>> recreating the sky2 issue. >>> >>> On 12/30/2009 1:10 PM, Stephen Hemminger wrote: >>>> On Wed, 30 Dec 2009 10:40:56 -0500 >>>> Michael Breuer<mbreuer@majjas.com> wrote: >>>> >>>>> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >>>>> status=0x40000008 >>>>> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=31...40 >>>>> report=35 done=35 >>>>> Dec 30 10:37:14 mail kernel: 30: 0x0: 0xeb50c0ca(9014) >>>>> Dec 30 10:37:14 mail kernel: 32: 0x0: 0xeb5080ca(9014) >>>>> Dec 30 10:37:14 mail kernel: 34: 0x0: 0xeb5040ca(9014) >>>>> Dec 30 10:37:14 mail kernel: 36: 0x0: 0xeb5000ca(9014) >>>>> Dec 30 10:37:14 mail kernel: 38: 0x0: 0xeb4fc0ca(9014) >>>> You must be using 9K MTU. Do you see the problem with smaller >>>> MTU? >>>> >>>>> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >>>>> status=0x40000008 >>>>> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=87...92 >>>>> report=89 done=91 >>>>> Dec 30 10:37:14 mail kernel: 86: 0x0: 0xeafd80ca(9014) >>>>> Dec 30 10:37:14 mail kernel: 88: 0x0: 0xeafd40ca(9014) >>>>> Dec 30 10:37:14 mail kernel: 90: 0x0: 0xeafd00ca(9014) >>>>> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >>>>> status=0x8 >>>>> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=39...42 >>>>> report=39 done=41 >>>>> Dec 30 10:37:15 mail kernel: 38: 0x0: 0xe5fec0ca(9014) >>>>> Dec 30 10:37:15 mail kernel: 40: 0x0: 0xe5fe80ca(9014) >>>>> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >>>>> status=0x40000008 >>>>> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=117...124 >>>>> report=119 >>>>> done=119 >>>>> Dec 30 10:37:15 mail kernel: 116: 0x0: 0xe5d6c0ca(9014) >>>>> Dec 30 10:37:15 mail kernel: 118: 0x0: 0xe5d680ca(9014) >>>>> Dec 30 10:37:15 mail kernel: 120: 0x0: 0xe5d660ca(5938) >>>>> Dec 30 10:37:15 mail kernel: 122: 0x0: 0xe5d640ca(4213) >>> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe >> linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (tty NULL write) 2010-01-01 20:34 ` Michael Breuer @ 2010-01-02 21:42 ` Michael Breuer 0 siblings, 0 replies; 145+ messages in thread From: Michael Breuer @ 2010-01-02 21:42 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev Made some more progress - hopefully someone can point me in the right direction.... Put extra debugging printk's into tty_io.c and learned that when init starts the mingetty's, tty_release_dev is called for ptm0, and the tty structure is freed (according to the debug log message) - but up to this point, it had never been allocated. It looks like this is happening as the result of vhangup for pts0 being invoked while in the middle of closing (tty_release_dev) several tty1's. I'm guessing that's where something is being stepped on, but haven't tracked it down quite yet. Regardless, it's after this that I can generate sky2 network errors, and not before. This does not seem to happen if I remove /etc/events.d/tty1 before boot. On 1/1/2010 3:34 PM, Michael Breuer wrote: > In single user mode: > > Looks like the first attempt to run mingetty on tty1 triggers a > vhangup of more than tty1. Also, the initial console (shell, etc) were > open on tty1 when I started mingetty on tty1. Looks like the > subsequent vhangup is what broke things. Also, the vhangup issued from > mingetty seems to have resulted in tty2 activity as well, although I > don't see why it should have. > > Scenario: > > Started in single user mode. > did initctl start tty2 and then switched to tty2 and logged in: > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty2 vhangup... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=1)... > Jan 1 15:11:42 mail kernel: freeing tty structure... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty2 vhangup... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... > > Then (and note that time should have advanced, but didn't): initctl > start tty1 (previously, the initial shell from single user mode was up): > > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty1 vhangup... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=3)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=3)... > Jan 1 15:11:42 mail kernel: tty1 vhangup... > Jan 1 15:11:42 mail kernel: error attempted to write to tty > [0x(null)] = NULL > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=3)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=4)... > > While I was doing this, I was watching dmesg, not /var/log/message. > Didn't notice time was stuck... > Then, I did telinit 5: > Jan 1 15:11:42 mail kernel: sky2 eth1: enabling interface > Jan 1 15:11:42 mail kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready > Jan 1 15:11:42 mail kernel: sky2 eth1: Link is up at 100 Mbps, full > duplex, flow control rx > Jan 1 15:11:42 mail kernel: ADDRCONF(NETDEV_CHANGE): eth1: link > becomes ready > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=4)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: IPv6 over IPv4 tunneling driver > Jan 1 15:11:42 mail kernel: sit0: Disabled Privacy Extensions > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: SELinux: initialized (dev cgroup, type > cgroup), uses genfs_contexts > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: audit rule for LSM 'dhclient_t' is invalid > Jan 1 15:11:42 mail kernel: audit rule for LSM 'mcstransd_t' is invalid > Jan 1 15:11:42 mail kernel: audit rule for LSM 'samba_t' is invalid > Jan 1 15:11:42 mail kernel: audit rule for LSM 'dhclient_t' is invalid > Jan 1 15:11:42 mail kernel: audit rule for LSM 'mcstransd_t' is invalid > Jan 1 15:11:42 mail kernel: audit rule for LSM 'samba_t' is invalid > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty2... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty2 (tty count=4)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail cpuspeed: Enabling ondemand cpu frequency scaling > governor > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:42 mail kdump: kexec: loaded kdump kernel > Jan 1 15:11:42 mail kdump: started up > Jan 1 15:11:42 mail kernel: tty_release_dev of tty1 (tty count=2)... > Jan 1 15:11:42 mail kernel: opening tty1... > Jan 1 15:11:43 mail kernel: tty_release_dev of tty1 (tty count=2)... > > This is when time increments again ... time jumped ahead and was OK by > the time ntp started. Looks like system time was OK, but syslog was > backlogged. > > On 01/01/2010 02:26 PM, Michael Breuer wrote: >> Running with TTY_DEBUG_HANGUP: >> >> On boot - lots of stuff opening and closing tty1. Once up and stable >> (runlevel 3), I did initctl stop tty1, and then initctl start tty1. >> This is what was logged: >> >> Jan 1 14:03:21 mail root: close mingetty1 >> Jan 1 14:03:21 mail kernel: tty_release_dev of tty1 (tty count=6)... >> Jan 1 14:03:21 mail kernel: tty_release_dev of tty1 (tty count=5)... >> Jan 1 14:03:21 mail kernel: tty1 vhangup... >> Jan 1 14:03:21 mail init: tty1 main process (6031) killed by TERM >> signal >> Jan 1 14:03:36 mail root: start mingetty1 >> Jan 1 14:03:36 mail kernel: error attempted to write to tty >> [0x(null)] = NULL >> Jan 1 14:03:36 mail kernel: opening tty1... >> Jan 1 14:03:36 mail kernel: tty1 vhangup... >> Jan 1 14:03:36 mail kernel: tty_release_dev of tty1 (tty count=5)... >> >> Start/stop of mingetty on other ttys does not get the NULL write >> error. I am always getting the NULL error when starting mingetty on >> tty1. >> >> For example: >> Jan 1 14:04:11 mail root: restart mingetty2 >> Jan 1 14:04:11 mail init: tty2 main process (6029) killed by TERM >> signal >> Jan 1 14:04:11 mail kernel: tty_release_dev of tty7 (tty count=2)... >> Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... >> Jan 1 14:04:11 mail kernel: freeing tty structure... >> Jan 1 14:04:11 mail kernel: opening tty2... >> Jan 1 14:04:11 mail kernel: tty2 vhangup... >> Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... >> Jan 1 14:04:11 mail kernel: freeing tty structure... >> >> While starting/stopping tty2 and X was coming up, I also got this: >> >> Jan 1 14:04:11 mail root: restart mingetty2 >> Jan 1 14:04:11 mail init: tty2 main process (6029) killed by TERM >> signal >> Jan 1 14:04:11 mail kernel: tty_release_dev of tty7 (tty count=2)... >> Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... >> Jan 1 14:04:11 mail kernel: freeing tty structure... >> Jan 1 14:04:11 mail kernel: opening tty2... >> Jan 1 14:04:11 mail kernel: tty2 vhangup... >> Jan 1 14:04:11 mail kernel: tty_release_dev of tty2 (tty count=1)... >> Jan 1 14:04:11 mail kernel: freeing tty structure... >> Jan 1 14:04:17 mail gnome-session[6832]: WARNING: Could not parse >> desktop file >> /home/mbreuer/.config/autostart/xfce4-tips-autostart.desktop: Key >> file does not have key 'Name' >> Jan 1 14:04:17 mail gnome-session[6832]: WARNING: could not read >> /home/mbreuer/.config/autostart/xfce4-tips-autostart.desktop >> Jan 1 14:04:19 mail pulseaudio[7002]: pid.c: Stale PID file, >> overwriting. >> Jan 1 14:04:20 mail kernel: opening tty2... >> Jan 1 14:04:20 mail kernel: opening pts2... >> Jan 1 14:04:20 mail kernel: opening pts2... >> Jan 1 14:04:20 mail kernel: tty_release_dev of pts2 (tty count=3)... >> Jan 1 14:04:20 mail kernel: opening pts2... >> Jan 1 14:04:20 mail kernel: tty_release_dev of pts2 (tty count=3)... >> Jan 1 14:04:20 mail kernel: opening pts3... >> Jan 1 14:04:20 mail kernel: opening pts3... >> Jan 1 14:04:20 mail kernel: tty_release_dev of pts3 (tty count=3)... >> Jan 1 14:04:20 mail kernel: opening pts3... >> Jan 1 14:04:20 mail kernel: tty_release_dev of pts3 (tty count=3)... >> Jan 1 14:04:20 mail kernel: opening pts4... >> Jan 1 14:04:20 mail kernel: opening pts4... >> Jan 1 14:04:20 mail kernel: tty_release_dev of pts4 (tty count=3)... >> Jan 1 14:04:20 mail kernel: opening pts4... >> Jan 1 14:04:21 mail kernel: tty_release_dev of pts4 (tty count=3)... >> Jan 1 14:04:21 mail kernel: DMA-API: debugging out of memory - >> disabling >> Jan 1 14:04:23 mail pulseaudio[7417]: pid.c: Daemon already running. >> Jan 1 14:04:24 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x8 >> Jan 1 14:04:24 mail kernel: sky2 Tx ring pending=45...52 report=45 >> done=47 >> Jan 1 14:04:24 mail kernel: 44: 0x0: 0xf26a40be(9014) >> Jan 1 14:04:24 mail kernel: 46: 0x0: 0xf26a00be(9014) >> Jan 1 14:04:24 mail kernel: 48: 0x0: 0xf269c0be(9014) >> Jan 1 14:04:24 mail kernel: 50: 0x0: 0xf26980be(9014) >> Jan 1 14:04:24 mail kernel: sky2 0000:06:00.0: error interrupt >> status=0x8 >> Jan 1 14:04:24 mail kernel: sky2 Tx ring pending=47...52 report=47 >> done=49 >> Jan 1 14:04:24 mail kernel: 46: 0x0: 0xf26a00be(9014) >> Jan 1 14:04:24 mail kernel: 48: 0x0: 0xf269c0be(9014) >> Jan 1 14:04:24 mail kernel: 50: 0x0: 0xf26980be(9014) >> >> Note the sky2 errors (this time not under load). >> >> It looks like there is something bad going on with the tty driver. >> >> >> On 01/01/2010 12:42 PM, Michael Breuer wrote: >>> More results... Seems that this is related to the tty (or maybe vt >>> or maybe console) driver. I had disabled mingetty on tty1 as a >>> workaround to a Fedora/KDE issue. When I reenabled it, I was then >>> able to recreate this problem (sky2) in runlevel 3. >>> >>> Retesting, it seems that the following conditions must be met in >>> order for the interrupt errors to happen: >>> >>> 1) mingetty running on tty1 before the start of the test run >>> 2) High sustained incoming load - 25MB/sec or more (as reported by >>> nethogs) >>> 3) I've only been able to recreate this using samba & a Windows 7 >>> backup - but there could be other triggers >>> >>> As it happens, much of my other data all came down to mingetty >>> starting or not starting on tty1 at specific points of test >>> preparation. At this point I can recreate 100% of the time if I >>> ensure that there is a mingetty running on tty1, regardless of >>> runlevel; and I can not recreate 100% of the time when there is no >>> mingetty running on tty1 at the start of the test. The results do >>> not change if I start or stop mingetty on tty1 during the test. It's >>> possible that X running on tty1 may also trigger the issue, but I >>> haven't tested that. Note that running mingetty on the other tty's >>> does not affect the outcome of the test. The test is also unaffected >>> if I chvt to or away from tty1 before the test. I do not have to be >>> logged in at the console (on any vt) to trigger the issue. >>> >>> In a nutshell, to recreate: make sure there is a mingetty running on >>> tty1, blast traffic into the system on eth0 (sky2). >>> >>> >>> >>> >>> On 12/31/2009 1:09 PM, Michael Breuer wrote: >>>> Did some more digging today... Haven't nailed it, but there's >>>> something going on with X and tty... >>>> >>>> Among other things, when I telinit 3 && telinit 5 the tty keeps >>>> switching between 7 and 8 (nothing else running on either tty). It >>>> would appear that somehow the tty deallocation isn't complete when >>>> X restarts. Also, X grabbing a tty seems to be a requisite step in >>>> recreating the sky2 issue. >>>> >>>> On 12/30/2009 1:10 PM, Stephen Hemminger wrote: >>>>> On Wed, 30 Dec 2009 10:40:56 -0500 >>>>> Michael Breuer<mbreuer@majjas.com> wrote: >>>>> >>>>>> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >>>>>> status=0x40000008 >>>>>> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=31...40 >>>>>> report=35 done=35 >>>>>> Dec 30 10:37:14 mail kernel: 30: 0x0: 0xeb50c0ca(9014) >>>>>> Dec 30 10:37:14 mail kernel: 32: 0x0: 0xeb5080ca(9014) >>>>>> Dec 30 10:37:14 mail kernel: 34: 0x0: 0xeb5040ca(9014) >>>>>> Dec 30 10:37:14 mail kernel: 36: 0x0: 0xeb5000ca(9014) >>>>>> Dec 30 10:37:14 mail kernel: 38: 0x0: 0xeb4fc0ca(9014) >>>>> You must be using 9K MTU. Do you see the problem with smaller >>>>> MTU? >>>>> >>>>>> Dec 30 10:37:14 mail kernel: sky2 0000:06:00.0: error interrupt >>>>>> status=0x40000008 >>>>>> Dec 30 10:37:14 mail kernel: sky2 Tx ring pending=87...92 >>>>>> report=89 done=91 >>>>>> Dec 30 10:37:14 mail kernel: 86: 0x0: 0xeafd80ca(9014) >>>>>> Dec 30 10:37:14 mail kernel: 88: 0x0: 0xeafd40ca(9014) >>>>>> Dec 30 10:37:14 mail kernel: 90: 0x0: 0xeafd00ca(9014) >>>>>> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >>>>>> status=0x8 >>>>>> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=39...42 >>>>>> report=39 done=41 >>>>>> Dec 30 10:37:15 mail kernel: 38: 0x0: 0xe5fec0ca(9014) >>>>>> Dec 30 10:37:15 mail kernel: 40: 0x0: 0xe5fe80ca(9014) >>>>>> Dec 30 10:37:15 mail kernel: sky2 0000:06:00.0: error interrupt >>>>>> status=0x40000008 >>>>>> Dec 30 10:37:15 mail kernel: sky2 Tx ring pending=117...124 >>>>>> report=119 >>>>>> done=119 >>>>>> Dec 30 10:37:15 mail kernel: 116: 0x0: 0xe5d6c0ca(9014) >>>>>> Dec 30 10:37:15 mail kernel: 118: 0x0: 0xe5d680ca(9014) >>>>>> Dec 30 10:37:15 mail kernel: 120: 0x0: 0xe5d660ca(5938) >>>>>> Dec 30 10:37:15 mail kernel: 122: 0x0: 0xe5d640ca(4213) >>>> >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-kernel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> Please read the FAQ at http://www.tux.org/lkml/ >> >> -- >> To unsubscribe from this list: send the line "unsubscribe >> linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-25 16:28 ` Michael Breuer 2009-12-25 23:22 ` Stephen Hemminger @ 2009-12-29 19:15 ` Jarek Poplawski 2009-12-29 19:20 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Jarek Poplawski @ 2009-12-29 19:15 UTC (permalink / raw) To: Michael Breuer Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev, Stephen Hemminger Michael Breuer wrote, On 12/25/2009 05:28 PM: > More data points - I'm able to reliably recreate this now. > While I thought it was coincidence, each and every time I hit this issue > there is a DHCP renew event immediately before the first error. Maybe it's totally unconnected and some different NetworkManager, but FYI, there was reported a strange interaction of KNetworkManager vs 2.6.32: http://bugzilla.kernel.org/show_bug.cgi?id=14794 Jarek P. > The crash occurs while under load - in my case seems that the traffic is > actually IPV6 (hadn't noticed that before). > I ran nethogs on a remote display - the reported rx rate on the IPV6 smb > connection at the time of the lockup was 33889.688 KB/sec on a 1gbit > nic. I've got two events like this - don't recall if the earlier one was > the exact same # - but it was in the ballpark. > ... >>> Dec 23 17:33:29 mail NetworkManager:<info> (eth0): carrier now ON >>> (device state 1) >>> Dec 23 17:33:30 mail named[4284]: client 114.80.99.2#44695: view >>> External: query (cache) './NS/IN' denied >>> Dec 23 17:33:38 mail named[4284]: error (connection refused) resolving >>> '121.93.27.216.blackholes.mail-abuse.org/A/IN': 168.61.5.195#53 >>> Dec 23 17:33:44 mail kernel: sky2 eth0: tx timeout >>> Dec 23 17:33:44 mail kernel: sky2 eth0: disabling interface >>> Dec 23 17:33:44 mail kernel: sky2 eth0: enabling interface >>> Dec 23 17:33:44 mail NetworkManager:<info> (eth0): carrier now OFF >>> (device state 1) >>> Dec 23 17:33:49 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >>> duplex, flow control both >>> Dec 23 17:33:49 mail NetworkManager:<info> (eth0): carrier now ON >>> (device state 1) >>> Dec 23 17:33:52 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0 >>> Dec 23 17:33:52 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via >>> eth0 >>> Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] >>> lib/util_sock.c:539(read_fd_with_timeout) >>> Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] >>> lib/util_sock.c:1491(get_peer_addr_internal) >>> Dec 23 17:34:01 mail smbd[24783]: getpeername failed. Error was >>> Transport endpoint is not connected >>> Dec 23 17:34:01 mail smbd[24783]: read_fd_with_timeout: client 0.0.0.0 >>> read error = Connection reset by peer. >>> Dec 23 17:34:04 mail abrt: Kerneloops: Reported 1 kernel oopses to Abrt >>> Dec 23 17:34:04 mail abrtd: Directory 'kerneloops-1261607644-1' creation >>> detected >>> Dec 23 17:34:04 mail abrtd: Getting local universal unique identification >>> Dec 23 17:34:04 mail abrtd: New crash, saving >>> Dec 23 17:34:04 mail abrtd: Activation of plugin 'RunApp' was not >>> successful: Plugin 'RunApp' is not registered >>> Dec 23 17:34:04 mail kernel: sky2 eth0: tx timeout >>> Dec 23 17:34:04 mail kernel: sky2 eth0: disabling interface >>> Dec 23 17:34:04 mail kernel: sky2 eth0: enabling interface >>> Dec 23 17:34:04 mail NetworkManager:<info> (eth0): carrier now OFF >>> (device state 1) >>> Dec 23 17:34:09 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >>> duplex, flow control both >>> Dec 23 17:34:09 mail NetworkManager:<info> (eth0): carrier now ON >>> (device state 1) >>> Dec 23 17:34:11 mail logger: ping of potter failed >>> Dec 23 17:34:19 mail kernel: sky2 eth0: tx timeout >>> Dec 23 17:34:19 mail kernel: sky2 eth0: disabling interface >>> Dec 23 17:34:19 mail kernel: sky2 eth0: enabling interface >>> Dec 23 17:34:19 mail NetworkManager:<info> (eth0): carrier now OFF >>> (device state 1) >>> Dec 23 17:34:22 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >>> duplex, flow control both >>> Dec 23 17:34:22 mail NetworkManager:<info> (eth0): carrier now ON >>> (device state 1) ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-29 19:15 ` sky2 panic in 2.6.32.1 under load (new oops) Jarek Poplawski @ 2009-12-29 19:20 ` Michael Breuer 0 siblings, 0 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-29 19:20 UTC (permalink / raw) To: Jarek Poplawski Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev, Stephen Hemminger On 12/29/2009 2:15 PM, Jarek Poplawski wrote: > Michael Breuer wrote, On 12/25/2009 05:28 PM: > > >> More data points - I'm able to reliably recreate this now. >> While I thought it was coincidence, each and every time I hit this issue >> there is a DHCP renew event immediately before the first error. >> > > Maybe it's totally unconnected and some different NetworkManager, but FYI, > there was reported a strange interaction of KNetworkManager vs 2.6.32: > http://bugzilla.kernel.org/show_bug.cgi?id=14794 > > Jarek P. > Interesting, but I don't think it's related. The interface in question is not under the control of NetworkManager; I'm not using KDE, and the connection isn't dropping - just getting DHCPINFORM from the attached client that is generating load. > >> The crash occurs while under load - in my case seems that the traffic is >> actually IPV6 (hadn't noticed that before). >> I ran nethogs on a remote display - the reported rx rate on the IPV6 smb >> connection at the time of the lockup was 33889.688 KB/sec on a 1gbit >> nic. I've got two events like this - don't recall if the earlier one was >> the exact same # - but it was in the ballpark. >> >> > ... > >>>> Dec 23 17:33:29 mail NetworkManager:<info> (eth0): carrier now ON >>>> (device state 1) >>>> Dec 23 17:33:30 mail named[4284]: client 114.80.99.2#44695: view >>>> External: query (cache) './NS/IN' denied >>>> Dec 23 17:33:38 mail named[4284]: error (connection refused) resolving >>>> '121.93.27.216.blackholes.mail-abuse.org/A/IN': 168.61.5.195#53 >>>> Dec 23 17:33:44 mail kernel: sky2 eth0: tx timeout >>>> Dec 23 17:33:44 mail kernel: sky2 eth0: disabling interface >>>> Dec 23 17:33:44 mail kernel: sky2 eth0: enabling interface >>>> Dec 23 17:33:44 mail NetworkManager:<info> (eth0): carrier now OFF >>>> (device state 1) >>>> Dec 23 17:33:49 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >>>> duplex, flow control both >>>> Dec 23 17:33:49 mail NetworkManager:<info> (eth0): carrier now ON >>>> (device state 1) >>>> Dec 23 17:33:52 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0 >>>> Dec 23 17:33:52 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via >>>> eth0 >>>> Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] >>>> lib/util_sock.c:539(read_fd_with_timeout) >>>> Dec 23 17:34:01 mail smbd[24783]: [2009/12/23 17:34:01, 0] >>>> lib/util_sock.c:1491(get_peer_addr_internal) >>>> Dec 23 17:34:01 mail smbd[24783]: getpeername failed. Error was >>>> Transport endpoint is not connected >>>> Dec 23 17:34:01 mail smbd[24783]: read_fd_with_timeout: client 0.0.0.0 >>>> read error = Connection reset by peer. >>>> Dec 23 17:34:04 mail abrt: Kerneloops: Reported 1 kernel oopses to Abrt >>>> Dec 23 17:34:04 mail abrtd: Directory 'kerneloops-1261607644-1' creation >>>> detected >>>> Dec 23 17:34:04 mail abrtd: Getting local universal unique identification >>>> Dec 23 17:34:04 mail abrtd: New crash, saving >>>> Dec 23 17:34:04 mail abrtd: Activation of plugin 'RunApp' was not >>>> successful: Plugin 'RunApp' is not registered >>>> Dec 23 17:34:04 mail kernel: sky2 eth0: tx timeout >>>> Dec 23 17:34:04 mail kernel: sky2 eth0: disabling interface >>>> Dec 23 17:34:04 mail kernel: sky2 eth0: enabling interface >>>> Dec 23 17:34:04 mail NetworkManager:<info> (eth0): carrier now OFF >>>> (device state 1) >>>> Dec 23 17:34:09 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >>>> duplex, flow control both >>>> Dec 23 17:34:09 mail NetworkManager:<info> (eth0): carrier now ON >>>> (device state 1) >>>> Dec 23 17:34:11 mail logger: ping of potter failed >>>> Dec 23 17:34:19 mail kernel: sky2 eth0: tx timeout >>>> Dec 23 17:34:19 mail kernel: sky2 eth0: disabling interface >>>> Dec 23 17:34:19 mail kernel: sky2 eth0: enabling interface >>>> Dec 23 17:34:19 mail NetworkManager:<info> (eth0): carrier now OFF >>>> (device state 1) >>>> Dec 23 17:34:22 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full >>>> duplex, flow control both >>>> Dec 23 17:34:22 mail NetworkManager:<info> (eth0): carrier now ON >>>> (device state 1) >>>> ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-24 7:01 ` Andrew Morton 2009-12-24 19:18 ` Michael Breuer 2009-12-25 16:28 ` Michael Breuer @ 2009-12-30 8:07 ` Stephen Hemminger 2009-12-30 15:36 ` Michael Breuer 2 siblings, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2009-12-30 8:07 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev On Wed, 23 Dec 2009 23:01:02 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > > Dec 23 17:32:58 mail kernel: DRHD: handling fault status reg 2 > > Dec 23 17:32:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] > > fault addr fff8423fe000 > > Dec 23 17:32:58 mail kernel: DMAR:[fault reason 06] PTE Read access is > > not set > > Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: error interrupt > > status=0x80000000 > > Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) > > Dec 23 17:33:06 mail kernel: ------------[ cut here ]------------ > > Dec 23 17:33:06 mail kernel: WARNING: at net/sched/sch_generic.c:255 > > dev_watchdog+0xf3/0x161() > > Dec 23 17:33:06 mail kernel: Hardware name: System Product Name > > Dec 23 17:33:06 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit Just to reiterate, these are consistent with a DMA error on transmit. Since is 4 seconds after the DHCPACK, they probably are not related, except that when DHCP renew may have caused more traffic. Are you using something like network console or kgdb, or anything that could be using some unusual network I/O? ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load (new oops) 2009-12-30 8:07 ` Stephen Hemminger @ 2009-12-30 15:36 ` Michael Breuer 0 siblings, 0 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-30 15:36 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, Berck E. Nash, linux-kernel, netdev What has been running on eth0 (usual for me ... perhaps relevant to the issue) when I've seen the errors. As noted previously, there is a correlation with dhcp, and maybe one with dns over ipv6. Console is local (nouveau kms). As per Stephen's other email, I'm rerunning with kernel debug enabled. I'll also try nomodeset and could if need by (but a real PITA) can swap video cards with an ATI card. ipv6 appletalk dhcpd smb dns (bind9) X11 (accessing box via cygwin) sendmail imap mediatomb On 12/30/2009 3:07 AM, Stephen Hemminger wrote: > On Wed, 23 Dec 2009 23:01:02 -0800 > Andrew Morton<akpm@linux-foundation.org> wrote: > > >>> Dec 23 17:32:58 mail kernel: DRHD: handling fault status reg 2 >>> Dec 23 17:32:58 mail kernel: DMAR:[DMA Read] Request device [06:00.0] >>> fault addr fff8423fe000 >>> Dec 23 17:32:58 mail kernel: DMAR:[fault reason 06] PTE Read access is >>> not set >>> Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: error interrupt >>> status=0x80000000 >>> Dec 23 17:32:58 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010) >>> Dec 23 17:33:06 mail kernel: ------------[ cut here ]------------ >>> Dec 23 17:33:06 mail kernel: WARNING: at net/sched/sch_generic.c:255 >>> dev_watchdog+0xf3/0x161() >>> Dec 23 17:33:06 mail kernel: Hardware name: System Product Name >>> Dec 23 17:33:06 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit >>> > Just to reiterate, these are consistent with a DMA error on transmit. > Since is 4 seconds after the DHCPACK, they probably are not related, except > that when DHCP renew may have caused more traffic. > > Are you using something like network console or kgdb, or anything that could > be using some unusual network I/O? > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-21 23:52 sky2 panic in 2.6.32.1 under load Berck E. Nash 2009-12-22 0:09 ` Michael Breuer @ 2009-12-22 0:52 ` Daniel Hazelton 2009-12-24 6:58 ` Andrew Morton 2 siblings, 0 replies; 145+ messages in thread From: Daniel Hazelton @ 2009-12-22 0:52 UTC (permalink / raw) To: Berck E. Nash; +Cc: linux-kernel On Monday 21 December 2009 06:52:10 pm Berck E. Nash wrote: > Since 2.6.32, I've been getting kernel panics under heavy network load > (bittorrent usage). > > [ 8673.345873] sky2 eth0: receiver hang detected > [ 8673.350368] sky2 eth0: disabling interface > [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at > 0000000000000010 > [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 > [ 8673.373838] Oops: 0002 [#1] SMP > [ 8673.373838] last sysfs file: > /sys/devices/platform/coretemp.1/temp1_label [ 8673.383900] CPU 0 > [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl > auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer > hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 > snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec > snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss > snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common > snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd > v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 > usbcore tveeprom > [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 > #3 P5W DH Deluxe > [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] > sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 > [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: > 0000000000000000 > [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: > 0000000000000000 > [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: > 0000000000000000 > [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: > ffff88009bf3ab48 > [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: > 00000000955fa87e > [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) > knlGS:0000000000000000 > [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: > 00000000000026f0 > [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: > 0000000000000400 > [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo ffff8800bf8b4000, > task ffff8800bf8a8650) > [ 8673.539212] Stack: > [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 > 000000428122a659 > [ 8673.548658] <0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 > ffffffff813eaf30 > [ 8673.556929] <0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 > ffffffff81231de8 > [ 8673.565432] Call Trace: > [ 8673.565432] <IRQ> > [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 > [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 > [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 > [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 > [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e > [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b > [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 > [ 8673.609833] <EOI> > [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b > [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 > [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 > [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 > [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 > [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 > [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 > [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 > c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 > 4b 20 <4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b > [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.674902] RSP <ffff880001603df8> > [ 8673.674902] CR2: 0000000000000010 > [ 8673.682956] hda-intel: IRQ timing workaround is activated for card > #0. Suggest a bigger bdl_pos_adj. > [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- > [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt > [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D 2.6.32.1 #3 > [ 8673.683021] Call Trace: > [ 8673.683023] <IRQ> [<ffffffff8129839d>] panic+0x75/0x11c > [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e > [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd > [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 > [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 > [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 > [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 > [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 > [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 [sky2] > [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 > [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 > [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 > [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 > [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e > [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b > [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 > [ 8673.683079] <EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b > [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 > [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 > [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 > [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 > [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 > [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 > > Note that sometimes I get the receiver hang and things continue without > the crash: > > [ 1219.346571] sky2 eth0: receiver hang detected > [ 1219.351145] sky2 eth0: disabling interface > [ 1219.357389] sky2 eth0: enabling interface > [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow > control both > > Please CC as I'm not subscribed. > > Thanks, > Berck Nash I can confirm a version of this happens - with a full Panic - as early as 2.6.25 (I have had the same machine collapse under a hack-attempt numerous times because of this) DRH > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-21 23:52 sky2 panic in 2.6.32.1 under load Berck E. Nash 2009-12-22 0:09 ` Michael Breuer 2009-12-22 0:52 ` sky2 panic in 2.6.32.1 under load Daniel Hazelton @ 2009-12-24 6:58 ` Andrew Morton 2009-12-24 16:03 ` Berck Nash 2009-12-24 16:10 ` Michael Breuer 2 siblings, 2 replies; 145+ messages in thread From: Andrew Morton @ 2009-12-24 6:58 UTC (permalink / raw) To: Berck E. Nash; +Cc: linux-kernel, netdev, Stephen Hemminger, Michael Breuer On Mon, 21 Dec 2009 16:52:10 -0700 "Berck E. Nash" <flyboy@gmail.com> wrote: > Since 2.6.32, I've been getting kernel panics under heavy network load > (bittorrent usage). Let's cc the right list and developer. This is a 2.6.31->2.6.32 regression? > [ 8673.345873] sky2 eth0: receiver hang detected > [ 8673.350368] sky2 eth0: disabling interface > [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at > 0000000000000010 Please fix the wordwrapping for next time. > [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 > [ 8673.373838] Oops: 0002 [#1] SMP > [ 8673.373838] last sysfs file: /sys/devices/platform/coretemp.1/temp1_label > [ 8673.383900] CPU 0 > [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl > auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer > hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 > snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec > snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss > snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common > snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd > v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 > usbcore tveeprom > [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 > #3 P5W DH Deluxe > [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] > sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 > [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: > 0000000000000000 > [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: > 0000000000000000 > [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: > 0000000000000000 > [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: > ffff88009bf3ab48 > [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: > 00000000955fa87e > [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) > knlGS:0000000000000000 > [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: > 00000000000026f0 > [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: > 0000000000000400 > [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo ffff8800bf8b4000, > task ffff8800bf8a8650) > [ 8673.539212] Stack: > [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 > 000000428122a659 > [ 8673.548658] <0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 > ffffffff813eaf30 > [ 8673.556929] <0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 > ffffffff81231de8 > [ 8673.565432] Call Trace: > [ 8673.565432] <IRQ> > [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 > [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 > [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 > [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 > [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e > [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b > [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 > [ 8673.609833] <EOI> > [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b > [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 > [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 > [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 > [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 > [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 > [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 > [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 > c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 > 4b 20 <4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b > [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.674902] RSP <ffff880001603df8> > [ 8673.674902] CR2: 0000000000000010 > [ 8673.682956] hda-intel: IRQ timing workaround is activated for card > #0. Suggest a bigger bdl_pos_adj. > [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- > [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt > [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D 2.6.32.1 #3 > [ 8673.683021] Call Trace: > [ 8673.683023] <IRQ> [<ffffffff8129839d>] panic+0x75/0x11c > [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e > [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd > [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 > [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 > [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 > [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 > [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 > [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 [sky2] > [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 [sky2] > [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 > [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 > [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 > [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 > [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e > [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b > [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 > [ 8673.683079] <EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b > [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 > [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 > [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 > [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 > [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 > [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 > > Note that sometimes I get the receiver hang and things continue without > the crash: > > [ 1219.346571] sky2 eth0: receiver hang detected > [ 1219.351145] sky2 eth0: disabling interface > [ 1219.357389] sky2 eth0: enabling interface > [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow > control both > > Please CC as I'm not subscribed. > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-24 6:58 ` Andrew Morton @ 2009-12-24 16:03 ` Berck Nash 2009-12-24 16:28 ` Daniel Hazelton 2009-12-24 16:10 ` Michael Breuer 1 sibling, 1 reply; 145+ messages in thread From: Berck Nash @ 2009-12-24 16:03 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, netdev, Stephen Hemminger Andrew Morton wrote: > On Mon, 21 Dec 2009 16:52:10 -0700 "Berck E. Nash" <flyboy@gmail.com> wrote: > >> Since 2.6.32, I've been getting kernel panics under heavy network load >> (bittorrent usage). > > Let's cc the right list and developer. > > This is a 2.6.31->2.6.32 regression? I believe so. Since it's intermittent and difficult to reproduce, it's possible (but unlikely) that I simply never triggered it under 2.6.31. >> [ 8673.345873] sky2 eth0: receiver hang detected >> [ 8673.350368] sky2 eth0: disabling interface >> [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at >> 0000000000000010 > > Please fix the wordwrapping for next time. Sorry! >> [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] >> [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 >> [ 8673.373838] Oops: 0002 [#1] SMP >> [ 8673.373838] last sysfs file: /sys/devices/platform/coretemp.1/temp1_label >> [ 8673.383900] CPU 0 >> [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl >> auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer >> hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 >> snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec >> snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss >> snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common >> snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd >> v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 >> usbcore tveeprom >> [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 >> #3 P5W DH Deluxe >> [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] >> sky2_xmit_frame+0x321/0x5d8 [sky2] >> [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 >> [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: >> 0000000000000000 >> [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: >> 0000000000000000 >> [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: >> 0000000000000000 >> [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: >> ffff88009bf3ab48 >> [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: >> 00000000955fa87e >> [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) >> knlGS:0000000000000000 >> [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b >> [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: >> 00000000000026f0 >> [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >> 0000000000000000 >> [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: >> 0000000000000400 >> [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo ffff8800bf8b4000, >> task ffff8800bf8a8650) >> [ 8673.539212] Stack: >> [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 >> 000000428122a659 >> [ 8673.548658] <0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 >> ffffffff813eaf30 >> [ 8673.556929] <0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 >> ffffffff81231de8 >> [ 8673.565432] Call Trace: >> [ 8673.565432] <IRQ> >> [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >> [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >> [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >> [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >> [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >> [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >> [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >> [ 8673.609833] <EOI> >> [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b >> [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >> [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >> [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 >> [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >> [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >> [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >> [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 >> c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 >> 4b 20 <4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b >> [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] >> [ 8673.674902] RSP <ffff880001603df8> >> [ 8673.674902] CR2: 0000000000000010 >> [ 8673.682956] hda-intel: IRQ timing workaround is activated for card >> #0. Suggest a bigger bdl_pos_adj. >> [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- >> [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt >> [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D 2.6.32.1 #3 >> [ 8673.683021] Call Trace: >> [ 8673.683023] <IRQ> [<ffffffff8129839d>] panic+0x75/0x11c >> [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e >> [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd >> [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 >> [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 >> [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 >> [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 >> [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 >> [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 [sky2] >> [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 [sky2] >> [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >> [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >> [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >> [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >> [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >> [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >> [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >> [ 8673.683079] <EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b >> [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >> [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >> [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 >> [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >> [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >> [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >> >> Note that sometimes I get the receiver hang and things continue without >> the crash: >> >> [ 1219.346571] sky2 eth0: receiver hang detected >> [ 1219.351145] sky2 eth0: disabling interface >> [ 1219.357389] sky2 eth0: enabling interface >> [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow >> control both >> >> Please CC as I'm not subscribed. >> > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-24 16:03 ` Berck Nash @ 2009-12-24 16:28 ` Daniel Hazelton 2009-12-24 22:21 ` Stephen Hemminger 0 siblings, 1 reply; 145+ messages in thread From: Daniel Hazelton @ 2009-12-24 16:28 UTC (permalink / raw) To: Berck Nash; +Cc: Andrew Morton, linux-kernel, netdev, Stephen Hemminger On Thursday 24 December 2009 11:03:56 am Berck Nash wrote: > Andrew Morton wrote: > > On Mon, 21 Dec 2009 16:52:10 -0700 "Berck E. Nash" <flyboy@gmail.com> wrote: > >> Since 2.6.32, I've been getting kernel panics under heavy network load > >> (bittorrent usage). > > > > Let's cc the right list and developer. > > > > This is a 2.6.31->2.6.32 regression? > > I believe so. Since it's intermittent and difficult to reproduce, it's > possible (but unlikely) that I simply never triggered it under 2.6.31. This is far from new. I have seen this under 2.6.27 when at least one botnet has been pointed at a server of mine and told to gain access. It has happened four times in the last six to eight months - and I have no easy way to capture the logs. But the oops that was posted looks very, very similar to what I've seen. It's always an allocation error in the transmit path that leads to the panic. Because this is a production machine that I do not have a way to take down and do testing with I've not reported the problem before. DRH ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-24 16:28 ` Daniel Hazelton @ 2009-12-24 22:21 ` Stephen Hemminger 2009-12-24 22:42 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Stephen Hemminger @ 2009-12-24 22:21 UTC (permalink / raw) To: Daniel Hazelton; +Cc: Berck Nash, Andrew Morton, linux-kernel, netdev On Thu, 24 Dec 2009 11:28:57 -0500 Daniel Hazelton <dhazelton@enter.net> wrote: > On Thursday 24 December 2009 11:03:56 am Berck Nash wrote: > > Andrew Morton wrote: > > > On Mon, 21 Dec 2009 16:52:10 -0700 "Berck E. Nash" <flyboy@gmail.com> > wrote: > > >> Since 2.6.32, I've been getting kernel panics under heavy network load > > >> (bittorrent usage). > > > > > > Let's cc the right list and developer. > > > > > > This is a 2.6.31->2.6.32 regression? > > > > I believe so. Since it's intermittent and difficult to reproduce, it's > > possible (but unlikely) that I simply never triggered it under 2.6.31. > > This is far from new. I have seen this under 2.6.27 when at least one botnet > has been pointed at a server of mine and told to gain access. It has happened > four times in the last six to eight months - and I have no easy way to capture > the logs. But the oops that was posted looks very, very similar to what I've > seen. > > It's always an allocation error in the transmit path that leads to the panic. > Because this is a production machine that I do not have a way to take down and > do testing with I've not reported the problem before. > Even though I wrote/maintain the sky driver, I don't work for SysKonnect, and only have access to a limited set of information: the technical manuals (under NDA), and the vendor sk98lin driver. The sky2 driver imitates the receiver timeout of the sk98lin driver; other people have told me that the FIFO hardware implementation is buggy and when it gets full, it gets stuck. Probably the equivalent of a software FIFO where the developer forgets to reserve a slot so that head == tail can mean both empty and full! The workaround with a timer is prone to errors when traffic keeps going, also the vendor doesn't really provide clear instructions on how to unlock it. I do not have access to the hardware errata describing the problem. If I did a more minimal solution would be possible. The easiest advice is avoid sky2 chips with FIFO for any heavy traffic, the next advice is make sure receive flow control is enabled so that receiver doesn't get overrun. If tx timeouts are an issue use a rate limiter like TBF. Do not use the chip with 10 or 100 mbit since the transmitter is more prone to get overrun. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-24 22:21 ` Stephen Hemminger @ 2009-12-24 22:42 ` Michael Breuer 2009-12-25 0:06 ` Daniel Hazelton 0 siblings, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-24 22:42 UTC (permalink / raw) To: Stephen Hemminger Cc: Daniel Hazelton, Berck Nash, Andrew Morton, linux-kernel, netdev On 12/24/2009 5:21 PM, Stephen Hemminger wrote: > On Thu, 24 Dec 2009 11:28:57 -0500 > Daniel Hazelton<dhazelton@enter.net> wrote: > > >> On Thursday 24 December 2009 11:03:56 am Berck Nash wrote: >> >>> Andrew Morton wrote: >>> >>>> On Mon, 21 Dec 2009 16:52:10 -0700 "Berck E. Nash"<flyboy@gmail.com> >>>> >> wrote: >> >>>>> Since 2.6.32, I've been getting kernel panics under heavy network load >>>>> (bittorrent usage). >>>>> >>>> Let's cc the right list and developer. >>>> >>>> This is a 2.6.31->2.6.32 regression? >>>> >>> I believe so. Since it's intermittent and difficult to reproduce, it's >>> possible (but unlikely) that I simply never triggered it under 2.6.31. >>> >> This is far from new. I have seen this under 2.6.27 when at least one botnet >> has been pointed at a server of mine and told to gain access. It has happened >> four times in the last six to eight months - and I have no easy way to capture >> the logs. But the oops that was posted looks very, very similar to what I've >> seen. >> >> It's always an allocation error in the transmit path that leads to the panic. >> Because this is a production machine that I do not have a way to take down and >> do testing with I've not reported the problem before. >> >> > Even though I wrote/maintain the sky driver, I don't work for SysKonnect, and > only have access to a limited set of information: > the technical manuals (under NDA), and the vendor sk98lin driver. The sky2 driver > imitates the receiver timeout of the sk98lin driver; other people have told me > that the FIFO hardware implementation is buggy and when it gets full, it gets > stuck. Probably the equivalent of a software FIFO where the developer forgets to > reserve a slot so that head == tail can mean both empty and full! > > The workaround with a timer is prone to errors when traffic keeps going, also > the vendor doesn't really provide clear instructions on how to unlock it. I do > not have access to the hardware errata describing the problem. If I did a more > minimal solution would be possible. > > The easiest advice is avoid sky2 chips with FIFO for any heavy traffic, the next > advice is make sure receive flow control is enabled so that receiver doesn't > get overrun. If tx timeouts are an issue use a rate limiter like TBF. Do not use > the chip with 10 or 100 mbit since the transmitter is more prone to get overrun. > For this particular issue, I'm only seeing problems when running at 1000 mbit. 100 appears stable. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-24 22:42 ` Michael Breuer @ 2009-12-25 0:06 ` Daniel Hazelton 0 siblings, 0 replies; 145+ messages in thread From: Daniel Hazelton @ 2009-12-25 0:06 UTC (permalink / raw) To: Michael Breuer Cc: Stephen Hemminger, Berck Nash, Andrew Morton, linux-kernel, netdev On Thursday 24 December 2009 05:42:08 pm Michael Breuer wrote: > On 12/24/2009 5:21 PM, Stephen Hemminger wrote: > > On Thu, 24 Dec 2009 11:28:57 -0500 > > > > Daniel Hazelton<dhazelton@enter.net> wrote: > >> On Thursday 24 December 2009 11:03:56 am Berck Nash wrote: > >>> Andrew Morton wrote: > >>>> On Mon, 21 Dec 2009 16:52:10 -0700 "Berck E. Nash"<flyboy@gmail.com> > >> > >> wrote: > >>>>> Since 2.6.32, I've been getting kernel panics under heavy network > >>>>> load (bittorrent usage). > >>>> > >>>> Let's cc the right list and developer. > >>>> > >>>> This is a 2.6.31->2.6.32 regression? > >>> > >>> I believe so. Since it's intermittent and difficult to reproduce, it's > >>> possible (but unlikely) that I simply never triggered it under 2.6.31. > >> > >> This is far from new. I have seen this under 2.6.27 when at least one > >> botnet has been pointed at a server of mine and told to gain access. It > >> has happened four times in the last six to eight months - and I have no > >> easy way to capture the logs. But the oops that was posted looks very, > >> very similar to what I've seen. > >> > >> It's always an allocation error in the transmit path that leads to the > >> panic. Because this is a production machine that I do not have a way to > >> take down and do testing with I've not reported the problem before. > > > > Even though I wrote/maintain the sky driver, I don't work for SysKonnect, > > and only have access to a limited set of information: > > the technical manuals (under NDA), and the vendor sk98lin driver. The > > sky2 driver imitates the receiver timeout of the sk98lin driver; other > > people have told me that the FIFO hardware implementation is buggy and > > when it gets full, it gets stuck. Probably the equivalent of a software > > FIFO where the developer forgets to reserve a slot so that head == tail > > can mean both empty and full! > > > > The workaround with a timer is prone to errors when traffic keeps going, > > also the vendor doesn't really provide clear instructions on how to > > unlock it. I do not have access to the hardware errata describing the > > problem. If I did a more minimal solution would be possible. > > > > The easiest advice is avoid sky2 chips with FIFO for any heavy traffic, > > the next advice is make sure receive flow control is enabled so that > > receiver doesn't get overrun. If tx timeouts are an issue use a rate > > limiter like TBF. Do not use the chip with 10 or 100 mbit since the > > transmitter is more prone to get overrun. > > For this particular issue, I'm only seeing problems when running at 1000 > mbit. 100 appears stable. > Not here - it is crashing under 100. I do have a different NIC available for that system and will likely switch to it when I have a chance to work on upgrading the install there. The reason I am using the Sky NIC on that system is because there are, apparently, two different NIC's on the board itself - an nForce one and a Sky2 one... DRH ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-24 6:58 ` Andrew Morton 2009-12-24 16:03 ` Berck Nash @ 2009-12-24 16:10 ` Michael Breuer 2009-12-24 16:16 ` Berck Nash 1 sibling, 1 reply; 145+ messages in thread From: Michael Breuer @ 2009-12-24 16:10 UTC (permalink / raw) To: Andrew Morton; +Cc: Berck E. Nash, linux-kernel, netdev, Stephen Hemminger On 12/24/2009 1:58 AM, Andrew Morton wrote: > On Mon, 21 Dec 2009 16:52:10 -0700 "Berck E. Nash"<flyboy@gmail.com> wrote: > > >> Since 2.6.32, I've been getting kernel panics under heavy network load >> (bittorrent usage). >> > Let's cc the right list and developer. > > This is a 2.6.31->2.6.32 regression? > I don't think so. I saw this on 2.6.31 - first kernel I put on this system. At the time, I was having issues with intel_iommu and thought that the culprit. The initial reply to this thread from Daniel Hazelton stated that this has been happening since 2.6.25. And apologies for the word wrap. I upgraded Thunderbird :(. Oddly, I don't see the word wrap when reading my own message to the list. Must be some new setting I missed. > >> [ 8673.345873] sky2 eth0: receiver hang detected >> [ 8673.350368] sky2 eth0: disabling interface >> [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at >> 0000000000000010 >> > Please fix the wordwrapping for next time. > > >> [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] >> [ 8673.364675] PGD a09a6067 PUD a09a5067 PMD 0 >> [ 8673.373838] Oops: 0002 [#1] SMP >> [ 8673.373838] last sysfs file: /sys/devices/platform/coretemp.1/temp1_label >> [ 8673.383900] CPU 0 >> [ 8673.383900] Modules linked in: nfsd exportfs nfs lockd nfs_acl >> auth_rpcgss sunrpc nls_cp437 msdos fat kvm_intel kvm fuse snd_rtctimer >> hwmon_vid nvidia(P) usbhid tuner_simple tuner_types wm8775 tda9887 >> snd_hda_codec_realtek tda8290 snd_hda_intel tuner snd_hda_codec >> snd_pcm_oss snd_mixer_oss snd_pcm cx25840 snd_seq_dummy snd_seq_oss >> snd_seq_midi_event ivtv snd_seq i2c_algo_bit cx2341x v4l2_common >> snd_timer videodev uhci_hcd snd_seq_device v4l1_compat ehci_hcd snd >> v4l2_compat_ioctl32 i2c_i801 soundcore snd_page_alloc floppy sky2 >> usbcore tveeprom >> [ 8673.435268] Pid: 4, comm: ksoftirqd/0 Tainted: P 2.6.32.1 >> #3 P5W DH Deluxe >> [ 8673.438943] RIP: 0010:[<ffffffffa00373d3>] [<ffffffffa00373d3>] >> sky2_xmit_frame+0x321/0x5d8 [sky2] >> [ 8673.448539] RSP: 0018:ffff880001603df8 EFLAGS: 00010246 >> [ 8673.458678] RAX: 0000000000000000 RBX: ffff8800bd201d40 RCX: >> 0000000000000000 >> [ 8673.465016] RDX: 0000000000000000 RSI: 000000000000008c RDI: >> 0000000000000000 >> [ 8673.472396] RBP: ffff880001603e48 R08: 0000000000000001 R09: >> 0000000000000000 >> [ 8673.474083] R10: 000000000000001e R11: 0000000000000ef8 R12: >> ffff88009bf3ab48 >> [ 8673.484317] R13: ffff8800bef4d000 R14: 0000000000000000 R15: >> 00000000955fa87e >> [ 8673.492381] FS: 0000000000000000(0000) GS:ffff880001600000(0000) >> knlGS:0000000000000000 >> [ 8673.501448] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b >> [ 8673.509707] CR2: 0000000000000010 CR3: 00000000a09bf000 CR4: >> 00000000000026f0 >> [ 8673.514146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >> 0000000000000000 >> [ 8673.520661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: >> 0000000000000400 >> [ 8673.528842] Process ksoftirqd/0 (pid: 4, threadinfo ffff8800bf8b4000, >> task ffff8800bf8a8650) >> [ 8673.539212] Stack: >> [ 8673.539212] ffff8800992f5980 ffff8800955fa87e ffff8800bd201800 >> 000000428122a659 >> [ 8673.548658]<0> ffff880001603e48 ffff88009bf3ab48 ffff8800992f5980 >> ffffffff813eaf30 >> [ 8673.556929]<0> ffff8800bd201800 ffff8800bef03940 ffff880001603e98 >> ffffffff81231de8 >> [ 8673.565432] Call Trace: >> [ 8673.565432]<IRQ> >> [ 8673.565432] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >> [ 8673.574516] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >> [ 8673.579515] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >> [ 8673.584559] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >> [ 8673.592673] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >> [ 8673.598719] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >> [ 8673.601787] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >> [ 8673.609833]<EOI> >> [ 8673.610994] [<ffffffff8100d907>] do_softirq+0x33/0x6b >> [ 8673.614419] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >> [ 8673.620082] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >> [ 8673.620082] [<ffffffff8104aee7>] kthread+0x7a/0x82 >> [ 8673.629310] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >> [ 8673.634137] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >> [ 8673.639492] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >> [ 8673.647639] Code: 06 00 00 00 00 89 08 66 c7 40 04 00 00 c6 40 06 01 >> c6 40 07 9f 41 0f b7 c6 48 89 c7 48 c1 e0 03 48 c1 e7 05 48 89 f9 48 03 >> 4b 20<4c> 89 79 10 48 c7 41 08 01 00 00 00 8b 75 cc 89 71 18 48 03 7b >> [ 8673.667726] RIP [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2] >> [ 8673.674902] RSP<ffff880001603df8> >> [ 8673.674902] CR2: 0000000000000010 >> [ 8673.682956] hda-intel: IRQ timing workaround is activated for card >> #0. Suggest a bigger bdl_pos_adj. >> [ 8673.683015] ---[ end trace 2c387977cf1c7c9a ]--- >> [ 8673.683017] Kernel panic - not syncing: Fatal exception in interrupt >> [ 8673.683020] Pid: 4, comm: ksoftirqd/0 Tainted: P D 2.6.32.1 #3 >> [ 8673.683021] Call Trace: >> [ 8673.683023]<IRQ> [<ffffffff8129839d>] panic+0x75/0x11c >> [ 8673.683030] [<ffffffff8100e9a7>] oops_end+0x81/0x8e >> [ 8673.683033] [<ffffffff81026e5f>] no_context+0x1ee/0x1fd >> [ 8673.683036] [<ffffffff81032a6d>] ? find_busiest_group+0x387/0x8d8 >> [ 8673.683039] [<ffffffff81026fe0>] __bad_area_nosemaphore+0x172/0x195 >> [ 8673.683041] [<ffffffff81027011>] bad_area_nosemaphore+0xe/0x10 >> [ 8673.683044] [<ffffffff8102724b>] do_page_fault+0x114/0x252 >> [ 8673.683047] [<ffffffff8129a97f>] page_fault+0x1f/0x30 >> [ 8673.683054] [<ffffffffa00373d3>] ? sky2_xmit_frame+0x321/0x5d8 [sky2] >> [ 8673.683058] [<ffffffffa00371b8>] ? sky2_xmit_frame+0x106/0x5d8 [sky2] >> [ 8673.683061] [<ffffffff81231de8>] dev_hard_start_xmit+0x21c/0x2b7 >> [ 8673.683065] [<ffffffff81041746>] ? run_timer_softirq+0x4f/0x1c9 >> [ 8673.683067] [<ffffffff81240252>] sch_direct_xmit+0x5e/0x154 >> [ 8673.683070] [<ffffffff81240404>] __qdisc_run+0xbc/0xd5 >> [ 8673.683072] [<ffffffff8122fb19>] net_tx_action+0xbb/0x10e >> [ 8673.683075] [<ffffffff8103dc16>] __do_softirq+0x91/0x11b >> [ 8673.683078] [<ffffffff8100be9c>] call_softirq+0x1c/0x28 >> [ 8673.683079]<EOI> [<ffffffff8100d907>] do_softirq+0x33/0x6b >> [ 8673.683083] [<ffffffff8103d71e>] ksoftirqd+0x60/0xd7 >> [ 8673.683085] [<ffffffff8103d6be>] ? ksoftirqd+0x0/0xd7 >> [ 8673.683088] [<ffffffff8104aee7>] kthread+0x7a/0x82 >> [ 8673.683090] [<ffffffff8100bd9a>] child_rip+0xa/0x20 >> [ 8673.683093] [<ffffffff8104ae6d>] ? kthread+0x0/0x82 >> [ 8673.683095] [<ffffffff8100bd90>] ? child_rip+0x0/0x20 >> >> Note that sometimes I get the receiver hang and things continue without >> the crash: >> >> [ 1219.346571] sky2 eth0: receiver hang detected >> [ 1219.351145] sky2 eth0: disabling interface >> [ 1219.357389] sky2 eth0: enabling interface >> [ 1222.572980] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow >> control both >> >> Please CC as I'm not subscribed. >> >> ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-24 16:10 ` Michael Breuer @ 2009-12-24 16:16 ` Berck Nash 2009-12-24 16:26 ` Michael Breuer 0 siblings, 1 reply; 145+ messages in thread From: Berck Nash @ 2009-12-24 16:16 UTC (permalink / raw) To: Michael Breuer; +Cc: Andrew Morton, linux-kernel, netdev, Stephen Hemminger Michael Breuer wrote: >>> Since 2.6.32, I've been getting kernel panics under heavy network load >>> (bittorrent usage). >>> >> Let's cc the right list and developer. >> >> This is a 2.6.31->2.6.32 regression? >> > I don't think so. I saw this on 2.6.31 - first kernel I put on this > system. At the time, I was having issues with intel_iommu and thought > that the culprit. > The initial reply to this thread from Daniel Hazelton stated that this > has been happening since 2.6.25. Your oops appears to be different than mine. Hazelton did not provide any logs, and he said "a version of this" so I don't know if his is the same or not. ^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: sky2 panic in 2.6.32.1 under load 2009-12-24 16:16 ` Berck Nash @ 2009-12-24 16:26 ` Michael Breuer 0 siblings, 0 replies; 145+ messages in thread From: Michael Breuer @ 2009-12-24 16:26 UTC (permalink / raw) To: Berck Nash; +Cc: Andrew Morton, linux-kernel, netdev, Stephen Hemminger On 12/24/2009 11:16 AM, Berck Nash wrote: > Michael Breuer wrote: > >>>> Since 2.6.32, I've been getting kernel panics under heavy network load >>>> (bittorrent usage). >>>> >>>> >>> Let's cc the right list and developer. >>> >>> This is a 2.6.31->2.6.32 regression? >>> >>> >> I don't think so. I saw this on 2.6.31 - first kernel I put on this >> system. At the time, I was having issues with intel_iommu and thought >> that the culprit. >> The initial reply to this thread from Daniel Hazelton stated that this >> has been happening since 2.6.25. >> > Your oops appears to be different than mine. Hazelton did not provide > any logs, and he said "a version of this" so I don't know if his is the > same or not. > Actually, I have two different oops, and a couple of crashes without any oops at all. All were produced under the same conditions, all pointing at sky2. I'd guess at memory corruption under load with subsequent divergent manifestations. The real question is whether the bug is in the kernel or the NIC firmware. There were similar reports online of this behavior affecting Vista, however those were resolved by moving to later firmware. I've recreated the crash on the later firmware however. Summary of conditions that appear to be required for the lockup/oops/panic(s): 1. Running at 1Gb. 2. High traffic into the system (out seems ok). 3. For me, the crash only occurs when smbd is the application generating the traffic. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 145+ messages in thread
end of thread, other threads:[~2010-01-21 19:49 UTC | newest] Thread overview: 145+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-12-21 23:52 sky2 panic in 2.6.32.1 under load Berck E. Nash 2009-12-22 0:09 ` Michael Breuer 2009-12-22 18:50 ` Michael Breuer 2009-12-23 22:54 ` sky2 panic in 2.6.32.1 under load (new oops) Michael Breuer 2009-12-24 7:01 ` Andrew Morton 2009-12-24 19:18 ` Michael Breuer 2009-12-24 22:27 ` Stephen Hemminger 2009-12-25 16:28 ` Michael Breuer 2009-12-25 23:22 ` Stephen Hemminger 2009-12-26 3:23 ` Michael Breuer 2009-12-26 17:57 ` Stephen Hemminger 2009-12-26 20:37 ` Michael Breuer 2009-12-26 22:05 ` [PATCH] sky2: make sure ethernet header is in transmit skb Stephen Hemminger 2009-12-27 3:44 ` David Miller 2009-12-27 4:11 ` David Miller 2010-01-04 5:32 ` David Miller 2010-01-04 16:40 ` Stephen Hemminger 2010-01-04 17:02 ` Michael Breuer 2010-01-05 23:07 ` [PATCH] af_packet: Don't use skb after dev_queue_xmit() Jarek Poplawski 2010-01-05 23:16 ` Michael Breuer 2010-01-05 23:29 ` Jarek Poplawski 2010-01-06 2:36 ` Michael Breuer 2010-01-06 7:22 ` Jarek Poplawski 2010-01-06 9:15 ` [PATCH alt.2] " Jarek Poplawski 2010-01-06 14:49 ` Stephen Hemminger 2010-01-06 19:40 ` Jarek Poplawski 2010-01-06 19:49 ` [PATCH] " Michael Breuer 2010-01-06 20:22 ` Jarek Poplawski 2010-01-06 20:33 ` Michael Breuer 2010-01-06 21:09 ` Jarek Poplawski 2010-01-06 21:32 ` Michael Breuer 2010-01-06 21:10 ` Stephen Hemminger 2010-01-06 21:20 ` Michael Breuer 2010-01-06 23:26 ` Michael Breuer 2010-01-07 2:42 ` Michael Breuer 2010-01-07 4:00 ` Michael Breuer 2010-01-07 4:53 ` Stephen Hemminger 2010-01-07 5:10 ` Michael Breuer 2010-01-07 5:32 ` Michael Breuer 2010-01-07 5:54 ` Michael Breuer 2010-01-07 7:20 ` Michael Breuer 2010-01-07 7:47 ` Jarek Poplawski 2010-01-07 7:55 ` Michael Breuer 2010-01-07 8:21 ` Jarek Poplawski 2010-01-07 15:03 ` Michael Breuer 2010-01-07 17:56 ` Jarek Poplawski 2010-01-07 18:17 ` Jarek Poplawski 2010-01-07 15:05 ` Michael Breuer 2010-01-07 18:01 ` Jarek Poplawski 2010-01-07 18:19 ` Michael Breuer 2010-01-07 18:35 ` Jarek Poplawski 2010-01-07 18:40 ` Michael Breuer 2010-01-07 18:43 ` Michael Breuer 2010-01-07 18:50 ` Jarek Poplawski 2010-01-07 19:36 ` Jarek Poplawski 2010-01-07 19:55 ` Michael Breuer 2010-01-07 20:22 ` Jarek Poplawski 2010-01-07 23:11 ` Michael Breuer 2010-01-08 7:45 ` Jarek Poplawski 2010-01-08 16:40 ` Michael Breuer 2010-01-08 21:29 ` Jarek Poplawski 2010-01-08 21:48 ` Michael Breuer 2010-01-08 22:02 ` Jarek Poplawski 2010-01-09 4:45 ` Michael Breuer 2010-01-09 5:44 ` Michael Breuer 2010-01-09 12:28 ` Jarek Poplawski 2010-01-09 18:34 ` Michael Breuer 2010-01-13 20:39 ` Michael Breuer 2010-01-13 21:09 ` Jarek Poplawski 2010-01-13 21:16 ` Michael Breuer 2010-01-13 21:34 ` Jarek Poplawski 2010-01-17 16:26 ` Michael Breuer 2010-01-17 22:17 ` Jarek Poplawski 2010-01-17 22:34 ` Michael Breuer 2010-01-17 23:05 ` Jarek Poplawski 2010-01-17 23:15 ` Michael Breuer 2010-01-18 7:30 ` Jarek Poplawski 2010-01-18 16:29 ` Michael Breuer 2010-01-18 20:46 ` Jarek Poplawski 2010-01-18 20:56 ` Michael Breuer 2010-01-18 21:00 ` Stephen Hemminger 2010-01-18 21:06 ` Jarek Poplawski 2010-01-18 21:24 ` Michael Breuer 2010-01-18 21:50 ` Jarek Poplawski 2010-01-18 21:25 ` Jarek Poplawski 2010-01-18 21:39 ` Michael Breuer 2010-01-18 22:08 ` Jarek Poplawski 2010-01-18 22:17 ` Jarek Poplawski 2010-01-18 22:47 ` Michael Breuer 2010-01-19 5:46 ` Michael Breuer 2010-01-19 8:41 ` Jarek Poplawski 2010-01-19 15:28 ` Michael Breuer 2010-01-21 19:48 ` Michael Breuer 2010-01-19 10:47 ` Jarek Poplawski 2010-01-19 15:47 ` Michael Breuer 2010-01-19 19:59 ` Jarek Poplawski 2010-01-19 20:06 ` Michael Breuer 2010-01-19 20:29 ` Jarek Poplawski 2010-01-19 22:45 ` Jarek Poplawski 2010-01-20 1:01 ` Michael Breuer 2010-01-20 1:10 ` Stephen Hemminger 2010-01-21 16:14 ` Stefan Richter 2010-01-21 16:50 ` Stefan Richter 2010-01-18 22:25 ` Michael Breuer 2010-01-18 22:40 ` Jarek Poplawski 2009-12-27 17:03 ` sky2 panic in 2.6.32.1 under load (new oops) Michael Breuer 2009-12-27 18:22 ` Stephen Hemminger 2009-12-27 19:39 ` Michael Breuer 2009-12-29 17:30 ` Stephen Hemminger 2009-12-29 17:39 ` Michael Breuer 2009-12-29 18:38 ` Michael Breuer 2009-12-29 18:54 ` Michael Breuer 2009-12-29 19:49 ` Stephen Hemminger 2009-12-29 20:41 ` Michael Breuer 2009-12-30 7:23 ` Michael Breuer 2009-12-30 7:58 ` Stephen Hemminger 2009-12-30 17:49 ` Michael Breuer 2009-12-30 19:15 ` audit.c skb - tty race condition - was " Michael Breuer 2009-12-30 20:44 ` Michael Breuer 2009-12-30 21:15 ` Michael Breuer 2009-12-30 21:21 ` Michael Breuer 2009-12-30 7:59 ` Stephen Hemminger 2009-12-30 15:40 ` Michael Breuer 2009-12-30 18:10 ` Stephen Hemminger 2009-12-30 18:37 ` Michael Breuer 2009-12-31 18:09 ` Michael Breuer 2009-12-31 18:24 ` Stephen Hemminger 2010-01-01 17:42 ` Michael Breuer 2010-01-01 19:26 ` sky2 panic in 2.6.32.1 under load (tty NULL write) Michael Breuer 2010-01-01 20:34 ` Michael Breuer 2010-01-02 21:42 ` Michael Breuer 2009-12-29 19:15 ` sky2 panic in 2.6.32.1 under load (new oops) Jarek Poplawski 2009-12-29 19:20 ` Michael Breuer 2009-12-30 8:07 ` Stephen Hemminger 2009-12-30 15:36 ` Michael Breuer 2009-12-22 0:52 ` sky2 panic in 2.6.32.1 under load Daniel Hazelton 2009-12-24 6:58 ` Andrew Morton 2009-12-24 16:03 ` Berck Nash 2009-12-24 16:28 ` Daniel Hazelton 2009-12-24 22:21 ` Stephen Hemminger 2009-12-24 22:42 ` Michael Breuer 2009-12-25 0:06 ` Daniel Hazelton 2009-12-24 16:10 ` Michael Breuer 2009-12-24 16:16 ` Berck Nash 2009-12-24 16:26 ` Michael Breuer
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).