linux-kernel.vger.kernel.org archive mirror
* 2.6.22-rc5: pdflush oops under heavy disk load
@ 2007-06-22  0:07 Jay L. T. Cornwall
  2007-06-22 14:47 ` Chuck Ebbert
  0 siblings, 1 reply; 24+ messages in thread
From: Jay L. T. Cornwall @ 2007-06-22  0:07 UTC (permalink / raw)
  To: linux-kernel

Hi,

Kernel version: 2.6.22-rc5 (confirmed also on 2.6.20)
Kernel config : Ubuntu 7.04 default (SMP)

Relevant hardware:
  Asus P5K (Intel P35 chipset)
  Core 2 Duo E6600 2.4GHz
  Western Digital 10KRPM 150GB HDD on JMicron 20360/20363 AHCI

Netconsoled dump:

[  724.350222] general protection fault: 0000 [1] SMP
[  724.350413] CPU 1
[  724.350520] Modules linked in: usb_storage libusual netconsole
binfmt_misc rfcomm l2cap bluetooth ppdev capability commoncap
acpi_cpufreq cpufreq_stats cpufreq_userspace cpufreq_ondemand
cpufreq_conservative cpufreq_powersave freq_table video container
battery dock asus_acpi ac sbs button af_packet nls_utf8 ntfs w83627ehf
i2c_isa parport_pc lp parport fuse mt2060 snd_hda_intel snd_pcm_oss
snd_mixer_oss snd_pcm cx22702 snd_seq_dummy snd_seq_oss dvb_usb_dib0700
dib7000m dib7000p dvb_usb cx88_dvb cx88_vp3054_i2c snd_seq_midi
snd_rawmidi video_buf_dvb dvb_core ipv6 snd_seq_midi_event snd_seq
snd_timer dvb_pll cx8800 cx8802 cx88xx sr_mod ir_common snd_seq_device
cdrom i2c_algo_bit dib3000mc dibx000_common tveeprom atl1 usbhid psmouse
videodev compat_ioctl32 hid mii i2c_core v4l2_common v4l1_compat
btcx_risc video_buf serio_raw snd soundcore pcspkr shpchp pci_hotplug
snd_page_alloc intel_agp tsdev evdev ext3 jbd mbcache sg sd_mod
pata_jmicron ata_generic ata_piix ahci libata scsi_mod ehci_hcd generic
uhci_hcd usbcore thermal processor fan
[  724.355028] Pid: 199, comm: pdflush Not tainted 2.6.22-rc5-edge #1
[  724.355125] RIP: 0010:[<ffffffff880f1b44>]  [<ffffffff880f1b44>]
:ext3:walk_page_buffers+0x34/0x90
[  724.355305] RSP: 0018:ffff8101322e7bb0  EFLAGS: 00010202
[  724.355394] RAX: 0000000000000000 RBX: 000000009d8145bd RCX:
0000000000001000
[  724.355491] RDX: 000000009d8145bd RSI: 908553557cc5eb6f RDI:
ffff81012e1052a0
[  724.355587] RBP: 000000003b028b7a R08: 0000000000000000 R09:
ffffffff880f1ba0
[  724.355684] R10: 0000000000000000 R11: 0000000000000001 R12:
000000009d8145bd
[  724.355780] R13: 908553557cc5eb6f R14: ffff8100369a5200 R15:
0000000000000000
[  724.357278] FS:  0000000000000000(0000) GS:ffff81013b07cac0(0000)
knlGS:0000000000000000
[  724.357410] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  724.357501] CR2: 00002b776e178000 CR3: 000000013a245000 CR4:
00000000000006e0
[  724.357598] Process pdflush (pid: 199, threadinfo ffff8101322e6000,
task ffff81013b15aaa0)
[  724.357730] Stack:  ffffffff880f1ba0 0000000000001000
ffff81012e1052a0 ffff81013de27c38
[  724.358031]  ffff81012e1052a0 000000002e1052a0 ffff8100369a5200
ffff8101322e7e50
[  724.358292]  000000000000000e ffffffff880f4fca ffff81012e545b08
0000000000000003
[  724.358489] Call Trace:
[  724.358638]  [<ffffffff880f1ba0>] :ext3:bget_one+0x0/0x10
[  724.358742]  [<ffffffff880f4fca>] :ext3:ext3_ordered_writepage+0xea/0x190
[  724.358846]  [<ffffffff8027413a>] __writepage+0xa/0x30
[  724.358937]  [<ffffffff80274744>] write_cache_pages+0x224/0x350
[  724.359030]  [<ffffffff80274130>] __writepage+0x0/0x30
[  724.359147]  [<ffffffff802748cb>] do_writepages+0x2b/0x40
[  724.359239]  [<ffffffff802b8046>] __writeback_single_inode+0xa6/0x3e0
[  724.359348]  [<ffffffff802b8796>] sync_sb_inodes+0x1f6/0x2f0
[  724.359445]  [<ffffffff802b8d2f>] writeback_inodes+0xbf/0x100
[  724.359542]  [<ffffffff80274de9>] background_writeout+0xa9/0xe0
[  724.359648]  [<ffffffff802752f0>] pdflush+0x0/0x220
[  724.359739]  [<ffffffff80275430>] pdflush+0x140/0x220
[  724.359829]  [<ffffffff80274d40>] background_writeout+0x0/0xe0
[  724.359927]  [<ffffffff8024ac7b>] kthread+0x4b/0x80
[  724.360018]  [<ffffffff8020aca8>] child_rip+0xa/0x12
[  724.360120]  [<ffffffff8024ac30>] kthread+0x0/0x80
[  724.360208]  [<ffffffff8020ac9e>] child_rip+0x0/0x12
[  724.360298]
[  724.360369]
[  724.360370] Code: 4c 8b 6e 08 41 8d 1c 14 76 39 89 d8 44 29 e0 3b 44
24 08 73
[  724.361260] RIP  [<ffffffff880f1b44>] :ext3:walk_page_buffers+0x34/0x90
[  724.361395]  RSP <ffff8101322e7bb0>

The system runs stably under light load. Heavy disk writes, here induced
by 200 Mbit/s scp transfers onto the drive, cause the oops within a minute or two.
It's entirely reproducible and appears to give the same trace each time.

I'll have a go at digging up the root of this problem, but anyone with
more experience is welcome to pitch in!

-- 
Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/
PhD Student
Imperial College London


* Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-22  0:07 2.6.22-rc5: pdflush oops under heavy disk load Jay L. T. Cornwall
@ 2007-06-22 14:47 ` Chuck Ebbert
  2007-06-22 15:04   ` Jay L. T. Cornwall
  2007-06-24 22:51   ` 2.6.22-rc5: pdflush oops under heavy disk load Jesper Juhl
  0 siblings, 2 replies; 24+ messages in thread
From: Chuck Ebbert @ 2007-06-22 14:47 UTC (permalink / raw)
  To: Jay L. T. Cornwall; +Cc: linux-kernel

On 06/21/2007 08:07 PM, Jay L. T. Cornwall wrote:
> Hi,
> 
> Kernel version: 2.6.22-rc5 (confirmed also on 2.6.20)
> Kernel config : Ubuntu 7.04 default (SMP)
> 
> Relevant hardware:
>   Asus P5K (Intel P35 chipset)
>   Core 2 Duo E6600 2.4GHz
>   Western Digital 10KRPM 150GB HDD on JMicron 20360/20363 AHCI
> 
> Netconsoled dump:
> 
> [  724.350222] general protection fault: 0000 [1] SMP
> [  724.350413] CPU 1
> [  724.350520] Modules linked in: usb_storage libusual netconsole
> binfmt_misc rfcomm l2cap bluetooth ppdev capability commoncap
> acpi_cpufreq cpufreq_stats cpufreq_userspace cpufreq_ondemand
> cpufreq_conservative cpufreq_powersave freq_table video container
> battery dock asus_acpi ac sbs button af_packet nls_utf8 ntfs w83627ehf
> i2c_isa parport_pc lp parport fuse mt2060 snd_hda_intel snd_pcm_oss
> snd_mixer_oss snd_pcm cx22702 snd_seq_dummy snd_seq_oss dvb_usb_dib0700
> dib7000m dib7000p dvb_usb cx88_dvb cx88_vp3054_i2c snd_seq_midi
> snd_rawmidi video_buf_dvb dvb_core ipv6 snd_seq_midi_event snd_seq
> snd_timer dvb_pll cx8800 cx8802 cx88xx sr_mod ir_common snd_seq_device
> cdrom i2c_algo_bit dib3000mc dibx000_common tveeprom atl1 usbhid psmouse
> videodev compat_ioctl32 hid mii i2c_core v4l2_common v4l1_compat
> btcx_risc video_buf serio_raw snd soundcore pcspkr shpchp pci_hotplug
> snd_page_alloc intel_agp tsdev evdev ext3 jbd mbcache sg sd_mod
> pata_jmicron ata_generic ata_piix ahci libata scsi_mod ehci_hcd generic
> uhci_hcd usbcore thermal processor fan
> [  724.355028] Pid: 199, comm: pdflush Not tainted 2.6.22-rc5-edge #1
> [  724.355125] RIP: 0010:[<ffffffff880f1b44>]  [<ffffffff880f1b44>]
> :ext3:walk_page_buffers+0x34/0x90

Step 1: run fsck on the filesystem.



* Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-22 14:47 ` Chuck Ebbert
@ 2007-06-22 15:04   ` Jay L. T. Cornwall
  2007-06-23 12:14     ` Jay L. T. Cornwall
  2007-06-24 22:51   ` 2.6.22-rc5: pdflush oops under heavy disk load Jesper Juhl
  1 sibling, 1 reply; 24+ messages in thread
From: Jay L. T. Cornwall @ 2007-06-22 15:04 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel

Chuck Ebbert wrote:

> On 06/21/2007 08:07 PM, Jay L. T. Cornwall wrote:
>> [  724.350222] general protection fault: 0000 [1] SMP
>> [  724.350413] CPU 1
>> <snip>
>> [  724.355028] Pid: 199, comm: pdflush Not tainted 2.6.22-rc5-edge #1
>> [  724.355125] RIP: 0010:[<ffffffff880f1b44>]  [<ffffffff880f1b44>]
>> :ext3:walk_page_buffers+0x34/0x90

> Step 1: run fsck on the filesystem.

Already done. The filesystem came back as clean after the first oops,
but I forced a recheck with fsck to be safe - it found no problems.

This is reproducible on a clean filesystem.

-- 
Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/
PhD Student
Imperial College London


* Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-22 15:04   ` Jay L. T. Cornwall
@ 2007-06-23 12:14     ` Jay L. T. Cornwall
  2007-06-23 17:23       ` Andrew Morton
  2007-06-24 17:59       ` Jay Cliburn
  0 siblings, 2 replies; 24+ messages in thread
From: Jay L. T. Cornwall @ 2007-06-23 12:14 UTC (permalink / raw)
  To: linux-kernel

Jay L. T. Cornwall wrote:

> Already done. The filesystem came back as clean after the first oops,
> but I forced a recheck with fsck to be safe - it found no problems.
> 
> This is reproducible on a clean filesystem.

Following up on this, I've now extracted another oops (at the bottom of
this mail).

The common factor here seems to be the buffer_head circular list leading
to invalid pointers in bh->b_this_page.
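
As an aside on why a bad bh->b_this_page would show up as a general
protection fault rather than an ordinary page fault: on x86-64 with 48-bit
virtual addresses, dereferencing a non-canonical address (bits 63..47 not
all equal) raises #GP. The RSI value in the first trace, 908553557cc5eb6f,
is non-canonical, which would fit a trashed ring link. What follows is a
minimal user-space sketch, not kernel code: struct fake_bh, is_canonical48()
and walk_ring() are made-up names for illustration, and only the ring link
of the real struct buffer_head is modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if addr is canonical for 48-bit x86-64 virtual addressing,
 * i.e. bits 63..47 are either all zeros or all ones. */
static bool is_canonical48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the upper 17 bits */

	return top == 0 || top == 0x1ffff;
}

/* Hypothetical stand-in for struct buffer_head: only the ring link. */
struct fake_bh {
	struct fake_bh *b_this_page;
};

/* Walk the per-page ring starting at head, checking each link before it
 * is dereferenced. Returns the number of buffers visited, or -1 if a link
 * is NULL or non-canonical, or if the ring does not close within max steps. */
static int walk_ring(const struct fake_bh *head, int max)
{
	const struct fake_bh *bh = head;
	int n = 0;

	do {
		if (bh == NULL || !is_canonical48((uint64_t)(uintptr_t)bh))
			return -1;
		n++;
		bh = bh->b_this_page;
	} while (bh != head && n < max);

	return bh == head ? n : -1;
}
```

The kernel itself does no such checking on this path, so a corrupted link
is simply followed and faults, which is what both traces look like.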

I'm beginning to suspect the Attansic L1 Gigabit Ethernet driver (marked
as EXPERIMENTAL in 2.6.22-rc5). I can't reproduce these panics on
disk-to-disk copies or SCP across the localhost interface. However, SCP
from a server onto either of two different HDDs hits these oopses fairly
quickly.

Is it even possible for the Ethernet driver to corrupt ext3 data
structures, short of trashing memory?

[  628.135241] general protection fault: 0000 [1] SMP
[  628.135422] CPU 1
[  628.135522] Modules linked in: usb_storage libusual netconsole
binfmt_misc rfcomm l2cap bluetooth ppdev capability commoncap
acpi_cpufreq cpufreq_stats cpufreq_userspace cpufreq_ondemand
cpufreq_conservative cpufreq_powersave freq_table video container
battery dock asus_acpi ac sbs button af_packet ipv6 nls_utf8 ntfs
w83627ehf i2c_isa parport_pc lp parport fuse snd_hda_intel snd_pcm_oss
snd_mixer_oss mt2060 snd_pcm snd_seq_dummy cx22702 snd_seq_oss cx88_dvb
cx88_vp3054_i2c video_buf_dvb snd_seq_midi snd_rawmidi
snd_seq_midi_event snd_seq cx8800 nls_cp437 dvb_usb_dib0700 dib7000m
dib7000p dvb_usb cx8802 cx88xx dvb_core cifs ir_common dvb_pll
i2c_algo_bit dib3000mc nvidia(P) dibx000_common snd_timer tveeprom atl1
compat_ioctl32 i2c_core videodev mii psmouse snd_seq_device v4l1_compat
video_buf v4l2_common btcx_risc pcspkr shpchp snd soundcore
snd_page_alloc intel_agp pci_hotplug serio_raw tsdev evdev sr_mod cdrom
ext3 jbd mbcache sg sd_mod pata_jmicron usbhid hid ata_generic ata_piix
ahci libata scsi_mod generic ehci_hcd uhci_hcd usbcore thermal processor fan
[  628.139866] Pid: 201, comm: kswapd0 Tainted: P       2.6.22-rc5-edge #1
[  628.139952] RIP: 0010:[<ffffffff802929ee>]  [<ffffffff802929ee>]
free_block+0x10e/0x160
[  628.140108] RSP: 0018:ffff8101322ebaf0  EFLAGS: 00010046
[  628.140190] RAX: 3eac08c8be1ff284 RBX: ffff810039616f68 RCX:
ffff810127524c00
[  628.140278] RDX: bc10c1d4beae1915 RSI: ffff810039616000 RDI:
00000000dc050844
[  628.140366] RBP: ffff8101322ebb40 R08: ffff81013b07cc07 R09:
00000000ffffffe9
[  628.140455] R10: ffff81013c68e1c0 R11: ffffffff88108740 R12:
ffff81013b81b800
[  628.140542] R13: 0000000000000001 R14: 0000000000000001 R15:
0000000000000640
[  628.140630] FS:  0000000000000000(0000) GS:ffff81013b07cac0(0000)
knlGS:0000000000000000
[  628.140750] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  628.140832] CR2: 00002ae5699e0000 CR3: 00000001001c5000 CR4:
00000000000006e0
[  628.140921] Process kswapd0 (pid: 201, threadinfo ffff8101322ea000,
task ffff81013b15a3c0)
[  628.141039] Stack:  ffff81000000e380 ffff81013b078400
0000000000000640 0000000000000640
[  628.141320]  ffff81013b81b800 0000000000000246 ffff8101322ebd90
ffffffff80292846
[  628.141563]  ffff810039616f68 ffff810039616f68 ffff810039616f68
ffff81012ad8d598
[  628.141746] Call Trace:
[  628.141886]  [<ffffffff80292846>] kmem_cache_free+0x1b6/0x1d0
[  628.141979]  [<ffffffff802bbf00>] free_buffer_head+0x20/0x50
[  628.142063]  [<ffffffff802bc334>] try_to_free_buffers+0x64/0xa0
[  628.142153]  [<ffffffff8027824a>] shrink_inactive_list+0x82a/0x960
[  628.142274]  [<ffffffff802776f1>] shrink_active_list+0x421/0x4e0
[  628.142395]  [<ffffffff8027844b>] shrink_zone+0xcb/0x140
[  628.142484]  [<ffffffff8027909a>] kswapd+0x3ea/0x560
[  628.142578]  [<ffffffff8024b040>] autoremove_wake_function+0x0/0x30
[  628.142679]  [<ffffffff80278cb0>] kswapd+0x0/0x560
[  628.142762]  [<ffffffff8024ac7b>] kthread+0x4b/0x80
[  628.142846]  [<ffffffff8020aca8>] child_rip+0xa/0x12
[  628.142942]  [<ffffffff8024ac30>] kthread+0x0/0x80
[  628.143023]  [<ffffffff8020ac9e>] child_rip+0x0/0x12
[  628.143106]
[  628.143171]
[  628.143172] Code: 48 89 30 0f 85 44 ff ff ff 48 83 c4 08 5b 5d 41 5c
41 5d 41
[  628.144015] RIP  [<ffffffff802929ee>] free_block+0x10e/0x160
[  628.144130]  RSP <ffff8101322ebaf0>

-- 
Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/
PhD Student
Imperial College London


* Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-23 12:14     ` Jay L. T. Cornwall
@ 2007-06-23 17:23       ` Andrew Morton
  2007-06-24  9:57         ` (Last oops is Tainted: P) " Oleg Verych
  2007-06-24 17:59       ` Jay Cliburn
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2007-06-23 17:23 UTC (permalink / raw)
  To: Jay L. T. Cornwall; +Cc: linux-kernel

On Sat, 23 Jun 2007 13:14:40 +0100 "Jay L.  T.  Cornwall" <jay@esuna.co.uk>
wrote:

> Jay L. T. Cornwall wrote:
> 
> > Already done. The filesystem came back as clean after the first oops,
> > but I forced a recheck with fsck to be safe - it found no problems.
> > 
> > This is reproducible on a clean filesystem.
> 
> Following up on this, I've now extracted another oops (at the bottom of
> this mail).
> 
> The common factor here seems to be the buffer_head circular list leading
> to invalid pointers in bh->b_this_page.
> 
> I'm beginning to suspect the Attansic L1 Gigabit Ethernet driver (marked
> as EXPERIMENTAL in 2.6.22-rc5). I can't reproduce these panics on
> disk-to-disk copies or SCP across the localhost interface. However, SCP
> from a server onto either of two different HDDs hits these oopses fairly
> quickly.

That sounds like a good theory: you're getting easily-hit oopses in one of
the kernel's most-used codepaths which hasn't changed much in a long
time.  So Something Odd Has Happened.

> Is it even possible for the Ethernet driver to corrupt ext3 data
> structures, short of trashing memory?

I suppose so.

I'd suggest that you enable every kernel debugging feature you can get your
hands on (in the Kernel Hacking menu) and see if that turns anything up.

Failing that, if you can whack a different network card in that machine it
would help to confirm or deny your suspicion.



* (Last oops is Tainted: P) Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-23 17:23       ` Andrew Morton
@ 2007-06-24  9:57         ` Oleg Verych
  2007-06-24 10:10           ` Jay L. T. Cornwall
  0 siblings, 1 reply; 24+ messages in thread
From: Oleg Verych @ 2007-06-24  9:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jay L. T. Cornwall, Chuck Ebbert, linux-kernel

* From: Andrew Morton
* Newsgroups: linux.kernel
* Date: Sat, 23 Jun 2007 10:23:18 -0700
>
> On Sat, 23 Jun 2007 13:14:40 +0100 "Jay L.  T.  Cornwall" <jay@esuna.co.uk>
> wrote:
>
>> I'm beginning to suspect the Attansic L1 Gigabit Ethernet driver (marked
>> as EXPERIMENTAL in 2.6.22-rc5). I can't reproduce these panics on
>> disk-to-disk copies or SCP across the localhost interface. However, SCP
>> from a server onto either of two different HDDs hits these oopses fairly
>> quickly.
>
> That sounds like a good theory: you're getting easily-hit oopses in one of
> the kernel's most-used codepaths which hasn't changed much in a long
> time.  So Something Odd Has Happened.

Maybe this time it's just "Tainted: P"?

|-*- <467D0EB0.9030100@esuna.co.uk> -*-
...
i2c_algo_bit dib3000mc nvidia(P) dibx000_common snd_timer tveeprom atl1
compat_ioctl32 i2c_core videodev mii psmouse snd_seq_device v4l1_compat
video_buf v4l2_common btcx_risc pcspkr shpchp snd soundcore
snd_page_alloc intel_agp pci_hotplug serio_raw tsdev evdev sr_mod cdrom
ext3 jbd mbcache sg sd_mod pata_jmicron usbhid hid ata_generic ata_piix
ahci libata scsi_mod generic ehci_hcd uhci_hcd usbcore thermal
processor fan
[  628.139866] Pid: 201, comm: kswapd0 Tainted: P
...
|-*-

And this oops has no ext3 in it, unlike the previous one.

[ as you know we have no automatic noise tracking system, and ]
[ developers were not so productive in last discussion of it  ]

Jay, check your oops for the "Tainted: P" flag, which is not supported
here, and please do not drop people who assisted you from the CC list.
____


* Re: (Last oops is Tainted: P) Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-24  9:57         ` (Last oops is Tainted: P) " Oleg Verych
@ 2007-06-24 10:10           ` Jay L. T. Cornwall
  2007-06-24 10:57             ` Oleg Verych
  0 siblings, 1 reply; 24+ messages in thread
From: Jay L. T. Cornwall @ 2007-06-24 10:10 UTC (permalink / raw)
  To: Oleg Verych; +Cc: Andrew Morton, Chuck Ebbert, linux-kernel

Oleg Verych wrote:

>> That sounds like a good theory: you're getting easily-hit oopses in one of
>> the kernel's most-used codepaths which hasn't changed much in a long
>> time.  So Something Odd Has Happened.

> Maybe this time it's just "Tainted: P"?

That's the NVIDIA module, which isn't doing much with X shut down
regardless. It was bad form to forget this, of course, but is unrelated
to the problem.

> And this oops has no ext3 in it, unlike the previous one.

I know. This isn't ext3 related and I'm fairly certain drivers/net/atl1
is trashing... something. Perhaps the page table because:

[  153.785325] Bad page state in process 'scp'
[  153.785327] page:ffff81000308d020 flags:0x0040ad41dc050845
mapping:53dfe57d17cc59cf mapcount:16885953 count:292554304
[  153.785329] Trying to fix it up, but a reboot is needed

This one dismisses a reference counting issue because the page data here
looks like garbage. And a panic in VLC, playing a video across the
network hits a similar problem:

[ 9194.281809]  [<ffffffff802849e3>] page_remove_rmap+0x53/0x110
[ 9194.281819]  [<ffffffff8027c32c>] unmap_vmas+0x4ec/0x7c0
[ 9194.281852]  [<ffffffff802807ac>] unmap_region+0xcc/0x170
[ 9194.281867]  [<ffffffff8028160a>] do_munmap+0x22a/0x2f0
[ 9194.281877]  [<ffffffff80439ee2>] __down_write_nested+0x12/0xb0
[ 9194.281892]  [<ffffffff802ef936>] sys_shmdt+0xb6/0x150
[ 9194.281903]  [<ffffffff80209e8e>] system_call+0x7e/0x83
[ 9194.281921]
[ 9194.281924]
[ 9194.281925] Code: 48 2b ba 98 21 00 00 48 c1 ff 03 48 0f af f8 48 03
ba a8 21
[ 9194.281973] RIP  [<ffffffff80271f99>] page_to_pfn+0x19/0x40

> Jay, check your oops for the "Tainted: P" flag, which is not supported
> here, and please do not drop people who assisted you from the CC list.

My apologies, I had thought the etiquette was to only include
maintainers on the CC list.

I'll try and locate a maintainer for the Attansic driver a bit later,
but I've only seen people loosely related to it. In any case we may as
well let this thread die because it's not related to a filesystem bug
(which the CC list is presumably interested in).

-- 
Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/
PhD Student
Imperial College London


* Re: (Last oops is Tainted: P) Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-24 10:10           ` Jay L. T. Cornwall
@ 2007-06-24 10:57             ` Oleg Verych
  0 siblings, 0 replies; 24+ messages in thread
From: Oleg Verych @ 2007-06-24 10:57 UTC (permalink / raw)
  To: Jay L. T. Cornwall; +Cc: Andrew Morton, Chuck Ebbert, linux-kernel

On Sun, Jun 24, 2007 at 11:10:09AM +0100, Jay L. T. Cornwall wrote:
> Oleg Verych wrote:
> 
> >> That sounds like a good theory: you're getting easily-hit oopses in one of
> >> the kernel's most-used codepaths which hasn't changed much in a long
> >> time.  So Something Odd Has Happened.
> 
> > Maybe this time it's just "Tainted: P"?
> 
> That's the NVIDIA module, which isn't doing much with X shut down
> regardless. It was bad form to forget this, of course, but is unrelated
> to the problem.
> 
> > And this oops has no ext3 in it, unlike the previous one.
> 
> I know. This isn't ext3 related and I'm fairly certain drivers/net/atl1
> is trashing... something. Perhaps the page table because:

The last oops log was tainted (as the subject reflects); before that I
saw ext3 and the "run fsck" reply. Thus a really clean oops log with all
details, not the one you have just posted, may be useful.

> [  153.785325] Bad page state in process 'scp'
> [  153.785327] page:ffff81000308d020 flags:0x0040ad41dc050845
> mapping:53dfe57d17cc59cf mapcount:16885953 count:292554304
> [  153.785329] Trying to fix it up, but a reboot is needed
> 
> This one dismisses a reference counting issue because the page data here
> looks like garbage. And a panic in VLC, playing a video across the
> network hits a similar problem:

OK, I see you are in Windows now, but I would ask you to make a
test case using `netcat' or `curl'. If the hardware is in trouble,
stressing the network could probably trigger that. A single clean test
script, with no X (or other stuff) running, will surely help.

[ Netiquette here is being voluntary noise filter, after joining any   ]
[ thread, because reply-to-all is the way of communication in the LKML ]
____


* Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-23 12:14     ` Jay L. T. Cornwall
  2007-06-23 17:23       ` Andrew Morton
@ 2007-06-24 17:59       ` Jay Cliburn
  2007-06-24 20:31         ` Jay L. T. Cornwall
  1 sibling, 1 reply; 24+ messages in thread
From: Jay Cliburn @ 2007-06-24 17:59 UTC (permalink / raw)
  To: Jay L. T. Cornwall; +Cc: linux-kernel

On Sat, 23 Jun 2007 13:14:40 +0100
"Jay L. T. Cornwall" <jay@esuna.co.uk> wrote:


> The common factor here seems to be the buffer_head circular list
> leading to invalid pointers in bh->b_this_page.
> 
> I'm beginning to suspect the Attansic L1 Gigabit Ethernet driver
> (marked as EXPERIMENTAL in 2.6.22-rc5). I can't reproduce these
> panics on disk-to-disk copies or SCP across the localhost interface.
> However, SCP from a server onto either of two different HDDs hits
> these oopses fairly quickly.

How much RAM is installed in your machine?  If it's 4GB or more, does
your problem go away if you boot with mem=3000M?

Jay


* Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-24 17:59       ` Jay Cliburn
@ 2007-06-24 20:31         ` Jay L. T. Cornwall
  2007-06-24 21:45           ` Jay Cliburn
  0 siblings, 1 reply; 24+ messages in thread
From: Jay L. T. Cornwall @ 2007-06-24 20:31 UTC (permalink / raw)
  To: Jay Cliburn; +Cc: linux-kernel

Jay Cliburn wrote:

>> The common factor here seems to be the buffer_head circular list
>> leading to invalid pointers in bh->b_this_page.
>>
>> I'm beginning to suspect the Attansic L1 Gigabit Ethernet driver
>> (marked as EXPERIMENTAL in 2.6.22-rc5). I can't reproduce these
>> panics on disk-to-disk copies or SCP across the localhost interface.
>> However, SCP from a server onto either of two different HDDs hits
>> these oopses fairly quickly.

> How much RAM is installed in your machine?  If it's 4GB or more, does
> your problem go away if you boot with mem=3000M?

Intriguing. Yes, this machine has 4GB of RAM. If I boot with mem=3000M
the problem does indeed go away - I can't induce an oops even after
transferring tens of GB across the interface.

I'm not sure I follow why that would be the case, except that it relates
to pci_map_page behaviour. But I guess you have an inkling?

-- 
Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/
PhD Student
Imperial College London


* Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-24 20:31         ` Jay L. T. Cornwall
@ 2007-06-24 21:45           ` Jay Cliburn
  2007-06-25 12:16             ` Attansic L1 page corruption (was: 2.6.22-rc5: pdflush oops under heavy disk load) Jay L. T. Cornwall
  0 siblings, 1 reply; 24+ messages in thread
From: Jay Cliburn @ 2007-06-24 21:45 UTC (permalink / raw)
  To: Jay L. T. Cornwall; +Cc: linux-kernel, kronos.it, Chris Snook

On Sun, 24 Jun 2007 21:31:36 +0100
"Jay L. T. Cornwall" <jay@esuna.co.uk> wrote:

> Jay Cliburn wrote:
> 
> >> The common factor here seems to be the buffer_head circular list
> >> leading to invalid pointers in bh->b_this_page.
> >>
> >> I'm beginning to suspect the Attansic L1 Gigabit Ethernet driver
> >> (marked as EXPERIMENTAL in 2.6.22-rc5). I can't reproduce these
> >> panics on disk-to-disk copies or SCP across the localhost
> >> interface. However, SCP from a server onto either of two different
> >> HDDs hits these oopses fairly quickly.
> 
> > How much RAM is installed in your machine?  If it's 4GB or more,
> > does your problem go away if you boot with mem=3000M?
> 
> Intriguing. Yes, this machine has 4GB of RAM. If I boot with mem=3000M
> the problem does indeed go away - I can't induce an oops even after
> transferring tens of GB across the interface.
> 
> I'm not sure I follow why that would be the case, except that it
> relates to pci_map_page behaviour. But I guess you have an inkling?
> 

For reasons not yet clear to me, it appears the L1 driver has a bug or
the device itself has trouble with DMA in high memory.  This patch,
drafted by Luca Tettamanti, is being explored as a workaround.  I'd be
interested to know if it fixes your problem.
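
A rough way to see why mem=3000M could hide the problem: with that option
the kernel uses no RAM above roughly 3000MB, so every buffer sits below the
4GB boundary whatever mask the driver asked for. With all 4GB in use, part
of the RAM is presumably remapped above the 4GB boundary (an assumption
about this board), and a 64-bit mask allows the device to be handed such
addresses. The sketch below is user-space C, and dma_range_ok() is a
hypothetical helper, not a kernel function; the two mask values match what
the kernel defines as DMA_32BIT_MASK and DMA_64BIT_MASK.

```c
#include <stdbool.h>
#include <stdint.h>

#define MASK_32BIT 0x00000000ffffffffULL	/* value of DMA_32BIT_MASK */
#define MASK_64BIT 0xffffffffffffffffULL	/* value of DMA_64BIT_MASK */

/* True if every byte of [addr, addr + len) is addressable under mask.
 * Hypothetical helper for illustration only. */
static bool dma_range_ok(uint64_t addr, uint64_t len, uint64_t mask)
{
	uint64_t last;

	if (len == 0)
		return true;

	last = addr + len - 1;
	if (last < addr)	/* the range wrapped around */
		return false;

	return last <= mask;
}
```

For example, a 4096-byte buffer ending at the 3000MB mark passes the
32-bit check, while one starting at physical address 0x100000000 passes
only under the 64-bit mask.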

[Aside: For future reference, atl1-devel@lists.sourceforge.net is a
mailing list devoted to L1 driver development.]

Jay



diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 6862c11..a600601 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -2104,15 +2104,12 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 	if (err)
 		return err;
 
-	err = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
+	err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
 	if (err) {
-		err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
-		if (err) {
-			dev_err(&pdev->dev, "no usable DMA configuration\n");
-			goto err_dma;
-		}
-		pci_using_64 = false;
+		dev_err(&pdev->dev, "no usable DMA configuration\n");
+		goto err_dma;
 	}
+	pci_using_64 = false;
 	/* Mark all PCI regions associated with PCI device
 	 * pdev as being reserved by owner atl1_driver_name
 	 */



* Re: 2.6.22-rc5: pdflush oops under heavy disk load
  2007-06-22 14:47 ` Chuck Ebbert
  2007-06-22 15:04   ` Jay L. T. Cornwall
@ 2007-06-24 22:51   ` Jesper Juhl
  1 sibling, 0 replies; 24+ messages in thread
From: Jesper Juhl @ 2007-06-24 22:51 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Jay L. T. Cornwall, linux-kernel

On 22/06/07, Chuck Ebbert <cebbert@redhat.com> wrote:
[snip]
>
> Step 1: run fsck on the filesystem.
>
I agree that running fsck on the filesystem is a good idea, but still,
even a corrupt filesystem should never be able to cause an Oops. In
fact, nothing done from userspace should be able to cause an Oops.

-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html


* Attansic L1 page corruption (was: 2.6.22-rc5: pdflush oops under heavy disk load)
  2007-06-24 21:45           ` Jay Cliburn
@ 2007-06-25 12:16             ` Jay L. T. Cornwall
  2007-06-25 12:42               ` Attansic L1 page corruption Jay Cliburn
  2007-06-25 12:58               ` Attansic L1 page corruption (was: 2.6.22-rc5: pdflush oops under heavy disk load) Luca
  0 siblings, 2 replies; 24+ messages in thread
From: Jay L. T. Cornwall @ 2007-06-25 12:16 UTC (permalink / raw)
  To: Jay Cliburn; +Cc: linux-kernel, kronos.it, Chris Snook

Jay Cliburn wrote:

> For reasons not yet clear to me, it appears the L1 driver has a bug or
> the device itself has trouble with DMA in high memory.  This patch,
> drafted by Luca Tettamanti, is being explored as a workaround.  I'd be
> interested to know if it fixes your problem.

Yes, it certainly seems to. Now running with this patch and 4GB active,
I've transferred about 15GB with no problem so far. It usually oopses
after a GB or two.

I guess it's not an ideal solution, architecturally speaking, but it's a
good deal better than an unstable driver. If there's any other patches
you'd like me to test or traces to capture, I'm happy to help out.
Otherwise I'll run with this one for now since it does the job!

Thanks.

-- 
Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/
PhD Student
Imperial College London


* Re: Attansic L1 page corruption
  2007-06-25 12:16             ` Attansic L1 page corruption (was: 2.6.22-rc5: pdflush oops under heavy disk load) Jay L. T. Cornwall
@ 2007-06-25 12:42               ` Jay Cliburn
  2007-06-25 21:18                 ` [PATCH] atl1: disable 64bit DMA Luca Tettamanti
  2007-06-25 12:58               ` Attansic L1 page corruption (was: 2.6.22-rc5: pdflush oops under heavy disk load) Luca
  1 sibling, 1 reply; 24+ messages in thread
From: Jay Cliburn @ 2007-06-25 12:42 UTC (permalink / raw)
  To: Jay L. T. Cornwall; +Cc: linux-kernel, kronos.it, Chris Snook

Jay L. T. Cornwall wrote:
> Jay Cliburn wrote:
> 
>> For reasons not yet clear to me, it appears the L1 driver has a bug or
>> the device itself has trouble with DMA in high memory.  This patch,
>> drafted by Luca Tettamanti, is being explored as a workaround.  I'd be
>> interested to know if it fixes your problem.
> 
> Yes, it certainly seems to. Now running with this patch and 4GB active,
> I've transferred about 15GB with no problem so far. It usually oopses
> after a GB or two.
> 
> I guess it's not an ideal solution, architecturally speaking, but it's a
> good deal better than an unstable driver. If there's any other patches
> you'd like me to test or traces to capture, I'm happy to help out.
> Otherwise I'll run with this one for now since it does the job!

Okay Jay, thanks.

Luca, would you please submit your patch to Jeff Garzik and netdev?

Jay



* Re: Attansic L1 page corruption (was: 2.6.22-rc5: pdflush oops under heavy disk load)
  2007-06-25 12:16             ` Attansic L1 page corruption (was: 2.6.22-rc5: pdflush oops under heavy disk load) Jay L. T. Cornwall
  2007-06-25 12:42               ` Attansic L1 page corruption Jay Cliburn
@ 2007-06-25 12:58               ` Luca
  1 sibling, 0 replies; 24+ messages in thread
From: Luca @ 2007-06-25 12:58 UTC (permalink / raw)
  To: Jay L. T. Cornwall; +Cc: Jay Cliburn, linux-kernel, Chris Snook, huang xiong

On 6/25/07, Jay L. T. Cornwall <jay@esuna.co.uk> wrote:
> Jay Cliburn wrote:
>
> > For reasons not yet clear to me, it appears the L1 driver has a bug or
> > the device itself has trouble with DMA in high memory.  This patch,
> > drafted by Luca Tettamanti, is being explored as a workaround.  I'd be
> > interested to know if it fixes your problem.
>
> Yes, it certainly seems to. Now running with this patch and 4GB active,
> I've transferred about 15GB with no problem so far. It usually oopses
> after a GB or two.
>
> I guess it's not an ideal solution, architecturally speaking, but it's a
> good deal better than an unstable driver.

It may cause a "bounce" (i.e. data is copied to another buffer in
lower memory) when a skb is allocated in high memory. Furthermore - at
least on AMD systems - it should be possible to use the IOMMU to remap
the memory to a bus address < 4GB.
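
To illustrate what a "bounce" means here, a minimal user-space sketch
follows. It is only a model under stated assumptions: struct sim_buf and
map_for_device() are invented names, the phys field is a pretend physical
address, and in the kernel this work is done by the DMA mapping layer
rather than by the driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a buffer with a pretend physical address. */
struct sim_buf {
	uint64_t phys;
	unsigned char *data;
	size_t len;
};

/* Pick the buffer the device may read from. If src lies entirely under
 * mask it is used directly; otherwise its contents are copied ("bounced")
 * into low, which must itself lie under mask. Returns NULL if low cannot
 * hold the data or is not reachable under mask. */
static const struct sim_buf *map_for_device(const struct sim_buf *src,
					    struct sim_buf *low,
					    uint64_t mask, bool *bounced)
{
	*bounced = false;

	if (src->len == 0 || src->phys + src->len - 1 <= mask)
		return src;

	if (low->len < src->len || low->phys + src->len - 1 > mask)
		return NULL;

	memcpy(low->data, src->data, src->len);	/* the extra copy */
	*bounced = true;
	return low;
}
```

The extra memcpy() is the cost being referred to: it only happens for
buffers above the mask, so a machine with less than 4GB of RAM would never
pay it.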

Xiong, can you comment on this issue? To recap: users are seeing hard
locks when the L1 driver does DMA to/from a high memory area (physical
address > 4GB). Limiting DMA to the lower 4GB with:

pci_set_dma_mask(pdev, DMA_32BIT_MASK);

cures the issue. Does L1 have any known problem decoding 64-bit addresses?

Luca
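
[Editor's sketch, not part of the original message.] The effect of the
DMA mask described above can be illustrated in userspace. The names
below (SKETCH_DMA_32BIT_MASK, sketch_dma_reachable) are hypothetical
stand-ins, not kernel APIs; the sketch only assumes that a mask has the
form 2^n - 1 and that a buffer is reachable when its last byte fits
under the mask. A buffer that fails the check is the case where the DMA
layer would have to bounce the data through low memory or remap it via
an IOMMU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's 32-bit and 64-bit DMA masks. */
#define SKETCH_DMA_32BIT_MASK 0x00000000ffffffffULL
#define SKETCH_DMA_64BIT_MASK 0xffffffffffffffffULL

/*
 * Hypothetical helper: returns true if every byte of the buffer
 * [addr, addr + len) can be addressed by a device limited to `mask`.
 */
static bool sketch_dma_reachable(uint64_t addr, uint64_t len, uint64_t mask)
{
	if (len == 0)
		return true;
	/* Reject buffers whose end would wrap past the 64-bit range. */
	if (addr > UINT64_MAX - (len - 1))
		return false;
	/* The last byte of the buffer must still fit under the mask. */
	return (addr + len - 1) <= mask;
}
```

With the 32-bit mask, a 2048-byte buffer at 0x100000000 is not
reachable, while the same buffer is reachable under the 64-bit mask.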

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] atl1: disable 64bit DMA
  2007-06-25 12:42               ` Attansic L1 page corruption Jay Cliburn
@ 2007-06-25 21:18                 ` Luca Tettamanti
  2007-06-25 21:36                   ` Chris Snook
  2007-06-27  0:16                   ` Jay Cliburn
  0 siblings, 2 replies; 24+ messages in thread
From: Luca Tettamanti @ 2007-06-25 21:18 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jay Cliburn, Jay L. T. Cornwall, linux-kernel, Chris Snook, netdev

On Mon, Jun 25, 2007 at 07:42:44AM -0500, Jay Cliburn wrote:
> Jay L. T. Cornwall wrote:
> >Jay Cliburn wrote:
> >
> >>For reasons not yet clear to me, it appears the L1 driver has a bug or
> >>the device itself has trouble with DMA in high memory.  This patch,
> >>drafted by Luca Tettamanti, is being explored as a workaround.  I'd be
> >>interested to know if it fixes your problem.
> >
> >Yes, it certainly seems to. Now running with this patch and 4GB active,
> >I've transferred about 15GB with no problem so far. It usually oopses
> >after a GB or two.
> >
> >I guess it's not an ideal solution, architecturally speaking, but it's a
> >good deal better than an unstable driver. If there's any other patches
> >you'd like me to test or traces to capture, I'm happy to help out.
> >Otherwise I'll run with this one for now since it does the job!
> 
> Okay Jay, thanks.
> 
> Luca, would you please submit your patch to Jeff Garzik and netdev?

Hi Jeff,
a couple of users reported hard lockups when using L1 NICs on machines
with 4GB or more of RAM. We're still waiting for official confirmation from
the vendor, but it seems that L1 has problems doing DMA to/from high
memory (physical address above the 4GB limit). Passing 32bit DMA mask
cures the problem.

Signed-Off-By: Luca Tettamanti <kronos.it@gmail.com>

---
I think that the patch should be included in 2.6.22.

 drivers/net/atl1/atl1_main.c |   15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 6862c11..a730f15 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -2097,21 +2097,16 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 	struct net_device *netdev;
 	struct atl1_adapter *adapter;
 	static int cards_found = 0;
-	bool pci_using_64 = true;
 	int err;
 
 	err = pci_enable_device(pdev);
 	if (err)
 		return err;
 
-	err = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
+	err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
 	if (err) {
-		err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
-		if (err) {
-			dev_err(&pdev->dev, "no usable DMA configuration\n");
-			goto err_dma;
-		}
-		pci_using_64 = false;
+		dev_err(&pdev->dev, "no usable DMA configuration\n");
+		goto err_dma;
 	}
 	/* Mark all PCI regions associated with PCI device
 	 * pdev as being reserved by owner atl1_driver_name
@@ -2176,7 +2171,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 
 	netdev->ethtool_ops = &atl1_ethtool_ops;
 	adapter->bd_number = cards_found;
-	adapter->pci_using_64 = pci_using_64;
 
 	/* setup the private structure */
 	err = atl1_sw_init(adapter);
@@ -2193,9 +2187,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 	 */
 	/* netdev->features |= NETIF_F_TSO; */
 
-	if (pci_using_64)
-		netdev->features |= NETIF_F_HIGHDMA;
-
 	netdev->features |= NETIF_F_LLTX;
 
 	/*


Luca
-- 
I still haven't figured out whether my dog is male or female:
when it pees, it locks itself in the bathroom

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] atl1: disable 64bit DMA
  2007-06-25 21:18                 ` [PATCH] atl1: disable 64bit DMA Luca Tettamanti
@ 2007-06-25 21:36                   ` Chris Snook
  2007-06-25 21:51                     ` Jay L. T. Cornwall
  2007-06-27  0:16                   ` Jay Cliburn
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Snook @ 2007-06-25 21:36 UTC (permalink / raw)
  To: Luca Tettamanti
  Cc: Jeff Garzik, Jay Cliburn, Jay L. T. Cornwall, linux-kernel, netdev

Luca Tettamanti wrote:
> On Mon, Jun 25, 2007 at 07:42:44AM -0500, Jay Cliburn wrote:
>> Jay L. T. Cornwall wrote:
>>> Jay Cliburn wrote:
>>>
>>>> For reasons not yet clear to me, it appears the L1 driver has a bug or
>>>> the device itself has trouble with DMA in high memory.  This patch,
>>>> drafted by Luca Tettamanti, is being explored as a workaround.  I'd be
>>>> interested to know if it fixes your problem.
>>> Yes, it certainly seems to. Now running with this patch and 4GB active,
>>> I've transferred about 15GB with no problem so far. It usually oopses
>>> after a GB or two.
>>>
>>> I guess it's not an ideal solution, architecturally speaking, but it's a
>>> good deal better than an unstable driver. If there's any other patches
>>> you'd like me to test or traces to capture, I'm happy to help out.
>>> Otherwise I'll run with this one for now since it does the job!
>> Okay Jay, thanks.
>>
>> Luca, would you please submit your patch to Jeff Garzik and netdev?
> 
> Hi Jeff,
> a couple of users reported hard lockups when using L1 NICs on machines
> with 4GB or more of RAM. We're still waiting for official confirmation from
> the vendor, but it seems that L1 has problems doing DMA to/from high
> memory (physical address above the 4GB limit). Passing 32bit DMA mask
> cures the problem.
> 
> Signed-Off-By: Luca Tettamanti <kronos.it@gmail.com>
> 
> ---
> I think that the patch should be included in 2.6.22.
> 
>  drivers/net/atl1/atl1_main.c |   15 +++------------
>  1 file changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
> index 6862c11..a730f15 100644
> --- a/drivers/net/atl1/atl1_main.c
> +++ b/drivers/net/atl1/atl1_main.c
> @@ -2097,21 +2097,16 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
>  	struct net_device *netdev;
>  	struct atl1_adapter *adapter;
>  	static int cards_found = 0;
> -	bool pci_using_64 = true;
>  	int err;
>  
>  	err = pci_enable_device(pdev);
>  	if (err)
>  		return err;
>  
> -	err = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
> +	err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
>  	if (err) {
> -		err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
> -		if (err) {
> -			dev_err(&pdev->dev, "no usable DMA configuration\n");
> -			goto err_dma;
> -		}
> -		pci_using_64 = false;
> +		dev_err(&pdev->dev, "no usable DMA configuration\n");
> +		goto err_dma;
>  	}
>  	/* Mark all PCI regions associated with PCI device
>  	 * pdev as being reserved by owner atl1_driver_name
> @@ -2176,7 +2171,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
>  
>  	netdev->ethtool_ops = &atl1_ethtool_ops;
>  	adapter->bd_number = cards_found;
> -	adapter->pci_using_64 = pci_using_64;
>  
>  	/* setup the private structure */
>  	err = atl1_sw_init(adapter);
> @@ -2193,9 +2187,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
>  	 */
>  	/* netdev->features |= NETIF_F_TSO; */
>  
> -	if (pci_using_64)
> -		netdev->features |= NETIF_F_HIGHDMA;
> -
>  	netdev->features |= NETIF_F_LLTX;
>  
>  	/*
> 
> 
> Luca

What boards have we seen this on?  It's quite possible this is:

a) an iommu-related problem specific to AMD or specific to Intel
b) a BIOS problem that atl1 happens to be a victim of

I'd rather not disable this unconditionally if we can get more 
information about why it's breaking.  Doing so might just end up 
covering up the most obvious manifestation of a larger problem.

	-- Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] atl1: disable 64bit DMA
  2007-06-25 21:36                   ` Chris Snook
@ 2007-06-25 21:51                     ` Jay L. T. Cornwall
  2007-06-25 21:57                       ` Chris Snook
  0 siblings, 1 reply; 24+ messages in thread
From: Jay L. T. Cornwall @ 2007-06-25 21:51 UTC (permalink / raw)
  To: Chris Snook
  Cc: Luca Tettamanti, Jeff Garzik, Jay Cliburn, linux-kernel, netdev

Chris Snook wrote:

> What boards have we seen this on?  It's quite possible this is:

I can reproduce on an Asus P5K with a Core 2 Duo E6600.

lspci identifies the controller as:
  02:00.0 Ethernet controller: Attansic Technology Corp. L1 Gigabit
  Ethernet Adapter (rev b0)

dmesg notes the PCI-DMA mapping implementation:
  PCI-DMA: Using software bounce buffering for IO (SWIOTLB)

-- 
Jay L. T. Cornwall, http://www.esuna.co.uk/~jay/
PhD Student
Imperial College London

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] atl1: disable 64bit DMA
  2007-06-25 21:51                     ` Jay L. T. Cornwall
@ 2007-06-25 21:57                       ` Chris Snook
  2007-06-25 23:00                         ` Jay Cliburn
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Snook @ 2007-06-25 21:57 UTC (permalink / raw)
  To: Jay L. T. Cornwall
  Cc: Luca Tettamanti, Jeff Garzik, Jay Cliburn, linux-kernel, netdev

Jay L. T. Cornwall wrote:
> Chris Snook wrote:
> 
>> What boards have we seen this on?  It's quite possible this is:
> 
> I can reproduce on an Asus P5K with a Core 2 Duo E6600.
> 
> lspci identifies the controller as:
>   02:00.0 Ethernet controller: Attansic Technology Corp. L1 Gigabit
>   Ethernet Adapter (rev b0)
> 
> dmesg notes the PCI-DMA mapping implementation:
>   PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> 

I had a hunch this was on Intel.  I'd rather just disable this when swiotlb is 
in use, unless we get more complaints.  It's probably ultimately a BIOS quirk 
anyway.

	-- Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] atl1: disable 64bit DMA
  2007-06-25 21:57                       ` Chris Snook
@ 2007-06-25 23:00                         ` Jay Cliburn
  2007-06-25 23:17                           ` Jeff Garzik
  0 siblings, 1 reply; 24+ messages in thread
From: Jay Cliburn @ 2007-06-25 23:00 UTC (permalink / raw)
  To: Chris Snook
  Cc: Jay L. T. Cornwall, Luca Tettamanti, Jeff Garzik, linux-kernel, netdev

On Mon, 25 Jun 2007 17:57:20 -0400
Chris Snook <csnook@redhat.com> wrote:

> Jay L. T. Cornwall wrote:
> > Chris Snook wrote:
> > 
> >> What boards have we seen this on?  It's quite possible this is:
> > 
> > I can reproduce on an Asus P5K with a Core 2 Duo E6600.
> > 
> > lspci identifies the controller as:
> >   02:00.0 Ethernet controller: Attansic Technology Corp. L1 Gigabit
> >   Ethernet Adapter (rev b0)
> > 
> > dmesg notes the PCI-DMA mapping implementation:
> >   PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> > 
> 
> I had a hunch this was on Intel.  I'd rather just disable this when
> swiotlb is in use, unless we get more complaints.  It's probably
> ultimately a BIOS quirk anyway.

So far we have reports from both camps:

Asus M2N8-VMX (AM2):	1 report of lockup
http://sourceforge.net/mailarchive/forum.php?thread_name=46780384.063603.26165%40m12-15.163.com&forum_name=atl1-devel

Asus P5K (LGA775):	2 reports of lockups
http://sourceforge.net/mailarchive/forum.php?thread_name=467E7E34.4010603%40gmail.com&forum_name=atl1-devel
http://lkml.org/lkml/2007/6/25/107

The common denominator in these reports is 4GB RAM.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] atl1: disable 64bit DMA
  2007-06-25 23:00                         ` Jay Cliburn
@ 2007-06-25 23:17                           ` Jeff Garzik
  2007-06-25 23:40                             ` Chris Snook
  2007-06-26 21:12                             ` Luca
  0 siblings, 2 replies; 24+ messages in thread
From: Jeff Garzik @ 2007-06-25 23:17 UTC (permalink / raw)
  To: Jay Cliburn
  Cc: Chris Snook, Jay L. T. Cornwall, Luca Tettamanti, linux-kernel, netdev

Jay Cliburn wrote:
> On Mon, 25 Jun 2007 17:57:20 -0400
> Chris Snook <csnook@redhat.com> wrote:
> 
>> Jay L. T. Cornwall wrote:
>>> Chris Snook wrote:
>>>
>>>> What boards have we seen this on?  It's quite possible this is:
>>> I can reproduce on an Asus P5K with a Core 2 Duo E6600.
>>>
>>> lspci identifies the controller as:
>>>   02:00.0 Ethernet controller: Attansic Technology Corp. L1 Gigabit
>>>   Ethernet Adapter (rev b0)
>>>
>>> dmesg notes the PCI-DMA mapping implementation:
>>>   PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>>>
>> I had a hunch this was on Intel.  I'd rather just disable this when
>> swiotlb is in use, unless we get more complaints.  It's probably
>> ultimately a BIOS quirk anyway.
> 
> So far we have reports from both camps:
> 
> Asus M2N8-VMX (AM2):	1 report of lockup
> http://sourceforge.net/mailarchive/forum.php?thread_name=46780384.063603.26165%40m12-15.163.com&forum_name=atl1-devel
> 
> Asus P5K (LGA775):	2 reports of lockups
> http://sourceforge.net/mailarchive/forum.php?thread_name=467E7E34.4010603%40gmail.com&forum_name=atl1-devel
> http://lkml.org/lkml/2007/6/25/107
> 
> The common denominator in these reports is 4GB RAM.

Although it's possible this device doesn't really support 64-bit, it's 
more likely that this is a platform problem of some sort, or a driver 
bug of some sort.  In the driver, maybe it has a problem when you 
-cross- a 4GB boundary, which is not uncommon.

	Jeff
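
[Editor's sketch, not part of the original message.] The "crossing a
4GB boundary" case mentioned above can be checked with simple
arithmetic: a buffer crosses when its first and last bytes have
different upper 32 address bits. The helper name sketch_crosses_4gb is
hypothetical, and the sketch assumes addr + len does not wrap past the
64-bit range. Whether the L1 hardware or the atl1 driver is actually
affected by such a crossing is not established in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if the buffer [addr, addr + len)
 * straddles a 4GB boundary, i.e. the upper 32 bits of the address
 * change somewhere inside the buffer.
 * Assumes addr + len - 1 does not overflow 64 bits.
 */
static bool sketch_crosses_4gb(uint64_t addr, uint64_t len)
{
	if (len == 0)
		return false;
	/* Compare the upper 32 bits of the first and last byte. */
	return (addr >> 32) != ((addr + len - 1) >> 32);
}
```

A 4096-byte buffer at 0xfffff800 crosses the boundary, whereas one at
0xfffff000 ends exactly at 0xffffffff and does not.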




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] atl1: disable 64bit DMA
  2007-06-25 23:17                           ` Jeff Garzik
@ 2007-06-25 23:40                             ` Chris Snook
  2007-06-26 21:12                             ` Luca
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Snook @ 2007-06-25 23:40 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jay Cliburn, Jay L. T. Cornwall, Luca Tettamanti, linux-kernel, netdev

Jeff Garzik wrote:
> Jay Cliburn wrote:
>> On Mon, 25 Jun 2007 17:57:20 -0400
>> Chris Snook <csnook@redhat.com> wrote:
>>
>>> Jay L. T. Cornwall wrote:
>>>> Chris Snook wrote:
>>>>
>>>>> What boards have we seen this on?  It's quite possible this is:
>>>> I can reproduce on an Asus P5K with a Core 2 Duo E6600.
>>>>
>>>> lspci identifies the controller as:
>>>>   02:00.0 Ethernet controller: Attansic Technology Corp. L1 Gigabit
>>>>   Ethernet Adapter (rev b0)
>>>>
>>>> dmesg notes the PCI-DMA mapping implementation:
>>>>   PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>>>>
>>> I had a hunch this was on Intel.  I'd rather just disable this when
>>> swiotlb is in use, unless we get more complaints.  It's probably
>>> ultimately a BIOS quirk anyway.
>>
>> So far we have reports from both camps:
>>
>> Asus M2N8-VMX (AM2):    1 report of lockup
>> http://sourceforge.net/mailarchive/forum.php?thread_name=46780384.063603.26165%40m12-15.163.com&forum_name=atl1-devel 
>>
>>
>> Asus P5K (LGA775):    2 reports of lockups
>> http://sourceforge.net/mailarchive/forum.php?thread_name=467E7E34.4010603%40gmail.com&forum_name=atl1-devel 
>>
>> http://lkml.org/lkml/2007/6/25/107
>>
>> The common denominator in these reports is 4GB RAM.
> 
> Although it's possible this device doesn't really support 64-bit, it's 
> more likely that this is a platform problem of some sort, or a driver 
> bug of some sort.  In the driver, maybe it has a problem when you 
> -cross- a 4GB boundary, which is not uncommon.
> 
>     Jeff

I'm going on the record to say I don't trust the chipsets on these boards, and 
I'd like anyone having these problems to let us 
(atl1-devel@lists.sourceforge.net) know if they encounter similar problems with 
any other hardware.  That said, I'm not going to stand in the way of stability 
just because it *might* be someone else's fault.  I don't think limiting 
ourselves to dma32, at least while we track this down, is much of a loss on 
current hardware.

Acked-By: Chris Snook <csnook@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] atl1: disable 64bit DMA
  2007-06-25 23:17                           ` Jeff Garzik
  2007-06-25 23:40                             ` Chris Snook
@ 2007-06-26 21:12                             ` Luca
  1 sibling, 0 replies; 24+ messages in thread
From: Luca @ 2007-06-26 21:12 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jay Cliburn, Chris Snook, Jay L. T. Cornwall, linux-kernel, netdev

On 6/26/07, Jeff Garzik <jgarzik@pobox.com> wrote:
> Jay Cliburn wrote:
> > On Mon, 25 Jun 2007 17:57:20 -0400
> > Chris Snook <csnook@redhat.com> wrote:
> >
> >> Jay L. T. Cornwall wrote:
> >>> Chris Snook wrote:
> >>>
> >>>> What boards have we seen this on?  It's quite possible this is:
> >>> I can reproduce on an Asus P5K with a Core 2 Duo E6600.
> >>>
> >>> lspci identifies the controller as:
> >>>   02:00.0 Ethernet controller: Attansic Technology Corp. L1 Gigabit
> >>>   Ethernet Adapter (rev b0)
> >>>
> >>> dmesg notes the PCI-DMA mapping implementation:
> >>>   PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> >>>
> >> I had a hunch this was on Intel.  I'd rather just disable this when
> >> swiotlb is in use, unless we get more complaints.  It's probably
> >> ultimately a BIOS quirk anyway.
> >
> > So far we have reports from both camps:
> >
> > Asus M2N8-VMX (AM2):  1 report of lockup
> > http://sourceforge.net/mailarchive/forum.php?thread_name=46780384.063603.26165%40m12-15.163.com&forum_name=atl1-devel
> >
> > Asus P5K (LGA775):    2 reports of lockups
> > http://sourceforge.net/mailarchive/forum.php?thread_name=467E7E34.4010603%40gmail.com&forum_name=atl1-devel
> > http://lkml.org/lkml/2007/6/25/107
> >
> > The common denominator in these reports is 4GB RAM.
>
> Although it's possible this device doesn't really support 64-bit, it's
> more likely that this is a platform problem of some sort, or a driver
> bug of some sort.  In the driver, maybe it has a problem when you
> -cross- a 4GB boundary, which is not uncommon.

I don't follow you :| What kind of "common" mistakes should we check for
in the driver?

Luca

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] atl1: disable 64bit DMA
  2007-06-25 21:18                 ` [PATCH] atl1: disable 64bit DMA Luca Tettamanti
  2007-06-25 21:36                   ` Chris Snook
@ 2007-06-27  0:16                   ` Jay Cliburn
  1 sibling, 0 replies; 24+ messages in thread
From: Jay Cliburn @ 2007-06-27  0:16 UTC (permalink / raw)
  To: Luca Tettamanti
  Cc: Jeff Garzik, Jay L. T. Cornwall, linux-kernel, Chris Snook, netdev

On Mon, 25 Jun 2007 23:18:55 +0200
Luca Tettamanti <kronos.it@gmail.com> wrote:

> On Mon, Jun 25, 2007 at 07:42:44AM -0500, Jay Cliburn wrote:
> > Jay L. T. Cornwall wrote:
> > >Jay Cliburn wrote:
> > >
> > >>For reasons not yet clear to me, it appears the L1 driver has a
> > >>bug or the device itself has trouble with DMA in high memory.
> > >>This patch, drafted by Luca Tettamanti, is being explored as a
> > >>workaround.  I'd be interested to know if it fixes your problem.
> > >
> > >Yes, it certainly seems to. Now running with this patch and 4GB
> > >active, I've transferred about 15GB with no problem so far. It
> > >usually oopses after a GB or two.
> > >
> > >I guess it's not an ideal solution, architecturally speaking, but
> > >it's a good deal better than an unstable driver. If there's any
> > >other patches you'd like me to test or traces to capture, I'm
> > >happy to help out. Otherwise I'll run with this one for now since
> > >it does the job!
> > 
> > Okay Jay, thanks.
> > 
> > Luca, would you please submit your patch to Jeff Garzik and netdev?
> 
> Hi Jeff,
> a couple of users reported hard lockups when using L1 NICs on machines
> with 4GB or more of RAM. We're still waiting for official confirmation
> from the vendor, but it seems that L1 has problems doing DMA to/from
> high memory (physical address above the 4GB limit). Passing 32bit DMA
> mask cures the problem.
> 
> Signed-Off-By: Luca Tettamanti <kronos.it@gmail.com>
> 
> ---
> I think that the patch should be included in 2.6.22.
> 
>  drivers/net/atl1/atl1_main.c |   15 +++------------
>  1 file changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
> index 6862c11..a730f15 100644
> --- a/drivers/net/atl1/atl1_main.c
> +++ b/drivers/net/atl1/atl1_main.c
> @@ -2097,21 +2097,16 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
>  	struct net_device *netdev;
>  	struct atl1_adapter *adapter;
>  	static int cards_found = 0;
> -	bool pci_using_64 = true;
>  	int err;
>  
>  	err = pci_enable_device(pdev);
>  	if (err)
>  		return err;
>  
> -	err = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
> +	err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
>  	if (err) {
> -		err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
> -		if (err) {
> -			dev_err(&pdev->dev, "no usable DMA configuration\n");
> -			goto err_dma;
> -		}
> -		pci_using_64 = false;
> +		dev_err(&pdev->dev, "no usable DMA configuration\n");
> +		goto err_dma;
>  	}
>  	/* Mark all PCI regions associated with PCI device
>  	 * pdev as being reserved by owner atl1_driver_name
> @@ -2176,7 +2171,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
>  
>  	netdev->ethtool_ops = &atl1_ethtool_ops;
>  	adapter->bd_number = cards_found;
> -	adapter->pci_using_64 = pci_using_64;
>  
>  	/* setup the private structure */
>  	err = atl1_sw_init(adapter);
> @@ -2193,9 +2187,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
>  	 */
>  	/* netdev->features |= NETIF_F_TSO; */
>  
> -	if (pci_using_64)
> -		netdev->features |= NETIF_F_HIGHDMA;
> -
>  	netdev->features |= NETIF_F_LLTX;
>  
>  	/*

Acked-by: Jay Cliburn <jacliburn@bellsouth.net>

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2007-06-27  0:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2007-06-22  0:07 2.6.22-rc5: pdflush oops under heavy disk load Jay L. T. Cornwall
2007-06-22 14:47 ` Chuck Ebbert
2007-06-22 15:04   ` Jay L. T. Cornwall
2007-06-23 12:14     ` Jay L. T. Cornwall
2007-06-23 17:23       ` Andrew Morton
2007-06-24  9:57         ` (Last oops is Tainted: P) " Oleg Verych
2007-06-24 10:10           ` Jay L. T. Cornwall
2007-06-24 10:57             ` Oleg Verych
2007-06-24 17:59       ` Jay Cliburn
2007-06-24 20:31         ` Jay L. T. Cornwall
2007-06-24 21:45           ` Jay Cliburn
2007-06-25 12:16             ` Attansic L1 page corruption (was: 2.6.22-rc5: pdflush oops under heavy disk load) Jay L. T. Cornwall
2007-06-25 12:42               ` Attansic L1 page corruption Jay Cliburn
2007-06-25 21:18                 ` [PATCH] atl1: disable 64bit DMA Luca Tettamanti
2007-06-25 21:36                   ` Chris Snook
2007-06-25 21:51                     ` Jay L. T. Cornwall
2007-06-25 21:57                       ` Chris Snook
2007-06-25 23:00                         ` Jay Cliburn
2007-06-25 23:17                           ` Jeff Garzik
2007-06-25 23:40                             ` Chris Snook
2007-06-26 21:12                             ` Luca
2007-06-27  0:16                   ` Jay Cliburn
2007-06-25 12:58               ` Attansic L1 page corruption (was: 2.6.22-rc5: pdflush oops under heavy disk load) Luca
2007-06-24 22:51   ` 2.6.22-rc5: pdflush oops under heavy disk load Jesper Juhl
