linux-kernel.vger.kernel.org archive mirror
* 2.6.16.32 stuck in generic_file_aio_write()
@ 2006-11-29 12:41 Igmar Palsenberg
  2006-11-29 15:20 ` Igmar Palsenberg
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Igmar Palsenberg @ 2006-11-29 12:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: erich


Hi,

I've got a machine which occasionally locks up. I can still sysrq it from 
a serial console, so it's not entirely dead.

A sysrq-t shows me that it's got a large number of httpd processes stuck 
in D state:

httpd         D F7619440  2160 11635   2057         11636       (NOTLB)
dbb7ae14 cc9b0550 c33224a0 f7619440 de187604 00000000 000000b3 00000001
       000000b3 00000000 ffffffff d374a550 c33224a0 0005b8d8 f04af800 000f75e7
       d374a550 cc9b0550 cc9b0678 ef7d33ec ef7d33e8 cc9b0550 ef7d33fc c041bf70
Call Trace:
 [<c041bf70>] __mutex_lock_slowpath+0x92/0x43e
 [<c0148f29>] generic_file_aio_write+0x5c/0xfa
 [<c0148f29>] generic_file_aio_write+0x5c/0xfa
 [<c0148f29>] generic_file_aio_write+0x5c/0xfa
 [<c01746c9>] permission+0xad/0xcb
 [<c01d9c4a>] ext3_file_write+0x3b/0xb0
 [<c0166777>] do_sync_write+0xd5/0x130
 [<c041d1bf>] _spin_unlock+0xb/0xf
 [<c0135c13>] autoremove_wake_function+0x0/0x4b
 [<c0166975>] vfs_write+0x1a3/0x1a8
 [<c0166a39>] sys_write+0x4b/0x74
 [<c0102c03>] sysenter_past_esp+0x54/0x75

After this, the machine is rendered useless (probably due to the fact that 
disk IO isn't working anymore).

The lock debugging gives me this :

D           httpd:11635 [cc9b0550, 116] blocked on mutex: [ef7d33e8] {inode_init_once}
.. held by:             httpd:  506 [d67e1000, 121]
... acquired at:               generic_file_aio_write+0x5c/0xfa


I see similar things to those mentioned in http://lkml.org/lkml/2006/1/10/64, 
with the difference that I'm not running software RAID or SATA (it's an 
Areca ARC-1110).

So far I can't reproduce it on demand; it 'just' happens. Can someone give 
me a pointer on where to start looking ?

Erich, I've CC-ed you since the machine is running an Areca RAID config. 
It's also the only disk subsystem in use in this machine.


Regards,


	Igmar
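
The D-state tasks in a large sysrq-t dump can be tallied mechanically
rather than by eye. A minimal Python sketch, assuming the task-header and
call-trace layout shown in the message above (other kernel versions print
different columns). The names blocked_tasks and wait_sites are illustrative
only and do not come from any existing tool:

```python
import re
from collections import Counter

# Task header as printed above, e.g.
# "httpd         D F7619440  2160 11635   2057         11636       (NOTLB)"
TASK_RE = re.compile(
    r"^(?P<comm>\S+)\s+(?P<state>[RSDTZX])\s+[0-9A-Fa-f]{8}\s+\d+\s+(?P<pid>\d+)\b"
)
# Call-trace frame, e.g. " [<c041bf70>] __mutex_lock_slowpath+0x92/0x43e"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(?P<sym>[A-Za-z_][\w.]*)\+0x")

# Abbreviated copy of the trace from the message above.
SAMPLE = """\
httpd         D F7619440  2160 11635   2057         11636       (NOTLB)
dbb7ae14 cc9b0550 c33224a0 f7619440 de187604 00000000 000000b3 00000001
Call Trace:
 [<c041bf70>] __mutex_lock_slowpath+0x92/0x43e
 [<c0148f29>] generic_file_aio_write+0x5c/0xfa
 [<c01d9c4a>] ext3_file_write+0x3b/0xb0
"""


def blocked_tasks(dump):
    """Return a list of (comm, pid, frames) for every task in D state."""
    tasks = []
    current = None
    for line in dump.splitlines():
        header = TASK_RE.match(line)
        if header:
            current = None  # a new task starts; forget the previous one
            if header["state"] == "D":
                current = (header["comm"], int(header["pid"]), [])
                tasks.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current[2].append(frame["sym"])
    return tasks


def wait_sites(tasks):
    """Count blocked tasks by their first frame outside the mutex code."""
    sites = Counter()
    for _comm, _pid, frames in tasks:
        site = next((f for f in frames if not f.startswith("__mutex")), "?")
        sites[site] += 1
    return sites


if __name__ == "__main__":
    for site, count in wait_sites(blocked_tasks(SAMPLE)).most_common():
        print(count, site)
```

Fed a full serial-console capture, this gives a per-function count of where
the blocked tasks are waiting. It says nothing about which task holds the
lock; that still comes from the lock-debugging output.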



* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-11-29 12:41 2.6.16.32 stuck in generic_file_aio_write() Igmar Palsenberg
@ 2006-11-29 15:20 ` Igmar Palsenberg
  2006-11-30  1:23 ` erich
  2006-12-01  5:22 ` Andrew Morton
  2 siblings, 0 replies; 21+ messages in thread
From: Igmar Palsenberg @ 2006-11-29 15:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: erich


Hi,

A followup. It crashed again, giving me :

arcmsr0: scsi id=0 lun=0 ccb='0xf7c984e0' poll command abort successfully
end_request: I/O error, dev sda, sector 3724719

and

sd 0:0:0:0: rejecting I/O to offline device
about 15k times.

I'll see if I can upgrade the RAID driver.



	Igmar


-- 
Igmar Palsenberg
JDI ICT

Zutphensestraatweg 85
6953 CJ Dieren
Tel: +31 (0)313 - 496741
Fax: +31 (0)313 - 420996
The Netherlands

mailto: i.palsenberg@jdi-ict.nl


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-11-29 12:41 2.6.16.32 stuck in generic_file_aio_write() Igmar Palsenberg
  2006-11-29 15:20 ` Igmar Palsenberg
@ 2006-11-30  1:23 ` erich
  2006-11-30  9:48   ` Igmar Palsenberg
  2006-12-01  5:22 ` Andrew Morton
  2 siblings, 1 reply; 21+ messages in thread
From: erich @ 2006-11-30  1:23 UTC (permalink / raw)
  To: Igmar Palsenberg; +Cc: linux kernel

Dear Igmar Palsenberg,

If you are running arcmsr 1.20.00.13 from the official kernel, that is the
latest version.
Could you check your RAID controller's event log and tell me what it says?
You can check "MBIOS"=>"Physical Drive Information"=>"View Drive Information"=>"Select The Drive"=>"Timeout Count"......
It can tell you which disk misbehaved and caused your RAID volume to go
offline.
The message dump from arcmsr says that something went wrong with your RAID
volume and that it was kicked out of the system.
What is your RAID config?
Areca has released new firmware (1.42).
If you are using the "sg" device with the SCSI passthrough ioctl method to
feed data into Areca's RAID volume, you need to limit each transfer to
under 512 blocks (256K).
The new firmware raises that limit to 4096 blocks (2M) per transfer.
Firmware version 1.42 is in the release process but is not yet on the
Areca FTP site.
If you need it, please let me know.

Best Regards
Erich Chen
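
The transfer limits quoted above imply 512-byte blocks (512 blocks = 256K,
4096 blocks = 2M). A minimal Python sketch of splitting a larger request
into pieces that respect such a limit follows. It only does the arithmetic
and does not talk to the sg driver. The helper names are hypothetical, and
treating the limit as inclusive ("at most 512 blocks") is an assumption,
since the message says "under 512 blocks":

```python
BLOCK_SIZE = 512  # bytes per block, implied by "512 blocks (256K)"


def blocks_to_bytes(blocks):
    """Convert a block count to bytes."""
    return blocks * BLOCK_SIZE


def split_transfer(total_blocks, max_blocks=512):
    """Yield (start_block, block_count) pieces of at most max_blocks each."""
    if total_blocks < 0 or max_blocks <= 0:
        raise ValueError("total_blocks must be >= 0 and max_blocks > 0")
    start = 0
    while start < total_blocks:
        count = min(max_blocks, total_blocks - start)
        yield start, count
        start += count


if __name__ == "__main__":
    # A 1300-block request under the old 512-block limit.
    for start, count in split_transfer(1300):
        print(start, count, blocks_to_bytes(count))
```

With the 4096-block limit described for firmware 1.42, the same request
would fit in a single piece (pass max_blocks=4096).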





* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-11-30  1:23 ` erich
@ 2006-11-30  9:48   ` Igmar Palsenberg
  0 siblings, 0 replies; 21+ messages in thread
From: Igmar Palsenberg @ 2006-11-30  9:48 UTC (permalink / raw)
  To: erich; +Cc: linux kernel


Hi,

> If you are working on arcmsr 1.20.00.13 for official kernel version.
> This is the last version.

I'm already on that version. I'll see if I can upgrade to 2.6.19 today.

> Could you check your RAID controller event and tell something to me?
> You can check "MBIOS"=>"Physical Drive Information"=>"View Drive 
> Information"=>"Select The Drive"=>"Timeout Count"......
> It could tell you which disk had bad behavior cause your RAID volume 
> offline.

I need to be in the BIOS for that, right ? I couldn't find anything useful 
with the cli32 tool.

> About the message dump from arcmsr, it said that your RAID volume had 
> something wrong and kicked out from the system.
> How about your RAID config?

CLI> disk info
Ch   ModelName        Serial#          FirmRev     Capacity  State
===============================================================================
 1   HDT722516DLA380  VDK71BTCDB90KE   V43OA91A     164.7GB  RaidSet Member(1)
 2   HDT722516DLA380  VDN71BTCDEPH7G   V43OA91A     164.7GB  RaidSet Member(1)
 3   HDT722516DLA380  VDN71BTCDES96G   V43OA91A     164.7GB  RaidSet Member(1)
 4   HDT722516DLA380  VDN71BTCDE15KG   V43OA91A     164.7GB  RaidSet Member(1)
===============================================================================

CLI> rsf info
Num Name             Disks TotalCap  FreeCap DiskChannels       State
===============================================================================
 1  Raid Set # 00        4  640.0GB    0.0GB 1234               Normal
===============================================================================

CLI> vsf info
 # Name             Raid# Level   Capacity Ch/Id/Lun  State
===============================================================================
 1 ARC-1110-VOL#00    1   Raid5    480.0GB 00/00/00   Normal
===============================================================================

A plain RAID 5 config with 4 disks. 
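
The capacities in the CLI output above are consistent with the usual RAID 5
arithmetic, where one member's worth of space goes to parity. A small
Python check follows. The 160 GB per-member figure is an inference from the
640.0GB raid set total divided by 4 disks, not something the controller
output states directly, and the function names are illustrative:

```python
def raid_set_total_gb(member_gb, disks):
    """Raw capacity of a raid set built from equally sized members."""
    return member_gb * disks


def raid5_usable_gb(member_gb, disks):
    """Usable capacity of a RAID 5 volume: (disks - 1) members hold data."""
    if disks < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return member_gb * (disks - 1)


if __name__ == "__main__":
    member_gb = 640.0 / 4  # inferred per-member capacity
    print(raid_set_total_gb(member_gb, 4))  # compare with "rsf info" TotalCap
    print(raid5_usable_gb(member_gb, 4))    # compare with "vsf info" Capacity
```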


> Areca had new firmware released (1.42).
> If you are working on "sg" device with scsi passthrough ioctl method to feed 
> data into Areca's RAID volume.
> You need to limit your data under 512 blocks (256K) each transfer.
> The new firmware will enlarge it into 4096 blocks (2M) each transfer.
> The firmware version 1.42 is on releasing procedure but not yet put it on 
> Areca ftp site.

I don't use the sg driver at all. Is the upgrade worth it ? I usually 
don't mess with firmware unless told to do so.

> If you need it, please tell me again.

Can you send it to me ? Installing it won't hurt I guess :)


Regards,


	Igmar




* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-11-29 12:41 2.6.16.32 stuck in generic_file_aio_write() Igmar Palsenberg
  2006-11-29 15:20 ` Igmar Palsenberg
  2006-11-30  1:23 ` erich
@ 2006-12-01  5:22 ` Andrew Morton
  2006-12-01  8:56   ` Igmar Palsenberg
  2 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2006-12-01  5:22 UTC (permalink / raw)
  To: Igmar Palsenberg; +Cc: linux-kernel, erich

On Wed, 29 Nov 2006 13:41:37 +0100 (CET)
Igmar Palsenberg <i.palsenberg@jdi-ict.nl> wrote:

> I've got a machine which occasionally locks up. I can still sysrq it from 
> a serial console, so it's not entirely dead.
> 
> A sysrq-t shows me that it's got a large number of httpd processes stuck 
> in D state:

There are known deadlocks in generic_file_write() in kernels up to and
including 2.6.17.  Pagefaults are involved and I'd need to see the entire
sysrq-T output to determine if you're hitting that bug.




* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-01  5:22 ` Andrew Morton
@ 2006-12-01  8:56   ` Igmar Palsenberg
  2006-12-04 21:03     ` Igmar Palsenberg
  0 siblings, 1 reply; 21+ messages in thread
From: Igmar Palsenberg @ 2006-12-01  8:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, erich



Hi,

> > I've got a machine which occasionally locks up. I can still sysrq it from 
> > a serial console, so it's not entirely dead.
> > 
> > A sysrq-t shows me that it's got a large number of httpd processes stuck 
> > in D state:
> 
> There are known deadlocks in generic_file_write() in kernels up to and
> including 2.6.17.  Pagefaults are involved and I'd need to see the entire
> sysrq-T output to determine if you're hitting that bug.

It's rather large, but for those who want to look at it : 
http://www.jdi-ict.nl/plain/serial-28112006.txt

There is also a dump from a day later, but halfway through the Areca 
controller decided to kick out the array, at which point a lot of unwritten 
data still needed to be written :)

That dump is at http://www.jdi-ict.nl/plain/serial-29112006.txt


Regards,


	Igmar



* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-01  8:56   ` Igmar Palsenberg
@ 2006-12-04 21:03     ` Igmar Palsenberg
  2006-12-06 15:17       ` Igmar Palsenberg
  0 siblings, 1 reply; 21+ messages in thread
From: Igmar Palsenberg @ 2006-12-04 21:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, erich


> It's rather large, but for those who want to look at it : 
> http://www.jdi-ict.nl/plain/serial-28112006.txt

The same problem, this time with 2.6.19. I've done a show tasks, a show 
locks, a show regs, and after that, a sync + reboot :)

Log is at http://www.jdi-ict.nl/plain/serial-04122006.txt .

If anyone needs more info : please tell me.

Regards,


	Igmar


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-04 21:03     ` Igmar Palsenberg
@ 2006-12-06 15:17       ` Igmar Palsenberg
  2006-12-06 15:40         ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Igmar Palsenberg @ 2006-12-06 15:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin


> > It's rather large, but for those who want to look at it : 
> > http://www.jdi-ict.nl/plain/serial-28112006.txt
> 
> The same problem, this time with 2.6.19. I've done a show tasks, a show 
> locks, a show regs, and after that, a sync + reboot :)
> 
> Log is at http://www.jdi-ict.nl/plain/serial-04122006.txt .
> 
> If anyone needs more info : please tell me.

Done some more digging : isn't http://lkml.org/lkml/2006/10/13/139 somehow 
related ? I do see pagefaults, and inode locks and mmap_locks. 

Regards,


	Igmar


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-06 15:17       ` Igmar Palsenberg
@ 2006-12-06 15:40         ` Andrew Morton
  2006-12-06 16:14           ` Igmar Palsenberg
  2006-12-07  9:58           ` Igmar Palsenberg
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2006-12-06 15:40 UTC (permalink / raw)
  To: Igmar Palsenberg; +Cc: linux-kernel, npiggin

On Wed, 6 Dec 2006 16:17:10 +0100 (CET)
Igmar Palsenberg <i.palsenberg@jdi-ict.nl> wrote:

> 
> > > It's rather large, but for those who want to look at it : 
> > > http://www.jdi-ict.nl/plain/serial-28112006.txt
> > 
> > The same problem, this time with 2.6.19. I've done a show tasks, a show 
> > locks, a show regs, and after that, a sync + reboot :)
> > 
> > Log is at http://www.jdi-ict.nl/plain/serial-04122006.txt .
> > 
> > If anyone needs more info : please tell me.
> 
> Done some more digging : isn't http://lkml.org/lkml/2006/10/13/139 somehow 
> related ? I do see pagefaults, and inode locks and mmap_locks. 
> 

I thought it was, but from my look through your 8-billion-task backtrace,
no task was stuck in D-state with the appropriate call trace.

So I don't know what's causing this.  In the first trace you have at least
four D-state kjournalds and a lot of processes stuck on an i_mutex.  I
guess it's consistent with an IO system which is losing completion
interrupts.  AFAICT in the second trace all you have is a lot of processes
stuck on i_mutex for no obvious reason - I don't know why that would
happen.

How long does it take for this to happen?

Yes, lockdep might find something.


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-06 15:40         ` Andrew Morton
@ 2006-12-06 16:14           ` Igmar Palsenberg
  2006-12-07  9:58           ` Igmar Palsenberg
  1 sibling, 0 replies; 21+ messages in thread
From: Igmar Palsenberg @ 2006-12-06 16:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin


> > Done some more digging : isn't http://lkml.org/lkml/2006/10/13/139 somehow 
> > related ? I do see pagefaults, and inode locks and mmap_locks. 
> > 
> 
> I thought it was, but from my look through your 8-billion-task backtrace,
> no task was stuck in D-state with the appropriate call trace.
> 
> So I don't know what's causing this.  In the first trace you have at least
> four D-state kjournalds and a lot of processes stuck on an i_mutex.  I
> guess it's consistent with an IO system which is losing completion
> interrupts. 

Hmm.. Is there any way to make sure ? I've got a second machine (almost 
identical), which doesn't show this.

The main difference is the running kernel. I've had both on the same 
kernel, and the bad machine still crashed.

/proc/interrupts

Bad machine  : 18:   11160637   11235698   IO-APIC-fasteoi   arcmsr
Good machine : 18:   61658630   79352227   IO-APIC-level  arcmsr

The bad machine is running 2.6.19, the good one 2.6.14.7-grsec, which 
probably accounts for these differences.

> AFAICT in the second trace all you have is a lot of processes
> stuck on i_mutex for no obvious reason - I don't know why that would
> happen.

It's consistent, and so are the traces.
 
> How long does it take for this to happen?

Days to a week tops. It happens less frequently with 2.6.19; 
2.6.16.32 triggered it almost daily.

> Yes, lockdep might find something.

I've enabled most debug options. I'll boot the other kernel tomorrow.



Regards,


	Igmar


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-06 15:40         ` Andrew Morton
  2006-12-06 16:14           ` Igmar Palsenberg
@ 2006-12-07  9:58           ` Igmar Palsenberg
  2006-12-07 12:29             ` Igmar Palsenberg
  1 sibling, 1 reply; 21+ messages in thread
From: Igmar Palsenberg @ 2006-12-07  9:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, erich


> I thought it was, but from my look through your 8-billion-task backtrace,
> no task was stuck in D-state with the appropriate call trace.

I was afraid of that... Where is the lock on the i_mutex supposed 
to be released ? I can't follow the code path from within an interrupt back 
to the fs layer.
 
> So I don't know what's causing this.  In the first trace you have at least
> four D-state kjournalds and a lot of processes stuck on an i_mutex.  I
> guess it's consistent with an IO system which is losing completion
> interrupts.  AFAICT in the second trace all you have is a lot of processes
> stuck on i_mutex for no obvious reason - I don't know why that would
> happen.

Is there any way to see if it is missing interrupts ? Enabling the 
debugging in the areca driver isn't a good idea on this machine: it's a
heavily IO-loaded machine, and the problem seems to take some time to occur.

It *does* happen less often with a 2.6.19 kernel however. 

The task dump takes > 10 seconds, which causes the softlockup detector to 
trigger. Is there any objection to a patch which disables the lockup 
detector during the dump ? It isn't a big issue, since all it does is dump 
a stacktrace.

I've enabled most debugging now, I'll see if I can run both a disk and VM 
stresstest.

I'll put a .config and a dmesg of the machine booting at 
http://www.jdi-ict.nl/plain/ for those who want to look at it.


Regards,


	Igmar


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-07  9:58           ` Igmar Palsenberg
@ 2006-12-07 12:29             ` Igmar Palsenberg
  2006-12-14  8:15               ` Igmar Palsenberg
  0 siblings, 1 reply; 21+ messages in thread
From: Igmar Palsenberg @ 2006-12-07 12:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, erich


> I've enabled most debugging now, I'll see if I can run both a disk and VM 
> stresstest.

Running stress now :

stress -c 2 -i 2 -m 8 -d 8 --vm-bytes 20M --vm-hang 5 --hdd-bytes 20M

I'll see what this results in.
 
> I'll put a .config and a dmesg of the machine booting at 
> http://www.jdi-ict.nl/plain/ for those who want to look at it.

dmesg : http://www.jdi-ict.nl/plain/lnx01.dmesg
Kernel config : http://www.jdi-ict.nl/plain/lnx01.config



regards,


	Igmar


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-07 12:29             ` Igmar Palsenberg
@ 2006-12-14  8:15               ` Igmar Palsenberg
  2006-12-14  8:42                 ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Igmar Palsenberg @ 2006-12-14  8:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, erich


> > I'll put a .config and a dmesg of the machine booting at 
> > http://www.jdi-ict.nl/plain/ for those who want to look at it.
> 
> dmesg : http://www.jdi-ict.nl/plain/lnx01.dmesg
> Kernel config : http://www.jdi-ict.nl/plain/lnx01.config

Hmm.. Switching CONFIG_HZ from 1000 to 250 seems to 'fix' the problem. 
I haven't seen the issue in nearly a week now. This makes Andrew's theory 
about missing interrupts very likely.

Andrew / others : Is there a way to find out if it *is* missing 
interrupts ?


Regards,


	Igmar
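
For reference, CONFIG_HZ sets the frequency of the periodic timer tick, so
the change from 1000 to 250 lengthens the tick period from 1 ms to 4 ms and
cuts the number of timer interrupts per CPU to a quarter. A small Python
sketch of that arithmetic follows (function names are illustrative).
Whether the lower tick rate actually explains the improvement is not
established anywhere in this thread:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tick_period_ms(hz):
    """Length of one timer tick in milliseconds for a given CONFIG_HZ."""
    return 1000.0 / hz


def ticks_per_day(hz):
    """Number of periodic timer ticks in 24 hours at a given CONFIG_HZ."""
    return hz * SECONDS_PER_DAY


if __name__ == "__main__":
    for hz in (1000, 250):
        print(hz, tick_period_ms(hz), ticks_per_day(hz))
```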


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-14  8:15               ` Igmar Palsenberg
@ 2006-12-14  8:42                 ` Andrew Morton
  2006-12-14  8:55                   ` Igmar Palsenberg
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2006-12-14  8:42 UTC (permalink / raw)
  To: Igmar Palsenberg; +Cc: linux-kernel, npiggin, erich

On Thu, 14 Dec 2006 09:15:39 +0100 (CET)
Igmar Palsenberg <i.palsenberg@jdi-ict.nl> wrote:

> 
> > > I'll put a .config and a dmesg of the machine booting at 
> > > http://www.jdi-ict.nl/plain/ for those who want to look at it.
> > 
> > dmesg : http://www.jdi-ict.nl/plain/lnx01.dmesg
> > Kernel config : http://www.jdi-ict.nl/plain/lnx01.config
> 
> Hmm.. Switching CONFIG_HZ from 1000 to 250 seems to 'fix' the problem. 
> I haven't seen the issue in nearly a week now. This makes Andrew's theory 
> about missing interrupts very likely.
> 
> Andrew / others : Is there a way to find out if it *is* missing 
> interrupts ?
> 

umm, nasty.  What's in /proc/interrupts?


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-14  8:42                 ` Andrew Morton
@ 2006-12-14  8:55                   ` Igmar Palsenberg
  2006-12-14  9:10                     ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Igmar Palsenberg @ 2006-12-14  8:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, erich


> > Hmm.. Switching CONFIG_HZ from 1000 to 250 seems to 'fix' the problem. 
> > I haven't seen the issue in nearly a week now. This makes Andrew's theory 
> > about missing interrupts very likely.
> > 
> > Andrew / others : Is there a way to find out if it *is* missing 
> > interrupts ?
> > 
> 
> umm, nasty.  What's in /proc/interrupts?

See below. The other machine is mostly identical, except for i8042 
missing (probably due to running an older kernel, or small differences in 
the kernel config).

Regards,


	Igmar

[jdiict@lnx01 ~]$ cat /proc/interrupts
           CPU0       CPU1
  0:   73702693   74509271   IO-APIC-edge      timer
  1:          1          1   IO-APIC-edge      i8042
  4:       2289       8389   IO-APIC-edge      serial
  8:          0          1   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          3          1   IO-APIC-edge      i8042
 16:  203127788          0   IO-APIC-fasteoi   uhci_hcd:usb2, eth0
 17:        525        492   IO-APIC-fasteoi   uhci_hcd:usb4
 18:   13000070   67584889   IO-APIC-fasteoi   arcmsr
 19:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 20:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
NMI:          0          0
LOC:  148127756  148133476
ERR:          0
MIS:          0
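
One rough way to approach the "is it missing interrupts?" question from
userspace is to take two snapshots of /proc/interrupts some seconds apart
while disk I/O is known to be outstanding, and check whether the arcmsr
counter moved. A minimal Python sketch follows, assuming the column layout
shown above. The helper names are hypothetical, and a counter that stands
still is only a hint: it cannot tell a lost interrupt from a controller
that never raised one.

```python
# Abbreviated copy of the /proc/interrupts output shown above.
SAMPLE = """\
           CPU0       CPU1
  0:   73702693   74509271   IO-APIC-edge      timer
 16:  203127788          0   IO-APIC-fasteoi   uhci_hcd:usb2, eth0
 18:   13000070   67584889   IO-APIC-fasteoi   arcmsr
ERR:          0
"""


def parse_interrupts(text):
    """Map each IRQ label to (per-CPU counts, remaining description)."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        while fields and len(counts) < ncpus and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        table[label.strip()] = (counts, " ".join(fields))
    return table


def stalled(before, after, device):
    """Return labels of IRQs for `device` whose total count did not advance."""
    result = []
    for label, (counts, desc) in after.items():
        if device in desc and label in before:
            if sum(counts) == sum(before[label][0]):
                result.append(label)
    return result


if __name__ == "__main__":
    import time

    with open("/proc/interrupts") as f:
        first = parse_interrupts(f.read())
    time.sleep(5)
    with open("/proc/interrupts") as f:
        second = parse_interrupts(f.read())
    print("arcmsr IRQs that did not advance:", stalled(first, second, "arcmsr"))
```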


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-14  8:55                   ` Igmar Palsenberg
@ 2006-12-14  9:10                     ` Andrew Morton
  2006-12-14  9:25                       ` Igmar Palsenberg
  2007-02-05 10:24                       ` Igmar Palsenberg
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2006-12-14  9:10 UTC (permalink / raw)
  To: Igmar Palsenberg; +Cc: linux-kernel, npiggin, erich

On Thu, 14 Dec 2006 09:55:38 +0100 (CET)
Igmar Palsenberg <i.palsenberg@jdi-ict.nl> wrote:

> 
> > > Hmm.. Switching CONFIG_HZ from 1000 to 250 seems to 'fix' the problem. 
> > > I haven't seen the issue in nearly a week now. This makes Andrew's theory 
> > > about missing interrupts very likely.
> > > 
> > > Andrew / others : Is there a way to find out if it *is* missing 
> > > interrupts ?
> > > 
> > 
> > umm, nasty.  What's in /proc/interrupts?
> 
> > See below. The other machine is mostly identical, except for i8042 
> missing (probably due to running an older kernel, or small differences in 
> the kernel config).
> 

Does the other machine have the same problems?

Are you able to rule out a hardware failure?

> [jdiict@lnx01 ~]$ cat /proc/interrupts
>            CPU0       CPU1
>   0:   73702693   74509271   IO-APIC-edge      timer
>   1:          1          1   IO-APIC-edge      i8042
>   4:       2289       8389   IO-APIC-edge      serial
>   8:          0          1   IO-APIC-edge      rtc
>   9:          0          0   IO-APIC-fasteoi   acpi
>  12:          3          1   IO-APIC-edge      i8042
>  16:  203127788          0   IO-APIC-fasteoi   uhci_hcd:usb2, eth0
>  17:        525        492   IO-APIC-fasteoi   uhci_hcd:usb4
>  18:   13000070   67584889   IO-APIC-fasteoi   arcmsr
>  19:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
>  20:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
> NMI:          0          0
> LOC:  148127756  148133476
> ERR:          0
> MIS:          0

The disk interrupt is unshared, which rules out a few software problems, I
guess.
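
Whether an interrupt line is shared can be read off the last column of
/proc/interrupts, where several handlers on one line are separated by
commas (IRQ 16 above carries both uhci_hcd:usb2 and eth0, while IRQ 18
carries only arcmsr). A minimal Python sketch follows, assuming the row
layout shown above; the function name handlers is illustrative:

```python
def handlers(row):
    """Return the handler names registered on one /proc/interrupts row."""
    _label, _, rest = row.partition(":")
    fields = rest.split()
    # Drop the per-CPU counters that lead the row.
    while fields and fields[0].isdigit():
        fields.pop(0)
    # The next field is the interrupt chip/type (e.g. IO-APIC-fasteoi);
    # everything after it is the comma-separated handler list.
    names = " ".join(fields[1:]).split(",")
    return [name.strip() for name in names if name.strip()]


if __name__ == "__main__":
    rows = [
        " 16:  203127788          0   IO-APIC-fasteoi   uhci_hcd:usb2, eth0",
        " 18:   13000070   67584889   IO-APIC-fasteoi   arcmsr",
    ]
    for row in rows:
        names = handlers(row)
        print(names, "shared" if len(names) > 1 else "unshared")
```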



* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-14  9:10                     ` Andrew Morton
@ 2006-12-14  9:25                       ` Igmar Palsenberg
  2007-02-05 10:24                       ` Igmar Palsenberg
  1 sibling, 0 replies; 21+ messages in thread
From: Igmar Palsenberg @ 2006-12-14  9:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, erich


> > See below. The other machine is mostly identical, except for i8042 
> > missing (probably due to running an older kernel, or small differences in 
> > the kernel config).
> > 
> 
> Does the other machine have the same problems?

No, but that machine has a lot less disk and network activity.
 
> Are you able to rule out a hardware failure?

100% ? No, but the hardware is relatively new (about a year old), and of 
good quality. It's hard to reproduce, so looking at it when it starts to 
fault isn't possible either :(

> The disk interrupt is unshared, which rules out a few software problems, I
> guess.

Indeed. Bah, I hate this kind of thing :(



Regards,


	Igmar


* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2006-12-14  9:10                     ` Andrew Morton
  2006-12-14  9:25                       ` Igmar Palsenberg
@ 2007-02-05 10:24                       ` Igmar Palsenberg
  2007-02-06  2:42                         ` erich
  1 sibling, 1 reply; 21+ messages in thread
From: Igmar Palsenberg @ 2007-02-05 10:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, erich


> Does the other machine have the same problems?

It does. It seems to depend on the interrupt frequency : Setting CONFIG_HZ=250
makes it only appear once a month or so; with CONFIG_HZ=1000, it will 
occur within a week. It happens a lot less on the other machine, 
which isn't under as much disk load as the first one.
 
> Are you able to rule out a hardware failure?

Well.. It's too much of a coincidence that 2 (almost identical) machines 
show the same weird behaviour. What strikes me is that only *disk* 
interrupts stop being handled after a while. The machine itself is alive, 
just all disk IO is blocked, which makes it pretty much useless. 

Erich, could this be some sort of hardware problem ? I know it's a PITA to 
reproduce, but setting CONFIG_HZ to 1000 and bashing the machine with 
disk activity seems to help :)


Regards,


	Igmar



* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2007-02-05 10:24                       ` Igmar Palsenberg
@ 2007-02-06  2:42                         ` erich
  2007-02-12  9:26                           ` Igmar Palsenberg
  2007-02-19 13:25                           ` Igmar Palsenberg
  0 siblings, 2 replies; 21+ messages in thread
From: erich @ 2007-02-06  2:42 UTC (permalink / raw)
  To: Igmar Palsenberg; +Cc: Andrew Morton, linux kernel, npiggin

Dear Igmar Palsenberg,

I cannot be sure that it is a hardware problem, but I am interested in 
reproducing this case.
If you tell me your platform's configuration, I will try it and give you a 
good solution.
Is your RAID adapter's firmware at version 1.42?
This version of the Areca firmware fixes some hardware bugs and the 
handling of rare sg lengths.

Best Regards
Erich Chen




* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2007-02-06  2:42                         ` erich
@ 2007-02-12  9:26                           ` Igmar Palsenberg
  2007-02-19 13:25                           ` Igmar Palsenberg
  1 sibling, 0 replies; 21+ messages in thread
From: Igmar Palsenberg @ 2007-02-12  9:26 UTC (permalink / raw)
  To: erich; +Cc: Andrew Morton, linux kernel, npiggin


Hi,

> I can not make sure it is hardware problem, but I have interest in this 
> case's reproducing.
> If you tell me your platform's construction, I will try it and give you good 
> solution.

The machines giving problems are almost identical when it comes to 
hardware specs :

Intel SE7520BD2 mainboard (SE7520 chipset)
Dual Intel Xeon 2.8 GHz (other machine : Dual Xeon 3.2 GHz)
4 GB PC3200 ECC (400 MHz) Corsair (other machine : 2GB PC3200 ECC)

> Does your RAID adapter's firmware version work on 1.42?
> Areca firmware had fix some hardware bugs and rare sg length handle in this 
> version.

It's currently at 1.41. I'll see if I can upgrade it to 1.42. For now, 
I've put all available stack traces from when it hung at 
http://www.jdi-ict.nl/areca, together with an lspci -v -v and a copy of the 
kernel's .config

Please let me know if you need anything else.



Regards,


	Igmar






* Re: 2.6.16.32 stuck in generic_file_aio_write()
  2007-02-06  2:42                         ` erich
  2007-02-12  9:26                           ` Igmar Palsenberg
@ 2007-02-19 13:25                           ` Igmar Palsenberg
  1 sibling, 0 replies; 21+ messages in thread
From: Igmar Palsenberg @ 2007-02-19 13:25 UTC (permalink / raw)
  To: erich; +Cc: Andrew Morton, linux kernel, npiggin


Hi,

> I can not make sure it is hardware problem, but I have interest in this 
> case's reproducing.
> If you tell me your platform's construction, I will try it and give you good 
> solution.
> Does your RAID adapter's firmware version work on 1.42?
> Areca firmware had fix some hardware bugs and rare sg length handle in this 
> version.

I've hacked up the sysrq code so that it gives me another command : j , 
which dumps the current IRQ status on the console :

SysRq : Show IRQ status
......
Showing info for IRQ 14
status         :
depth          : 0
wake_depth     : 0
irq_count      : 38717
irqs_unhandled : 0

Showing info for IRQ 15
status         : DISABLED
depth          : 1
wake_depth     : 0
irq_count      : 22
irqs_unhandled : 0

which is the (incomplete) result on my machine after loading a module 
that does disable_irq(15) on module load.

I've put the patch at http://www.jdi-ict.nl/areca/sysrq-j.patch
I'll do a follow-up when anything useful comes out.


Regards,


	Igmar
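
Output in the format shown above is easy to post-process when captured from
a serial console. A minimal Python sketch follows that picks out the IRQs
reported as DISABLED. It assumes exactly the field names printed in the
message (status, depth, wake_depth, irq_count, irqs_unhandled), which come
from the author's own patch and not from a mainline sysrq command; the
function names are illustrative:

```python
# Copy of the sysrq-j excerpt shown above.
SAMPLE = """\
SysRq : Show IRQ status
......
Showing info for IRQ 14
status         :
depth          : 0
wake_depth     : 0
irq_count      : 38717
irqs_unhandled : 0

Showing info for IRQ 15
status         : DISABLED
depth          : 1
wake_depth     : 0
irq_count      : 22
irqs_unhandled : 0
"""


def parse_irq_status(text):
    """Map IRQ number -> {field name: value string} from the dump."""
    irqs = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Showing info for IRQ"):
            current = irqs.setdefault(int(line.split()[-1]), {})
        elif ":" in line and current is not None:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return irqs


def disabled_irqs(irqs):
    """Return the IRQ numbers whose status field contains DISABLED."""
    return sorted(n for n, f in irqs.items() if "DISABLED" in f.get("status", ""))


if __name__ == "__main__":
    print(disabled_irqs(parse_irq_status(SAMPLE)))
```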

