linux-kernel.vger.kernel.org archive mirror
* stability problems with 2.4.24/Software RAID/ext3
@ 2004-01-08 15:12 martin f krafft
  2004-01-08 17:03 ` Marcelo Tosatti
  2004-01-09 18:11 ` Martin Josefsson
  0 siblings, 2 replies; 13+ messages in thread
From: martin f krafft @ 2004-01-08 15:12 UTC (permalink / raw)
  To: linux kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 10053 bytes --]

Hi all,

I operate a groupware server which is giving me a very hard time.
It's an AMD Athlon XP 3000+ with 1 GB of RAM and four 7200 RPM IDE
hard drives, two attached to the primary channels of the onboard
controller, and two to the primary channels of a Promise 20269 EIDE
controller. The kernel is a 2.4.24 with the configuration I placed
here:

  ftp://ftp.madduck.net/scratch/config-2.4.24-gaia.gz 

The system is configured with seven software RAID arrays and three swap partitions:

  md1:  /boot      (ext3) RAID 1 spanning hda1 and hdc1
  md5:  /          (ext3) RAID 5 hda5/hdc5/hde5 with hdg5 as a spare
  md6:  /usr       (ext3) RAID 5 hda6/hdc6/hde6 with hdg6 as a spare
  md7:  /var       (ext3) RAID 5 hda7/hdc7/hde7 with hdg7 as a spare
  md8:  /usr/local (ext3) RAID 5 hda8/hdc8/hde8 with hdg8 as a spare
  md9:  /home      (ext3) RAID 5 hda9/hdc9/hde9 with hdg9 as a spare
  md10: /tmp       (ext3) RAID 5 hda10/hdc10/hde10 with hdg10 as a spare

  hda2 holds a non-RAID rescue system with RAID 1/5 support

  hdc2, hde2, hdg2 are swap partitions of 256 MB each.

  hde1 and hdg1 are unused.

The individual hard disks are identically tweaked with hdparm:

  hdparm -A1 -B255 -c1 -d1 -p -u0 -W0 -Xudma6 /dev/hd{a,c,e,g}

See the end of this mail for details.
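
For readers less familiar with hdparm, the following is a minimal Python sketch that splits the command line above into its switches and attaches a short description to each. The descriptions are paraphrased from the hdparm manual page; the names `COMMAND`, `FLAG_MEANINGS` and `explain` are made up for this illustration and are not part of any tool.

```python
import shlex

# The command quoted above, kept verbatim (brace pattern left unexpanded).
COMMAND = "hdparm -A1 -B255 -c1 -d1 -p -u0 -W0 -Xudma6 /dev/hd{a,c,e,g}"

# Short descriptions paraphrased from the hdparm manual page.
FLAG_MEANINGS = {
    "-A": "drive read-lookahead (1 = on)",
    "-B": "Advanced Power Management level (255 = disabled)",
    "-c": "32-bit I/O support (1 = on)",
    "-d": "using_dma flag (1 = on)",
    "-p": "PIO mode tuning",
    "-u": "unmaskirq flag (0 = off)",
    "-W": "drive write cache (0 = off)",
    "-X": "transfer mode (udma6 = UltraDMA mode 6)",
}


def explain(command):
    """Return (flag, value, meaning) for every switch in an hdparm command."""
    rows = []
    for token in shlex.split(command)[1:]:  # drop the program name
        if not token.startswith("-"):
            continue  # device paths are not switches
        flag, value = token[:2], token[2:]
        rows.append((flag, value, FLAG_MEANINGS.get(flag, "unknown")))
    return rows


if __name__ == "__main__":
    for flag, value, meaning in explain(COMMAND):
        print(f"{flag:3} {value:6} {meaning}")
```

Of these switches, `-d` is the one that matters later in the thread, since the suggested experiment is to turn DMA off per drive.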

The system experiences severe stability problems, which I relate to
the filesystem, RAID, or controller code, because it's reproducible
with excessive disk operations. E.g., doing something like

  rsync -a --exclude /tmp / /tmp/dump

will most likely crash the system with a kernel oops. This kernel
oops is not recorded in the log, but I took it down as follows:
  
  kernel: Unable to handle kernel paging request at virtual address 00529610
  kernel:  printing eip:
  kernel: c01c7f41
  kernel: *pde = 00000000
  kernel: Oops: 0002
  kernel: CPU:    0
  kernel: EIP:    0010:[__remove_inode_queue+17/48]    Not tainted
  kernel: EFLAGS: 00010202
  kernel: eax: cef76320   ebx: cc529590   ecx: 00529610   edx : cc529540
  kernel: esi: cc529540   edi: c1e59510   ebp: cc4e7cc0   esp : f3a55e54
  kernel: ds: 0018   es: 0018   ss: 0018
  kernel: Process kjournald (pid: 24176, stackpage=f3a55000)
  kernel: Stack: 00000000 c01c862a cc529540 c02029d8 cc529540 c1e59ea0 c01fec42 cc529540 
  kernel:        00000040 f3a55ea4 00000d0d f7ee8280 f6965d34 00000000 00000000 00000000 
  kernel:        0000000f cb3b3840 e6e308a0 00000d0d cc0ec9c0 cc0eca40 cc0ec0c0 cc149bc0 
  kernel: Call Trace:    [__refile_buffer+106/112] [journal_free_journal_head+24/32] [journal_commit_transaction+4066/4352] [kjournald+263/464] [commit_timeout+0/16]
  kernel:   [arch_kernel_thread+43/64] [kjournald+0/464]
  kernel: 
  kernel: Code: 89 01 c7 43 04 00 00 00 00 c7 42 50 00 00 00 00 b8 09 00 00 

  kernel:  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
  kernel:  printing eip:
  kernel: c01be950
  kernel: *pde = 00000000
  kernel: Oops: 0000
  kernel: CPU:    0
  kernel: EIP:    0010:[kmem_cache_reap+128/448]    Not tainted
  kernel: EFLAGS: 00010013
  kernel: eax: 00000000   ebx: 00000001   ecx: c1c0d348   edx : c1c0d358
  kernel: esi: 00000000   edi: 00000005   ebp: 00000000   esp : c1c33f38
  kernel: ds: 0018   es: 0018   ss: 0018
  kernel: Process kswapd (pid: 4, stackpage=c1c33000)
  kernel: Stack: c1240260 000001d0 c1c0d348 00000000 00000000 00000000 00000020 000001d0 
  kernel:        c0102aa0 00000006 c01bf646 00000006 c0102aa0 c0102aa0 000001d0 00000006 
  kernel:        c0102aa0 00000000 c01bf706 00000020 c0102aa0 c1c32000 c0102940 c01bf824 
  kernel: Call Trace:    [shrink_caches+38/176] [try_to_free_pages_zone+54/96] [kswapd_balance_pgdat+84/160] [kswapd_balance+25/48] [kswapd+141/176]

Since the two crashes are related to kswapd and kjournald, I would
assume it's the underlying RAID code that's problematic. However,
maybe you can extract more information from the above crashes.
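
When several oopses have to be compared, it can help to reduce each one to the faulting process and the symbol at EIP. The following Python sketch does that for text in the 2.4-style format shown above. The sample string contains only lines copied from the two oopses in this mail; the function name `summarize_oopses` and the regular expressions are assumptions made for this illustration, not an existing tool.

```python
import re

# EIP and Process lines copied from the two oopses above.
OOPS_TEXT = """\
kernel: EIP:    0010:[__remove_inode_queue+17/48]    Not tainted
kernel: Process kjournald (pid: 24176, stackpage=f3a55000)
kernel: EIP:    0010:[kmem_cache_reap+128/448]    Not tainted
kernel: Process kswapd (pid: 4, stackpage=c1c33000)
"""

# Matches "EIP:    0010:[symbol+offset/size]".
EIP_RE = re.compile(r"EIP:\s+[0-9a-f]{4}:\[(\w+)\+(\d+)/(\d+)\]")
# Matches "Process name (pid: N,".
PROC_RE = re.compile(r"Process (\S+) \(pid: (\d+),")


def summarize_oopses(text):
    """Return one (process, pid, symbol, offset) tuple per oops in text."""
    results = []
    symbol = offset = None
    for line in text.splitlines():
        m = EIP_RE.search(line)
        if m:
            symbol, offset = m.group(1), int(m.group(2))
            continue
        m = PROC_RE.search(line)
        if m and symbol is not None:
            results.append((m.group(1), int(m.group(2)), symbol, offset))
            symbol = offset = None
    return results


if __name__ == "__main__":
    for process, pid, symbol, offset in summarize_oopses(OOPS_TEXT):
        print(f"{process} (pid {pid}) faulted in {symbol}+{offset}")
```

A summary like this only shows where each fault surfaced. Faults in unrelated subsystems (journalling, slab reaping, mmap, procfs) are at least consistent with memory being corrupted elsewhere, but the sketch cannot prove that.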

The following is a snapshot from `vmstat 1` prior to a regular
kernel panic, which resulted in a reboot (thanks to the sysctl
kernel.panic = 60):

 1  1  2  10184  12344  47020 749912   0   0     0  4344  382   308   0   1  99
 0  1  1  10184  12344  47020 749912   0   0     0  5936  395   334   0   2  98
 0  1  1  10184  12332  47020 749916   0   0     4  4808  379   330   0   3  97
 0  1  1  10184  12332  47020 749916   0   0     0  5008  342   277   1   0  99
 0  1  2  10184  12328  47024 749916   0   0     0  5120  330   293   0   4  96
 0  3  2  10184  12356  47040 750108   0   0    64  4772  367   360   0   3  97
 0  1  1  10184  12460  47052 749704   0   0  1220  6236  352   390   1   4  95
 0  1  1  10184  12044  47052 750096   0   0  2176  6772  371   497   6   5  89
 0  1  1  10184  12388  47060 749704   0   0   324  7732  367   376   0   6  94
 0  1  2  10184  12512  47068 749824   0   0    56  7448  365   312   0   1  99
 0  1  1  10184  12832  47080 749444   0   0   424  6648  368   363   0   3  97
 0  1  1  10184  11884  47092 750156   0   0  2304  7960  416   504   1   8  91
 2  0  1  10184  12708  47100 749284   0   0  1772  6836  370   462   5   4  91

Interestingly, just now, the machine crashed differently. `vmstat 1`
was still running, but new processes could not be started, after the
kernel reported a lot of oopses in user-space processes (e.g. rsync,
top, zsh), as well as some of the kjournald oopses like above.
I have included the footprint of the user-space program oopses
further down. `vmstat 1` was happily printing the following away,
when the system was already unusable. The b > 127 value is
interesting, as it has been continuously increasing (well, in
a non-decreasing way) after a certain point, and somewhere on the
way, the system reached the state of agnosia.

 0 127  2  16124  10304  43004 682188   0   0     0     0  109     7   0   0 100
 0 127  2  16124  10304  43004 682188   0   0     0     0  111     5   0   0 100
 0 127  2  16124  10304  43004 682188   0   0     0     0  114     9   0   0 100
 0 127  2  16124  10304  43004 682188   0   0     0     0  111     5   0   0 100
 0 127  2  16124  10304  43004 682188   0   0     0     0  115     9   0   0 100
 0 128  2  16124  10420  43004 682060   0   0     0     0  119    12   0   0 100
 0 128  2  16124  10420  43004 682060   0   0     0     0  122    11   0   0 100
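
The pattern described above (the b column climbing and never coming back down) can be checked mechanically. The following Python sketch assumes the column order of the output shown here, where b is the second field, and uses an arbitrary threshold of 100 blocked processes over the last five samples. Both values, and the names `blocked_counts` and `looks_wedged`, are assumptions for this illustration.

```python
# Rows copied from the two vmstat snapshots above.
HEALTHY = [
    " 1  1  2  10184  12344  47020 749912   0   0     0  4344  382   308   0   1  99",
    " 0  1  1  10184  12344  47020 749912   0   0     0  5936  395   334   0   2  98",
    " 0  1  1  10184  12332  47020 749916   0   0     4  4808  379   330   0   3  97",
    " 0  1  1  10184  12332  47020 749916   0   0     0  5008  342   277   1   0  99",
    " 0  1  2  10184  12328  47024 749916   0   0     0  5120  330   293   0   4  96",
]
HUNG = [
    " 0 127  2  16124  10304  43004 682188   0   0     0     0  114     9   0   0 100",
    " 0 127  2  16124  10304  43004 682188   0   0     0     0  111     5   0   0 100",
    " 0 127  2  16124  10304  43004 682188   0   0     0     0  115     9   0   0 100",
    " 0 128  2  16124  10420  43004 682060   0   0     0     0  119    12   0   0 100",
    " 0 128  2  16124  10420  43004 682060   0   0     0     0  122    11   0   0 100",
]


def blocked_counts(vmstat_lines):
    """Extract the b column (second field) from vmstat data rows."""
    counts = []
    for line in vmstat_lines:
        fields = line.split()
        # Header and blank lines are skipped: data rows are all-numeric.
        if len(fields) >= 2 and all(f.isdigit() for f in fields):
            counts.append(int(fields[1]))
    return counts


def looks_wedged(counts, threshold=100, window=5):
    """Heuristic: the last `window` samples never decrease and all exceed `threshold`."""
    recent = counts[-window:]
    if len(recent) < window:
        return False
    non_decreasing = all(a <= b for a, b in zip(recent, recent[1:]))
    return non_decreasing and min(recent) > threshold


if __name__ == "__main__":
    print("healthy sample wedged:", looks_wedged(blocked_counts(HEALTHY)))
    print("hung sample wedged:", looks_wedged(blocked_counts(HUNG)))
```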

Apart from these panics and hangups, the system also randomly issues
segfaults to processes, or reports a kernel oops. These take the
following form:

  kernel: kernel BUG at mmap.c:842!
  kernel: invalid operand: 0000
  kernel: CPU:    0
  kernel: EIP:    0010:[find_vma_prev+124/176]    Not tainted
  kernel: EFLAGS: 00010206
  kernel: eax: c7ce4dc0   ebx: c7ce4e40   ecx: c7ce4dd8   edx: c95fde90
  kernel: esi: 4e968000   edi: c7ce4658   ebp: d16b8ec0   esp: c95fde50
  kernel: ds: 0018   es: 0018   ss: 0018
  kernel: Process python2.1 (pid: 24868, stackpage=c95fd000)
  kernel: Stack: c7ce4e40 4e968000 00001000 d16b8ec0 c01b7104 d16b8ec0 4e968000 c95fde90 
  kernel:        c01d116d e70b82c0 4e93d000 00001000 c01b6a44 d16b8ec0 4e93d000 e70b82c0 
  kernel:        c7ce4dc0 c7ce4e40 00000000 4e968000 00001000 c01b6550 d16b8ec0 4e968000 
  kernel: Call Trace:    [do_munmap+132/432] [link_path_walk+1309/1776] [get_unmapped_area+164/320] [do_mmap_pgoff+400/1504] [old_mmap+269/336]
  kernel:   [system_call+51/80] [sys_fstat64+73/128] [system_call+77/80]
  kernel: 
  kernel: Code: 0f 0b 4a 03 80 86 34 c0 89 d8 5b 5e 5f 5d c3 39 5d 00 eb ea 

or:

  kernel: Unable to handle kernel paging request at virtual address 712e746b
  kernel:  printing eip:
  kernel: c01eb950
  kernel: *pde = 00000000
  kernel: Oops: 0000
  kernel: CPU:    0
  kernel: EIP:    0010:[proc_pid_stat+144/800]    Not tainted
  kernel: EFLAGS: 00010206
  kernel: eax: dd95e5ad   ebx: d0988500   ecx: d098851c   edx: 712e7463
  kernel: esi: f5138000   edi: d5ce25ad   ebp: 000003ff   esp: f3a9de14
  kernel: ds: 0018   es: 0018   ss: 0018
  kernel: Process top (pid: 26768, stackpage=f3a9d000)
  kernel: Stack: c01e9eb9 f5138000 c0361e64 cbc4f1c0 cbc4f1c0 c01ea17e e70b8c40 cbc4f1c0 
  kernel:        0000000b 00000004 f5138000 ffffffea fffffff4 cbc4f82c cbc4f7c0 e70b8c40 
  kernel:        c01d0b03 cbc4f7c0 e70b8c40 e70b8c40 e5ac300e fffffffe f3a9df0c c01d116d 
  kernel: Call Trace:    [proc_pid_make_inode+121/160] [proc_base_lookup+254/560] [real_lookup+195/240] [link_path_walk+1309/1776] [get_empty_filp+77/288]
  kernel:   [proc_info_read+87/272] [filp_open+98/112] [sys_read+163/304] [system_call+51/80] [sys_close+78/96] [system_call+77/80]
  kernel: 
  kernel: Code: 8b 42 08 2b 42 04 8b 52 0c 01 c7 85 d2 75 f1 ba ff ff ff ff 

Thanks for any hints or pointers!

hdparm configuration:

  multcount    = 16 (on)
  I/O support  =  1 (32-bit)
  unmaskirq    =  0 (off)
  using_dma    =  1 (on)
  keepsettings =  0 (off)
  nowerr       =  0 (off)
  readonly     =  0 (off)
  readahead    =  6 (on)
  geometry     = 238216/16/63, sectors = 240121728, start = 0
  busstate     =  1 (on)
  Model=Maxtor 6Y120L0, FwRev=YAR41BW0, SerialNo=Y31GHARE
  Config={ Fixed }
  RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
  BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
  CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=240121728
  IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
  PIO modes: pio0 pio1 pio2 pio3 pio4 
  DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
  AdvancedPM=yes: disabled (255) WriteCache=enabled
  Drive Supports : ataATA-1 ATA-2 ATA-3 ATA-4 ATA-5 ATA-6 ATA-7 

-- 
martin;              (greetings from the heart of the sun.)
  \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck
 
invalid/expired pgp subkeys? use subkeys.pgp.net as keyserver!
 
"the vast majority of our imports come from outside the country."  
                                                      - george w. bush 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-08 15:12 stability problems with 2.4.24/Software RAID/ext3 martin f krafft
@ 2004-01-08 17:03 ` Marcelo Tosatti
  2004-01-08 17:10   ` Marcelo Tosatti
  2004-01-08 17:37   ` Martin F Krafft
  2004-01-09 18:11 ` Martin Josefsson
  1 sibling, 2 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2004-01-08 17:03 UTC (permalink / raw)
  To: martin f krafft; +Cc: linux kernel mailing list, Bartlomiej Zolnierkiewicz



On Thu, 8 Jan 2004, martin f krafft wrote:

> Hi all,
>
> I operate a groupware server which is giving me a very hard time.
> It's an AMD Athlon XP 3000+ with 1 GB of RAM and four 7200 RPM IDE
> hard drives, two attached to the primary channels of the onboard
> controller, and two to the primary channels of a Promise 20269 EIDE
> controller. The kernel is a 2.4.24 with the configuration I placed
> here:
>   ftp://ftp.madduck.net/scratch/config-2.4.24-gaia.gz
>
> The system is configured with seven software RAID arrays and three swap partitions:
>
>   md1:  /boot      (ext3) RAID 1 spanning hda1 and hdc1
>   md5:  /          (ext3) RAID 5 hda5/hdc5/hde5 with hdg5 as a spare
>   md6:  /usr       (ext3) RAID 5 hda6/hdc6/hde6 with hdg6 as a spare
>   md7:  /var       (ext3) RAID 5 hda7/hdc7/hde7 with hdg7 as a spare
>   md8:  /usr/local (ext3) RAID 5 hda8/hdc8/hde8 with hdg8 as a spare
>   md9:  /home      (ext3) RAID 5 hda9/hdc9/hde9 with hdg9 as a spare
>   md10: /tmp       (ext3) RAID 5 hda10/hdc10/hde10 with hdg10 as a spare
>
>   hda2 holds a non-RAID rescue system with RAID 1/5 support
>
>   hdc2, hde2, hdg2 are swap partitions of 256 MB each.
>
>   hde1 and hdg1 are unused.
>
> The individual hard disks are identically tweaked with hdparm:
>
>   hdparm -A1 -B255 -c1 -d1 -p -u0 -W0 -Xudma6 /dev/hd{a,c,e,g}
>
> See the end of this mail for details.
>
> The system experiences severe stability problems, which I relate to
> the filesystem, RAID, or controller code, because it's reproducible
> with excessive disk operations. E.g., doing something like
>
>   rsync -a --exclude /tmp / /tmp/dump

<snip>

> Interestingly, just now, the machine crashed differently. `vmstat 1`
> was still running, but new processes could not be started, after the
> kernel reported a lot of oopses in user-space processes (e.g. rsync,
> top, zsh), as well as some of the kjournald oopses like above.
> I have included the footprint of the user-space program oopses
> further down. `vmstat 1` was happily printing the following away,
> when the system was already unusable. The b > 127 value is
> interesting, as it has been continuously increasing (well, in
> a non-decreasing way) after a certain point, and somewhere on the
> way, the system reached the state of agnosia.
>
>  0 127  2  16124  10304  43004 682188   0   0     0     0  109     7   0   0 100
>  0 127  2  16124  10304  43004 682188   0   0     0     0  111     5   0   0 100
>  0 127  2  16124  10304  43004 682188   0   0     0     0  114     9   0   0 100
>  0 127  2  16124  10304  43004 682188   0   0     0     0  111     5   0   0 100
>  0 127  2  16124  10304  43004 682188   0   0     0     0  115     9   0   0 100
>  0 128  2  16124  10420  43004 682060   0   0     0     0  119    12   0   0 100
>  0 128  2  16124  10420  43004 682060   0   0     0     0  122    11   0   0 100
>
> Apart from these panics and hangups, the system also randomly issues
> segfaults to processes, or reports a kernel oops. These take the
> following form:

Hi Martin,

I can't help you much, but I believe your problem might be related to
faulty hardware. Have you checked whether the memory is OK?

Try disabling DMA on the Promise?


* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-08 17:03 ` Marcelo Tosatti
@ 2004-01-08 17:10   ` Marcelo Tosatti
  2004-01-08 17:37     ` Martin F Krafft
  2004-01-09 10:26     ` martin f krafft
  2004-01-08 17:37   ` Martin F Krafft
  1 sibling, 2 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2004-01-08 17:10 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: martin f krafft, linux kernel mailing list, Bartlomiej Zolnierkiewicz



On Thu, 8 Jan 2004, Marcelo Tosatti wrote:

> > Apart from these panics and hangups, the system also randomly issues
> > segfaults to processes, or reports a kernel oops. These take the
> > following form:
>
> Hi Martin,
>
> I can't help you much, but I believe your problem might be related to
> faulty hardware. Have you checked whether the memory is OK?
>
> Try disabling DMA on the Promise?

More information (/proc/mtrr, /proc/interrupts, dmesg, etc) is helpful.


* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-08 17:10   ` Marcelo Tosatti
@ 2004-01-08 17:37     ` Martin F Krafft
  2004-01-09 10:26     ` martin f krafft
  1 sibling, 0 replies; 13+ messages in thread
From: Martin F Krafft @ 2004-01-08 17:37 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux kernel mailing list, Bartlomiej Zolnierkiewicz

[-- Attachment #1: Type: text/plain, Size: 798 bytes --]

also sprach Marcelo Tosatti <marcelo.tosatti@cyclades.com> [2004.01.08.1810 +0100]:
> More information (/proc/mtrr, /proc/interrupts, dmesg, etc) is helpful.

During the lockup, or in general?

And dmesg... during the lockup is not possible. Do you simply want
the boot sequence?

-- 
Martin F. Krafft                Artificial Intelligence Laboratory
Ph.D. Student                   Department of Information Technology
Email: krafft@ailab.ch          University of Zurich
Tel: +41.(0)1.63-54323          Andreasstrasse 15, Office 2.20
http://ailab.ch/people/krafft   CH-8050 Zurich, Switzerland
 
Invalid/expired PGP subkeys? Use subkeys.pgp.net as keyserver!
 
"in just seven days, i can make you a man!"
                                      -- the rocky horror picture show

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-08 17:03 ` Marcelo Tosatti
  2004-01-08 17:10   ` Marcelo Tosatti
@ 2004-01-08 17:37   ` Martin F Krafft
  1 sibling, 0 replies; 13+ messages in thread
From: Martin F Krafft @ 2004-01-08 17:37 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux kernel mailing list, Bartlomiej Zolnierkiewicz

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

also sprach Marcelo Tosatti <marcelo.tosatti@cyclades.com> [2004.01.08.1803 +0100]:
> I can't help you much, but I believe your problem might be related to
> faulty hardware. Have you checked whether the memory is OK?

Memory and harddisks are fault-free (according to memtest86 and
badblocks).

> Try disabling DMA on the Promise?

I'll disable DMA altogether and see if I can reproduce the problem.

-- 
Martin F. Krafft                Artificial Intelligence Laboratory
Ph.D. Student                   Department of Information Technology
Email: krafft@ailab.ch          University of Zurich
Tel: +41.(0)1.63-54323          Andreasstrasse 15, Office 2.20
http://ailab.ch/people/krafft   CH-8050 Zurich, Switzerland
 
Invalid/expired PGP subkeys? Use subkeys.pgp.net as keyserver!
 
linux is like a wigwam.
no gates, no windoze, and an apache inside.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-08 17:10   ` Marcelo Tosatti
  2004-01-08 17:37     ` Martin F Krafft
@ 2004-01-09 10:26     ` martin f krafft
  1 sibling, 0 replies; 13+ messages in thread
From: martin f krafft @ 2004-01-09 10:26 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux kernel mailing list, Bartlomiej Zolnierkiewicz

[-- Attachment #1: Type: text/plain, Size: 1323 bytes --]

also sprach Marcelo Tosatti <marcelo.tosatti@cyclades.com> [2004.01.08.1810 +0100]:
> More information (/proc/mtrr, /proc/interrupts, dmesg, etc) is helpful.

It is currently running 2.6.1-rc3, but the problems exist for 2.4
and 2.6, although not as gravely for 2.6. I hope this information is
still enough, or do you need me to boot 2.4?

gaia:~# cat /proc/mtrr
reg00: base=0x00000000 (   0MB), size=1024MB: write-back, count=1
reg01: base=0xec000000 (3776MB), size=  64MB: write-combining, count=1
reg07: base=0xf0000000 (3840MB), size= 128MB: write-combining, count=1
gaia:~# cat /proc/interrupts
           CPU0       
  0:    2481339          XT-PIC  timer
  1:          8          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  5:     140986          XT-PIC  ide2, ide3
  8:          3          XT-PIC  rtc
 12:      70179          XT-PIC  aic7xxx, eth0
 14:     142086          XT-PIC  ide0
 15:     152040          XT-PIC  ide1
NMI:          0 
ERR:          0
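
One thing that can be read off the table above is which interrupt lines are shared between devices. The following Python sketch lists them; it assumes the single-CPU layout shown here (IRQ number, count, controller type, comma-separated device list), and the sample is a subset of the rows above. The function name `shared_irqs` is hypothetical. In this output, ide2 and ide3 (presumably the two Promise channels, given the hde/hdg device names) share IRQ 5. Whether that sharing has anything to do with the crashes is not established by this listing.

```python
# A subset of the /proc/interrupts rows shown above.
INTERRUPTS = """\
           CPU0
  0:    2481339          XT-PIC  timer
  5:     140986          XT-PIC  ide2, ide3
 12:      70179          XT-PIC  aic7xxx, eth0
 14:     142086          XT-PIC  ide0
 15:     152040          XT-PIC  ide1
NMI:          0
"""


def shared_irqs(text):
    """Map IRQ number -> device list for every IRQ used by more than one device."""
    shared = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # skips the CPU header and the NMI/ERR rows
        fields = rest.split(None, 2)  # count, controller type, device list
        if len(fields) < 3:
            continue
        devices = [d.strip() for d in fields[2].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


if __name__ == "__main__":
    for irq, devices in sorted(shared_irqs(INTERRUPTS).items()):
        print(f"IRQ {irq} is shared by: {', '.join(devices)}")
```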

And let me know what you want from dmesg. A bootlog?

-- 
martin;              (greetings from the heart of the sun.)
  \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck
 
invalid/expired pgp subkeys? use subkeys.pgp.net as keyserver!
 
weapon, n.:
  an index of the lack of development of a culture.

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]


* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-08 15:12 stability problems with 2.4.24/Software RAID/ext3 martin f krafft
  2004-01-08 17:03 ` Marcelo Tosatti
@ 2004-01-09 18:11 ` Martin Josefsson
  2004-01-09 18:53   ` martin f krafft
  1 sibling, 1 reply; 13+ messages in thread
From: Martin Josefsson @ 2004-01-09 18:11 UTC (permalink / raw)
  To: martin f krafft; +Cc: linux kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

On Thu, 2004-01-08 at 16:12, martin f krafft wrote:
> Hi all,
> 
> I operate a groupware server which is giving me a very hard time.
> It's an AMD Athlon XP 3000+ with 1 GB of RAM and four 7200 RPM IDE
> hard drives, two attached to the primary channels of the onboard
> controller, and two to the primary channels of a Promise 20269 EIDE
> controller. The kernel is a 2.4.24 with the configuration I placed
> here:

Try replacing the Promise controllers with something different (doesn't
really matter what). I've helped a friend with a server that hung all
the time; it had a few Promise controllers. After it had hung _lots_ of
times we came to the conclusion that we should try some other IDE
controllers (we had replaced everything else) and we borrowed a few
HighPoint controllers. Guess what, the machine is stable with these
controllers :)
I don't have any more data than this.
If you manage to get it stable with another controller, maybe you are
willing to try to find out what the possible problems with the
Promise driver (or hardware) are.

The machine in question had two pdc20268 and two pdc20269 controllers
(we tried combining them in all possible combinations; it hung
anyway).

So if you can, try some other controllers.

I personally have a pdc20267 in my workstation that I stress quite
heavily sometimes and I've never had any problems with it.

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-09 18:11 ` Martin Josefsson
@ 2004-01-09 18:53   ` martin f krafft
  2004-01-09 22:14     ` Christian Kivalo
  2004-01-10 12:06     ` Marcelo Tosatti
  0 siblings, 2 replies; 13+ messages in thread
From: martin f krafft @ 2004-01-09 18:53 UTC (permalink / raw)
  To: linux kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 1120 bytes --]

also sprach Martin Josefsson <gandalf@wlug.westbo.se> [2004.01.09.1911 +0100]:
> Try replacing the Promise controllers with something different (doesn't
> really matter what).

Well, I can't find any other suitable ones, really. I can't seem to
find HighPoints, there is 3ware and DawiControl, but I don't know
which ones are supported by Linux.

Maybe someone can give me a suggestion for a non-promise EIDE 133
PCI controller that's natively supported by Linux.

> I personally have a pdc20267 in my workstation that I stress quite
> heavily sometimes and I've never had any problems with it.

That's a different driver, so it might be the driver that's causing
the problems. If I replace the controller, I may be able to debug,
but unless I get a new controller in place, I can't do anything,
since this is a production machine.

thanks,

-- 
martin;              (greetings from the heart of the sun.)
  \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck
 
invalid/expired pgp subkeys? use subkeys.pgp.net as keyserver!
 
micro$oft could shit in a box, and most people would buy it.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* RE: stability problems with 2.4.24/Software RAID/ext3
  2004-01-09 18:53   ` martin f krafft
@ 2004-01-09 22:14     ` Christian Kivalo
  2004-01-10 12:06     ` Marcelo Tosatti
  1 sibling, 0 replies; 13+ messages in thread
From: Christian Kivalo @ 2004-01-09 22:14 UTC (permalink / raw)
  To: linux-kernel

On Friday, January 09, 2004 7:54 PM, martin f krafft wrote:
> Well, I can't find any other suitable ones, really. I can't seem to
> find HighPoints, there is 3ware and DawiControl, but I don't know
> which ones are supported by Linux.
>
> Maybe someone can give me a suggestion for a non-promise EIDE 133
> PCI controller that's natively supported by Linux.

Hi!

3ware cards are hardware RAID controllers; they are supported.

I can get a DawiControl card here in Austria with a Silicon Image 680
chip on it. I use 3 cards with the sil680 chip (because these are not as
expensive as the 3ware cards) with linux-2.4.23 and have connected 6 disks
as masters holding a RAID 5 array. I haven't had any problems so far (I have
had this setup for ~2 months now).


Christian

(sorry for broken mua, am currently forced to use this)



* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-09 18:53   ` martin f krafft
  2004-01-09 22:14     ` Christian Kivalo
@ 2004-01-10 12:06     ` Marcelo Tosatti
  2004-01-10 17:06       ` martin f krafft
  1 sibling, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2004-01-10 12:06 UTC (permalink / raw)
  To: martin f krafft; +Cc: linux-kernel



On Fri, 9 Jan 2004, martin f krafft wrote:

> also sprach Martin Josefsson <gandalf@wlug.westbo.se> [2004.01.09.1911 +0100]:
> > Try replacing the Promise controllers with something different (doesn't
> > really matter what).
>
> Well, I can't find any other suitable ones, really. I can't seem to
> find HighPoints, there is 3ware and DawiControl, but I don't know
> which ones are supported by Linux.
>
> Maybe someone can give me a suggestion for a non-promise EIDE 133
> PCI controller that's natively supported by Linux.
>
> > I personally have a pdc20267 in my workstation that I stress quite
> > heavily sometimes and I've never had any problems with it.
>
> That's a different driver, so it might be the driver that's causing
> the problems. If I replace the controller, I may be able to debug,
> but unless I get a new controller in place, I can't do anything,
> since this is a production machine.

Did you ever try to disable the DMA as suggested?


* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-10 12:06     ` Marcelo Tosatti
@ 2004-01-10 17:06       ` martin f krafft
  0 siblings, 0 replies; 13+ messages in thread
From: martin f krafft @ 2004-01-10 17:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1422 bytes --]

also sprach Marcelo Tosatti <marcelo.tosatti@cyclades.com> [2004.01.10.1306 +0100]:
> Did you ever try to disable the DMA as suggested?

I am sorry, Marcelo, that it took me so long.

In fact, I tried disabling the DMA and I could *not* get it to crash
without DMA. It also did not crash with DMA on for the onboard (VIA)
controller and off for the Promise. But when I turned DMA back on
for the Promise, it crashed again.

Martin Josefsson has suggested that the Promise controller may be
defective, and it certainly looks like that. I am now trying
a different Promise controller (20376 rather than the 20269, but
same driver), but it also crashes.

Thus, it looks like it's a problem with the driver, doesn't it? Or
with either of the two disks. I will run badblocks over them on
a known-to-be-good controller when I get a chance.

If it's a problem with the driver, then I would be happy to help,
but I know nothing about these layers of the computer. I would,
however, give the controller away to someone eager to debug the
driver (provided the university will let me)!

Cheers,

-- 
martin;              (greetings from the heart of the sun.)
  \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck
 
invalid/expired pgp subkeys? use subkeys.pgp.net as keyserver!
 
a qui sait comprendre, peu de mots suffisent.
                                                 -- intelligenti pauca

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: stability problems with 2.4.24/Software RAID/ext3
  2004-01-08 16:02 Cress, Andrew R
@ 2004-01-08 17:05 ` martin f krafft
  0 siblings, 0 replies; 13+ messages in thread
From: martin f krafft @ 2004-01-08 17:05 UTC (permalink / raw)
  To: linux kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 673 bytes --]

also sprach Cress, Andrew R <andrew.r.cress@intel.com> [2004.01.08.1702 +0100]:
> https://listman.redhat.com/archives/ext3-users/2002-December/msg00125.html
> You may want to check it out to see if this fix is already included in
> your 2.4.24 kernel.   

These are both already in the vanilla 2.4.24 kernel.

Thanks though.

-- 
martin;              (greetings from the heart of the sun.)
  \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck
 
invalid/expired pgp subkeys? use subkeys.pgp.net as keyserver!
 
"the vast majority of our imports come from outside the country."  
                                                      - george w. bush 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* RE: stability problems with 2.4.24/Software RAID/ext3
@ 2004-01-08 16:02 Cress, Andrew R
  2004-01-08 17:05 ` martin f krafft
  0 siblings, 1 reply; 13+ messages in thread
From: Cress, Andrew R @ 2004-01-08 16:02 UTC (permalink / raw)
  To: martin f krafft, linux kernel mailing list

Martin,

I've seen some issues with jbd/transaction.c in 2.4.20+ that look
similar to one of your panics.  There was a fix by Red Hat for the problem
I saw.

https://listman.redhat.com/archives/ext3-users/2002-December/msg00125.html
You may want to check it out to see if this fix is already included in
your 2.4.24 kernel.

Andy

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of martin f krafft
Sent: Thursday, January 08, 2004 10:12 AM
To: linux kernel mailing list
Subject: stability problems with 2.4.24/Software RAID/ext3


Hi all,

I operate a groupware server which is giving me a very hard time.
It's an AMD Athlon XP 3000+ with 1 GB of RAM and four 7200 RPM IDE
hard drives, two attached to the primary channels of the onboard
controller, and two to the primary channels of a Promise 20269 EIDE
controller. The kernel is a 2.4.24 with the configuration I placed
here:

  ftp://ftp.madduck.net/scratch/config-2.4.24-gaia.gz 

The system is configured with seven software RAID arrays and three swap partitions:

  md1:  /boot      (ext3) RAID 1 spanning hda1 and hdc1
  md5:  /          (ext3) RAID 5 hda5/hdc5/hde5 with hdg5 as a spare
  md6:  /usr       (ext3) RAID 5 hda6/hdc6/hde6 with hdg6 as a spare
  md7:  /var       (ext3) RAID 5 hda7/hdc7/hde7 with hdg7 as a spare
  md8:  /usr/local (ext3) RAID 5 hda8/hdc8/hde8 with hdg8 as a spare
  md9:  /home      (ext3) RAID 5 hda9/hdc9/hde9 with hdg9 as a spare
  md10: /tmp       (ext3) RAID 5 hda10/hdc10/hde10 with hdg10 as a spare

  hda2 holds a non-RAID rescue system with RAID 1/5 support

  hdc2, hde2, hdg2 are swap partitions of 256 MB each.

  hde1 and hdg1 are unused.

The individual hard disks are identically tweaked with hdparm:

  hdparm -A1 -B255 -c1 -d1 -p -u0 -W0 -Xudma6 /dev/hd{a,c,e,g}

See the end of this mail for details.

The system experiences severe stability problems, which I relate to
the filesystem, RAID, or controller code, because it's reproducible
with excessive disk operations. E.g., doing something like

  rsync -a --exclude /tmp / /tmp/dump

will most likely crash the system with a kernel oops. This kernel
oops is not recorded in the log, but I took it down as follows:
  
  kernel: Unable to handle kernel paging request at virtual address 00529610
  kernel:  printing eip:
  kernel: c01c7f41
  kernel: *pde = 00000000
  kernel: Oops: 0002
  kernel: CPU:    0
  kernel: EIP:    0010:[__remove_inode_queue+17/48]    Not tainted
  kernel: EFLAGS: 00010202
  kernel: eax: cef76320   ebx: cc529590   ecx: 00529610   edx : cc529540
  kernel: esi: cc529540   edi: c1e59510   ebp: cc4e7cc0   esp : f3a55e54
  kernel: ds: 0018   es: 0018   ss: 0018
  kernel: Process kjournald (pid: 24176, stackpage=f3a55000)
  kernel: Stack: 00000000 c01c862a cc529540 c02029d8 cc529540 c1e59ea0 c01fec42 cc529540
  kernel:        00000040 f3a55ea4 00000d0d f7ee8280 f6965d34 00000000 00000000 00000000
  kernel:        0000000f cb3b3840 e6e308a0 00000d0d cc0ec9c0 cc0eca40 cc0ec0c0 cc149bc0
  kernel: Call Trace:    [__refile_buffer+106/112] [journal_free_journal_head+24/32] [journal_commit_transaction+4066/4352] [kjournald+263/464] [commit_timeout+0/16]
  kernel:   [arch_kernel_thread+43/64] [kjournald+0/464]
  kernel: 
  kernel: Code: 89 01 c7 43 04 00 00 00 00 c7 42 50 00 00 00 00 b8 09 00 00

  kernel:  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
  kernel:  printing eip:
  kernel: c01be950
  kernel: *pde = 00000000
  kernel: Oops: 0000
  kernel: CPU:    0
  kernel: EIP:    0010:[kmem_cache_reap+128/448]    Not tainted
  kernel: EFLAGS: 00010013
  kernel: eax: 00000000   ebx: 00000001   ecx: c1c0d348   edx : c1c0d358
  kernel: esi: 00000000   edi: 00000005   ebp: 00000000   esp : c1c33f38
  kernel: ds: 0018   es: 0018   ss: 0018
  kernel: Process kswapd (pid: 4, stackpage=c1c33000)
  kernel: Stack: c1240260 000001d0 c1c0d348 00000000 00000000 00000000 00000020 000001d0
  kernel:        c0102aa0 00000006 c01bf646 00000006 c0102aa0 c0102aa0 000001d0 00000006
  kernel:        c0102aa0 00000000 c01bf706 00000020 c0102aa0 c1c32000 c0102940 c01bf824
  kernel: Call Trace:    [shrink_caches+38/176] [try_to_free_pages_zone+54/96] [kswapd_balance_pgdat+84/160] [kswapd_balance+25/48] [kswapd+141/176]

Since the two crashes occurred in kswapd and kjournald, I would
assume it's the underlying RAID code that's problematic. However,
maybe you can extract more information from the above crashes.
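
As a reading aid for the two traces above: on i386, the number after
"Oops:" is the page-fault error code, and its low three bits say what
kind of access faulted. The following is a minimal Python sketch,
assuming the standard i386 bit layout (bit 0: protection fault vs. page
not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The
function name is made up for illustration and is not part of any kernel
tool.

```python
def decode_oops_error_code(code):
    """Decode the low three bits of an i386 page-fault error code.

    Assumed layout (i386): bit 0 = protection fault (else page not
    present), bit 1 = write (else read), bit 2 = user mode (else kernel).
    """
    return {
        "cause": "protection fault" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


# The two values reported in the traces above, given as hex strings.
for raw in ("0002", "0000"):
    print(raw, decode_oops_error_code(int(raw, 16)))
```

Read this way, the kjournald oops (0002) is a kernel-mode write to an
unmapped address, and the kswapd oops (0000) is a kernel-mode read from
an unmapped address. The "invalid operand: 0000" line of a BUG report
is not a page fault and is not covered by this decoding.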

The following is a snapshot from `vmstat 1` prior to a regular
kernel panic, which resulted in a reboot (thanks to sys.kernel.panic
== 60):

 1  1  2  10184  12344  47020 749912   0   0     0  4344  382   308   0   1  99
 0  1  1  10184  12344  47020 749912   0   0     0  5936  395   334   0   2  98
 0  1  1  10184  12332  47020 749916   0   0     4  4808  379   330   0   3  97
 0  1  1  10184  12332  47020 749916   0   0     0  5008  342   277   1   0  99
 0  1  2  10184  12328  47024 749916   0   0     0  5120  330   293   0   4  96
 0  3  2  10184  12356  47040 750108   0   0    64  4772  367   360   0   3  97
 0  1  1  10184  12460  47052 749704   0   0  1220  6236  352   390   1   4  95
 0  1  1  10184  12044  47052 750096   0   0  2176  6772  371   497   6   5  89
 0  1  1  10184  12388  47060 749704   0   0   324  7732  367   376   0   6  94
 0  1  2  10184  12512  47068 749824   0   0    56  7448  365   312   0   1  99
 0  1  1  10184  12832  47080 749444   0   0   424  6648  368   363   0   3  97
 0  1  1  10184  11884  47092 750156   0   0  2304  7960  416   504   1   8  91
 2  0  1  10184  12708  47100 749284   0   0  1772  6836  370   462   5   4  91

Interestingly, just now, the machine crashed differently. `vmstat 1`
was still running, but new processes could not be started, after the
kernel reported a lot of oopses in user-space processes (e.g. rsync,
top, zsh), as well as some of the kjournald oopses like above.
I have included the footprint of the user-space program oopses
further down. `vmstat 1` was happily printing the following away,
when the system was already unusable. The b value of 127 and above is
interesting: it had been rising steadily (well, never decreasing)
after a certain point, and somewhere along the way the system reached
a state of agnosia.

 0 127  2  16124  10304  43004 682188   0   0     0     0  109     7   0   0 100
 0 127  2  16124  10304  43004 682188   0   0     0     0  111     5   0   0 100
 0 127  2  16124  10304  43004 682188   0   0     0     0  114     9   0   0 100
 0 127  2  16124  10304  43004 682188   0   0     0     0  111     5   0   0 100
 0 127  2  16124  10304  43004 682188   0   0     0     0  115     9   0   0 100
 0 128  2  16124  10420  43004 682060   0   0     0     0  119    12   0   0 100
 0 128  2  16124  10420  43004 682060   0   0     0     0  122    11   0   0 100
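
To follow the b column over time, the vmstat rows can be parsed
mechanically. The following is a minimal Python sketch. The column
names are an assumption: no header line was captured in this mail, so
the sketch uses the 16-column layout that procps vmstat of this era
normally prints (r b w swpd free buff cache si so bi bo in cs us sy id).
It splits on whitespace only, so it also tolerates rows that a mail
client has wrapped. The function name is hypothetical.

```python
# Assumed column order; the header line itself is not part of the mail.
VMSTAT_FIELDS = ["r", "b", "w", "swpd", "free", "buff", "cache",
                 "si", "so", "bi", "bo", "in", "cs", "us", "sy", "id"]


def parse_vmstat(text):
    """Return one dict per vmstat sample, ignoring how lines are wrapped."""
    tokens = text.split()
    width = len(VMSTAT_FIELDS)
    if len(tokens) % width != 0:
        raise ValueError("token count is not a multiple of %d" % width)
    rows = []
    for start in range(0, len(tokens), width):
        values = [int(tok) for tok in tokens[start:start + width]]
        rows.append(dict(zip(VMSTAT_FIELDS, values)))
    return rows


# Two samples taken from the hung state reported above.
sample = """
 0 127  2  16124  10304  43004 682188   0   0     0     0  109     7   0
0 100
 0 128  2  16124  10420  43004 682060   0   0     0     0  119    12   0
0 100
"""
rows = parse_vmstat(sample)
print([row["b"] for row in rows])    # blocked-process counts per sample
print([row["bo"] for row in rows])   # blocks written out per sample
```

With the rows in this form it is easy to see that b stays at 127 or
above while bi and bo are both zero, i.e. many processes are waiting
and no disk I/O is completing.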

Apart from these panics and hangups, the system also randomly issues
segfaults to processes, or reports a kernel oops. These take the
following form:

  kernel: kernel BUG at mmap.c:842!
  kernel: invalid operand: 0000
  kernel: CPU:    0
  kernel: EIP:    0010:[find_vma_prev+124/176]    Not tainted
  kernel: EFLAGS: 00010206
  kernel: eax: c7ce4dc0   ebx: c7ce4e40   ecx: c7ce4dd8   edx: c95fde90
  kernel: esi: 4e968000   edi: c7ce4658   ebp: d16b8ec0   esp: c95fde50
  kernel: ds: 0018   es: 0018   ss: 0018
  kernel: Process python2.1 (pid: 24868, stackpage=c95fd000)
  kernel: Stack: c7ce4e40 4e968000 00001000 d16b8ec0 c01b7104 d16b8ec0 4e968000 c95fde90
  kernel:        c01d116d e70b82c0 4e93d000 00001000 c01b6a44 d16b8ec0 4e93d000 e70b82c0
  kernel:        c7ce4dc0 c7ce4e40 00000000 4e968000 00001000 c01b6550 d16b8ec0 4e968000
  kernel: Call Trace:    [do_munmap+132/432] [link_path_walk+1309/1776] [get_unmapped_area+164/320] [do_mmap_pgoff+400/1504] [old_mmap+269/336]
  kernel:   [system_call+51/80] [sys_fstat64+73/128] [system_call+77/80]
  kernel: 
  kernel: Code: 0f 0b 4a 03 80 86 34 c0 89 d8 5b 5e 5f 5d c3 39 5d 00 eb ea

or:

  kernel: Unable to handle kernel paging request at virtual address 712e746b
  kernel:  printing eip:
  kernel: c01eb950
  kernel: *pde = 00000000
  kernel: Oops: 0000
  kernel: CPU:    0
  kernel: EIP:    0010:[proc_pid_stat+144/800]    Not tainted
  kernel: EFLAGS: 00010206
  kernel: eax: dd95e5ad   ebx: d0988500   ecx: d098851c   edx: 712e7463
  kernel: esi: f5138000   edi: d5ce25ad   ebp: 000003ff   esp: f3a9de14
  kernel: ds: 0018   es: 0018   ss: 0018
  kernel: Process top (pid: 26768, stackpage=f3a9d000)
  kernel: Stack: c01e9eb9 f5138000 c0361e64 cbc4f1c0 cbc4f1c0 c01ea17e e70b8c40 cbc4f1c0
  kernel:        0000000b 00000004 f5138000 ffffffea fffffff4 cbc4f82c cbc4f7c0 e70b8c40
  kernel:        c01d0b03 cbc4f7c0 e70b8c40 e70b8c40 e5ac300e fffffffe f3a9df0c c01d116d
  kernel: Call Trace:    [proc_pid_make_inode+121/160] [proc_base_lookup+254/560] [real_lookup+195/240] [link_path_walk+1309/1776] [get_empty_filp+77/288]
  kernel:   [proc_info_read+87/272] [filp_open+98/112] [sys_read+163/304] [system_call+51/80] [sys_close+78/96] [system_call+77/80]
  kernel: 
  kernel: Code: 8b 42 08 2b 42 04 8b 52 0c 01 c7 85 d2 75 f1 ba ff ff ff ff
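
When comparing several oopses like the ones above, it can help to pull
the bracketed symbol entries out of the EIP and Call Trace lines. The
following is a minimal Python sketch, assuming the "[symbol+offset/size]"
notation seen in this mail with both numbers in decimal. The function
name is hypothetical, and the regular expression is written against
these samples only.

```python
import re

# Matches entries such as [link_path_walk+1309/1776].
SYMBOL_RE = re.compile(r"\[([A-Za-z_]\w*)\+(\d+)/(\d+)\]")


def parse_call_trace(text):
    """Return (symbol, offset, size) tuples in order of appearance."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in SYMBOL_RE.finditer(text)]


# Lines copied from the python2.1 oops above; joining wrapped lines is
# not required because the pattern never spans whitespace.
sample = """
kernel: EIP:    0010:[find_vma_prev+124/176]    Not tainted
kernel: Call Trace:    [do_munmap+132/432] [link_path_walk+1309/1776]
[get_unmapped_area+164/320] [do_mmap_pgoff+400/1504] [old_mmap+269/336]
"""
for symbol, offset, size in parse_call_trace(sample):
    print("%-20s +%d of %d bytes" % (symbol, offset, size))
```

The first tuple is the faulting function from the EIP line; the rest
are the call-trace entries in the order the kernel printed them.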

Thanks for any hints or pointers!

hdparm configuration:

  multcount    = 16 (on)
  I/O support  =  1 (32-bit)
  unmaskirq    =  0 (off)
  using_dma    =  1 (on)
  keepsettings =  0 (off)
  nowerr       =  0 (off)
  readonly     =  0 (off)
  readahead    =  6 (on)
  geometry     = 238216/16/63, sectors = 240121728, start = 0
  busstate     =  1 (on)
  Model=Maxtor 6Y120L0, FwRev=YAR41BW0, SerialNo=Y31GHARE
  Config={ Fixed }
  RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
  BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
  CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=240121728
  IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
  PIO modes: pio0 pio1 pio2 pio3 pio4 
  DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
  AdvancedPM=yes: disabled (255) WriteCache=enabled
  Drive Supports : ataATA-1 ATA-2 ATA-3 ATA-4 ATA-5 ATA-6 ATA-7 
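
As a sanity check on the figures above, the reported geometry and the
LBA sector count can be compared, and the sector count converted to a
capacity. The following is a minimal Python sketch. It assumes 512-byte
sectors, which the output does not state (it reports SectSize=0), and
the variable names are made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: not stated in the hdparm output

# Values copied from the hdparm output above.
cylinders, heads, sectors_per_track = 238216, 16, 63
lba_sectors = 240121728

# The geometry line and the LBA sector count should describe the same disk.
geometry_sectors = cylinders * heads * sectors_per_track
print("geometry matches LBAsects:", geometry_sectors == lba_sectors)

capacity_bytes = lba_sectors * SECTOR_BYTES
print("capacity: %d bytes (about %.1f GB, decimal)"
      % (capacity_bytes, capacity_bytes / 1e9))
```

Under that assumption the two figures agree, and the capacity is in
line with a drive sold as 120 GB.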

-- 
martin;              (greetings from the heart of the sun.)
  \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck
 
invalid/expired pgp subkeys? use subkeys.pgp.net as keyserver!
 
"the vast majority of our imports come from outside the country."  
                                                      - george w. bush 
