* 2.6.20: reproducible hard lockup with RAID-5 resync
@ 2007-02-16  6:47 Corey Hickey
  2007-02-16  7:59 ` Neil Brown
  2007-02-18  9:51 ` Corey Hickey
  0 siblings, 2 replies; 11+ messages in thread
From: Corey Hickey @ 2007-02-16  6:47 UTC (permalink / raw)
  To: linux-raid

I think I have found an easily-reproducible bug in Linux 2.6.20. I have
already applied the "Fix various bugs with aligned reads in RAID5"
patch, and that had no effect. It appears to be related to the resync
process, and makes the system lock up, hard.

The steps to reproduce are:
1. Be running Linux 2.6.20 and do whatever is necessary to prepare for a
crash (close open files, sync, unmount filesystems, or whatever).
Alternatively, just boot with 'init=/bin/bash'.
2. Run 'mdadm -S /dev/md2', where /dev/md2 is a RAID-5.
3. Run 'mdadm -A /dev/md2 -U resync'.
4. Wait about 1 second. The system will lock up.
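
The steps above can be sketched as a small shell script. The function name
reproduce_lockup and the DRY_RUN switch are illustrative assumptions, not
mdadm features. By default the script only prints the commands, because
running them for real is expected to hang an affected machine.

```shell
#!/bin/sh
# Sketch of the reproduction steps. DRY_RUN and reproduce_lockup are
# hypothetical names; set DRY_RUN=0 only on a machine prepared for a crash.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reproduce_lockup() {
    md=${1:-/dev/md2}
    run sync                       # step 1: flush dirty data first
    run mdadm -S "$md"             # step 2: stop the array
    run mdadm -A "$md" -U resync   # step 3: reassemble and force a resync
}

reproduce_lockup /dev/md2
```

Booting with 'init=/bin/bash' as in step 1 keeps the amount of state that
can be lost to a minimum before setting DRY_RUN=0.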

During the lock up, nothing is printed to the console, and the magic
SysRQ key has no effect; I have to poke the reset button. Normally, I
wouldn't rule out a hardware problem, but I have reasonable faith in my
computer. Neither memtest86+ nor cpuburn nor normal operation has
flushed out any instability.

Upon reboot, 2.6.20 will lock up almost immediately when it tries to
resync the array. This appears to occur regardless of whether the resync
is just starting; if I run 2.6.19 for a while until the resync is, say,
50% done and then reboot to 2.6.20, the lockup still happens.

I have provided what I hope is enough information below.

--------------------------------------------------------------------
System information:
Athlon64 3400+
64-bit Linux 2.6.20 compiled with GCC 4.1.2
64-bit Debian Sid
RAID-5 of 5 devices:
   /dev/hda   (IDE hard drive)
   /dev/sda6  (partition on SATA hard drive)
   /dev/sdb   (SATA hard drive)
   /dev/sdc6  (partition on SATA hard drive)
   /dev/sdd   (SATA hard drive)

--------------------------------------------------------------------
bugfood:~# mdadm -D /dev/md2
/dev/md2:
        Version : 00.90.03
  Creation Time : Mon May 29 22:13:47 2006
     Raid Level : raid5
     Array Size : 781433344 (745.23 GiB 800.19 GB)
    Device Size : 195358336 (186.31 GiB 200.05 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Thu Feb 15 22:07:26 2007
          State : active, resyncing
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 26% complete

           UUID : d016a205:bd3106ef:b19cb15b:b6d70494
         Events : 0.3971003

    Number   Major   Minor   RaidDevice State
       0       8        6        0      active sync   /dev/sda6
       1       8       38        1      active sync   /dev/sdc6
       2       3        0        2      active sync   /dev/hda
       3       8       16        3      active sync   /dev/sdb
       4       8       48        4      active sync   /dev/sdd

--------------------------------------------------------------------
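
As a sanity check on the output above: RAID-5 spends one device's worth of
space on parity, so the usable capacity should be
(Raid Devices - 1) x Device Size. A quick sketch in shell arithmetic
(raid5_capacity is a made-up helper name; the sizes are in KiB, as mdadm
prints them):

```shell
# raid5_capacity DEVICES DEVICE_SIZE
# Hypothetical helper: usable RAID-5 capacity, in the same unit as
# DEVICE_SIZE. One device's worth of space holds parity.
raid5_capacity() {
    echo $(( ($1 - 1) * $2 ))
}

raid5_capacity 5 195358336   # prints 781433344, the reported Array Size
```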

Thank you,
Corey


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-16  6:47 2.6.20: reproducible hard lockup with RAID-5 resync Corey Hickey
@ 2007-02-16  7:59 ` Neil Brown
  2007-02-16  8:11   ` Corey Hickey
  2007-02-16 21:23   ` Corey Hickey
  2007-02-18  9:51 ` Corey Hickey
  1 sibling, 2 replies; 11+ messages in thread
From: Neil Brown @ 2007-02-16  7:59 UTC (permalink / raw)
  To: Corey Hickey; +Cc: linux-raid

On Thursday February 15, bugfood-ml@fatooh.org wrote:
> I think I have found an easily-reproducible bug in Linux 2.6.20. I have
> already applied the "Fix various bugs with aligned reads in RAID5"
> patch, and that had no effect. It appears to be related to the resync
> process, and makes the system lock up, hard.

I'm guessing that the problem is at a lower level than raid.
What IDE/SATA controllers do you have?  Google to see if anyone else
has had problems with them in 2.6.20.
> 
> During the lock up, nothing is printed to the console, and the magic
> SysRQ key has no effect; I have to poke the reset button.

Sounds like interrupts are disabled, but x86_64 always enables the
NMI watchdog which should trigger if interrupts are off for too long. 

Do you have CONFIG_DETECT_SOFTLOCKUP=y in your .config (it is in the
kernel debugging options menu I think).  If not, setting that would be
worth a try.
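
One way to check an option like this without opening menuconfig is to grep
the .config directly. The following is only a sketch; config_state is a
hypothetical helper, and the default path of .config in the current
directory is an assumption about where the kernel tree lives.

```shell
# config_state OPTION [FILE]
# Hypothetical helper: prints the option's value (y, m, ...), "not set"
# if it is explicitly disabled, or "absent" if the file never mentions it.
config_state() {
    opt=$1
    file=${2:-.config}
    [ -r "$file" ] || { echo "unreadable"; return 1; }
    if grep -q "^${opt}=" "$file"; then
        grep "^${opt}=" "$file" | head -n 1 | cut -d= -f2
    elif grep -q "^# ${opt} is not set" "$file"; then
        echo "not set"
    else
        echo "absent"
    fi
}

# Example call (the path is a placeholder):
# config_state CONFIG_DETECT_SOFTLOCKUP /usr/src/linux/.config
```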

A raid5 resync across 5 sata drives on a couple of different
silicon-image controllers doesn't lock up for me.

NeilBrown


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-16  7:59 ` Neil Brown
@ 2007-02-16  8:11   ` Corey Hickey
  2007-02-16 21:23   ` Corey Hickey
  1 sibling, 0 replies; 11+ messages in thread
From: Corey Hickey @ 2007-02-16  8:11 UTC (permalink / raw)
  To: linux-raid

Neil Brown wrote:
> On Thursday February 15, bugfood-ml@fatooh.org wrote:
>> I think I have found an easily-reproducible bug in Linux 2.6.20. I have
>> already applied the "Fix various bugs with aligned reads in RAID5"
>> patch, and that had no effect. It appears to be related to the resync
>> process, and makes the system lock up, hard.
> 
> I'm guessing that the problem is at a lower level than raid.
> What IDE/SATA controllers do you have?  Google to see if anyone else
> has had problems with them in 2.6.20.
>> During the lock up, nothing is printed to the console, and the magic
>> SysRQ key has no effect; I have to poke the reset button.
> 
> Sounds like interrupts are disabled, but x86_64 always enables the
> NMI watchdog which should trigger if interrupts are off for too long. 
> 
> Do you have CONFIG_DETECT_SOFTLOCKUP=y in your .config (it is in the
> kernel debugging options menu I think).  If not, setting that would be
> worth a try.
> 
> A raid5 resync across 5 sata drives on a couple of different
> silicon-image controllers doesn't lock up for me.

Wow, thanks for the quick response. I have to go to bed now, but I'll
try to get you that information tomorrow.

Thanks,
Corey


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-16  7:59 ` Neil Brown
  2007-02-16  8:11   ` Corey Hickey
@ 2007-02-16 21:23   ` Corey Hickey
  2007-02-17 10:58     ` Corey Hickey
  1 sibling, 1 reply; 11+ messages in thread
From: Corey Hickey @ 2007-02-16 21:23 UTC (permalink / raw)
  To: linux-raid

Neil Brown wrote:
> On Thursday February 15, bugfood-ml@fatooh.org wrote:
>> I think I have found an easily-reproducible bug in Linux 2.6.20. I have
>> already applied the "Fix various bugs with aligned reads in RAID5"
>> patch, and that had no effect. It appears to be related to the resync
>> process, and makes the system lock up, hard.
> 
> I'm guessing that the problem is at a lower level than raid.
> What IDE/SATA controllers do you have?  Google to see if anyone else
> has had problems with them in 2.6.20.

I have an nForce3 motherboard. lspci calls my IDE:
nVidia Corporation CK8S Parallel ATA Controller (v2.5) (rev a2)
...and my SATA:
nVidia Corporation CK8S Serial ATA Controller (v2.5) (rev a2)

I'm using libata for my SATA drives and the old IDE driver for my IDE 
drive. For reference, I have uploaded my kernel configuration and the 
output of lspci:
http://fatooh.org/files/tmp/config-2.6.20
http://fatooh.org/files/tmp/lspci-v

Anyway, I googled a bit, and I also looked through the recent threads in 
the linux-kernel archives, but I haven't found anything. I don't follow 
kernel development closely, though, so it's quite possible I missed 
something.

When I get home (late) tonight I'll try running dd and badblocks on the 
corresponding drives and partitions.

>> During the lock up, nothing is printed to the console, and the magic
>> SysRQ key has no effect; I have to poke the reset button.
> 
> Sounds like interrupts are disabled, but x86_64 always enables the
> NMI watchdog which should trigger if interrupts are off for too long. 

How long is "too long"? I waited a few minutes, at least, on the first 
few tries.

> Do you have CONFIG_DETECT_SOFTLOCKUP=y in your .config (it is in the
> kernel debugging options menu I think).  If not, setting that would be
> worth a try.

I do indeed have CONFIG_DETECT_SOFTLOCKUP enabled. The Kconfig 
description says it should detect lockups > 10 seconds; I've waited 
longer than that many times.

> A raid5 resync across 5 sata drives on a couple of different
> silicon-image controllers doesn't lock up for me.

Heck. ;)  Would it by any chance make a difference that I'm running 
RAID-5 across a mixture of drives and partitions?

Thanks again,
Corey


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-16 21:23   ` Corey Hickey
@ 2007-02-17 10:58     ` Corey Hickey
  2007-02-17 11:14       ` Justin Piszcz
  0 siblings, 1 reply; 11+ messages in thread
From: Corey Hickey @ 2007-02-17 10:58 UTC (permalink / raw)
  To: linux-raid

Corey Hickey wrote:
> When I get home (late) tonight I'll try running dd and badblocks on the 
> corresponding drives and partitions.

Well, I haven't been able to reproduce the problem that way. I tried the
following:

$ dd if=/dev/hda of=/dev/null
$ badblocks /dev/hda
$ badblocks -n /dev/hda

...and the same for sda6, sdb, sdc6, sdd, and md2. In each case I killed
the test after several seconds, on the assumption that if the problem
was reproducible within less than a second by triggering a resync, it
wouldn't take long any other way.
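
The per-device tests described above can be sketched as a loop. This is an
illustration, not the exact procedure used: read_test_devices and DRY_RUN
are hypothetical names, and timeout (from GNU coreutils) stands in for
killing each test by hand after several seconds. By default the loop only
prints what it would run.

```shell
#!/bin/sh
# Sketch: run the three tests against each device for a few seconds.
# DRY_RUN and read_test_devices are hypothetical names.
DRY_RUN=${DRY_RUN:-1}

# read_test_devices SECONDS DEVICE...
read_test_devices() {
    secs=$1
    shift
    for dev in "$@"; do
        # badblocks -n selects its non-destructive read-write mode
        for cmd in "dd if=$dev of=/dev/null" \
                   "badblocks $dev" \
                   "badblocks -n $dev"; do
            if [ "$DRY_RUN" = "1" ]; then
                echo "would run for ${secs}s: $cmd"
            else
                # unquoted on purpose so $cmd splits into words
                timeout "$secs" $cmd
            fi
        done
    done
}

read_test_devices 10 /dev/hda /dev/sda6 /dev/sdb /dev/sdc6 /dev/sdd /dev/md2
```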

If anyone has any suggestions for further tests I can do, I'll be happy
to try them out.

Thanks,
Corey


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-17 10:58     ` Corey Hickey
@ 2007-02-17 11:14       ` Justin Piszcz
  2007-02-17 19:19         ` Corey Hickey
  0 siblings, 1 reply; 11+ messages in thread
From: Justin Piszcz @ 2007-02-17 11:14 UTC (permalink / raw)
  To: Corey Hickey; +Cc: linux-raid



On Sat, 17 Feb 2007, Corey Hickey wrote:

> Corey Hickey wrote:
>> When I get home (late) tonight I'll try running dd and badblocks on the
>> corresponding drives and partitions.
>
> Well, I haven't been able to reproduce the problem that way. I tried the
> following:
>
> $ dd if=/dev/hda of=/dev/null
> $ badblocks /dev/hda
> $ badblocks -n /dev/hda
>
> ...and the same for sda6, sdb, sdc6, sdd, and md2. In each case I killed
> the test after several seconds, on the assumption that if the problem
> was reproducible within less than a second by triggering a resync, it
> wouldn't take long any other way.
>
> If anyone has any suggestions for further tests I can do, I'll be happy
> to try them out.
>
> Thanks,
> Corey

I noticed you are running a disk on /dev/hda. If this is an older mobo and
you use multiple PCI IDE controllers, they can cause these problems when
they are in certain slots or when you put too many in. I have had a few
mobos that do it if the PCI cards are in a 'certain combination'; I have
had this happen with an MSI and an ABIT board.

Justin.


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-17 11:14       ` Justin Piszcz
@ 2007-02-17 19:19         ` Corey Hickey
  0 siblings, 0 replies; 11+ messages in thread
From: Corey Hickey @ 2007-02-17 19:19 UTC (permalink / raw)
  To: linux-raid

Justin Piszcz wrote:
>> If anyone has any suggestions for further tests I can do, I'll be happy
>> to try them out.
>
> I noticed you are running a disk on /dev/hda. If this is an older mobo and
> you use multiple PCI IDE controllers, they can cause these problems when
> they are in certain slots or when you put too many in. I have had a few
> mobos that do it if the PCI cards are in a 'certain combination'; I have
> had this happen with an MSI and an ABIT board.

I have a single Promise PCI IDE controller, but I don't have any hard
drives hooked up to it. /dev/hda and all the SATA drives are hooked up
to the onboard nVidia controllers.

My motherboard is a "DFI LanParty UT nf3 250Gb" (despite the unwieldy
name, it's a rather nice board) and I think it dates back to August
2004. I built my machine in November 2004, and I've never had an IDE or
SATA problem with it.

I started testing (and then using) my RAID-5 in May 2006 with Linux
2.6.16. I built the array degraded, so I know that resync worked for me
then. So did restriping, which I did twice. Since then, I've run 2.6.17
and 2.6.19, without any problems.

Thanks,
Corey


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-16  6:47 2.6.20: reproducible hard lockup with RAID-5 resync Corey Hickey
  2007-02-16  7:59 ` Neil Brown
@ 2007-02-18  9:51 ` Corey Hickey
  2007-02-18 21:09   ` Corey Hickey
  1 sibling, 1 reply; 11+ messages in thread
From: Corey Hickey @ 2007-02-18  9:51 UTC (permalink / raw)
  To: linux-raid

Corey Hickey wrote:
> I think I have found an easily-reproducible bug in Linux 2.6.20. I have
> already applied the "Fix various bugs with aligned reads in RAID5"
> patch, and that had no effect. It appears to be related to the resync
> process, and makes the system lock up, hard.

I now have a different build of 2.6.20 running resync without the
reported problem. I'll try to isolate the specific difference between
the two builds some time tomorrow.

-Corey


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-18  9:51 ` Corey Hickey
@ 2007-02-18 21:09   ` Corey Hickey
  2007-02-18 22:04     ` Neil Brown
  0 siblings, 1 reply; 11+ messages in thread
From: Corey Hickey @ 2007-02-18 21:09 UTC (permalink / raw)
  To: linux-raid

Corey Hickey wrote:
> Corey Hickey wrote:
>> I think I have found an easily-reproducible bug in Linux 2.6.20. I have
>> already applied the "Fix various bugs with aligned reads in RAID5"
>> patch, and that had no effect. It appears to be related to the resync
>> process, and makes the system lock up, hard.
> 
> I now have a different build of 2.6.20 running resync without the
> reported problem. I'll try to isolate the specific difference between
> the two builds some time tomorrow.

Ok, so the difference is CONFIG_SYSFS_DEPRECATED. If that is not
defined, the kernel locks up. There's not a lot of code under
#ifdef/#ifndef CONFIG_SYSFS_DEPRECATED, but since I'm not familiar with
any of it I don't expect trying to locate the bug on my own would be
very productive.
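
For reference, a difference like this between two builds can be found by
comparing their .config files. The following is only a sketch of one way to
do it; config_diff is a hypothetical helper and the file names in the
example call are placeholders. Lines of the form "# CONFIG_X is not set"
are rewritten to "CONFIG_X=n" so that a changed option appears as a pair
of adjacent lines in the diff.

```shell
# config_diff GOOD_CONFIG BAD_CONFIG
# Hypothetical helper: list options that differ between two .config files.
config_diff() {
    a=$(mktemp) || return 1
    b=$(mktemp) || return 1
    for pair in "$1:$a" "$2:$b"; do
        src=${pair%%:*}
        dst=${pair##*:}
        # Normalize "not set" lines, keep only option lines, then sort.
        sed -n -e 's/^# \(CONFIG_[A-Za-z0-9_]*\) is not set$/\1=n/' \
               -e '/^CONFIG_[A-Za-z0-9_]*=/p' "$src" | sort > "$dst"
    done
    diff "$a" "$b"
    rm -f "$a" "$b"
}

# Example call (placeholder file names):
# config_diff config-2.6.20.works config-2.6.20.locks-up
```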

Neil, do you have CONFIG_SYSFS_DEPRECATED enabled? If so, does disabling
it reproduce my problem? If you can't reproduce it, should I take the
problem over to linux-kernel?

Thanks,
Corey


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-18 21:09   ` Corey Hickey
@ 2007-02-18 22:04     ` Neil Brown
  2007-02-19  0:06       ` Corey Hickey
  0 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2007-02-18 22:04 UTC (permalink / raw)
  To: Corey Hickey; +Cc: linux-raid

On Sunday February 18, bugfood-ml@fatooh.org wrote:
> 
> Ok, so the difference is CONFIG_SYSFS_DEPRECATED. If that is not
> defined, the kernel locks up. There's not a lot of code under
> #ifdef/#ifndef CONFIG_SYSFS_DEPRECATED, but since I'm not familiar with
> any of it I don't expect trying to locate the bug on my own would be
> very productive.
> 
> Neil, do you have CONFIG_SYSFS_DEPRECATED enabled? If so, does disabling
> it reproduce my problem? If you can't reproduce it, should I take the
> problem over to linux-kernel?

# CONFIG_SYSFS_DEPRECATED is not set

No, it is not set, and yet it all still works for me.
It is very hard to see how this CONFIG option can make a difference.
Have you double checked that setting it removed the problem and
clearing it causes the problem?
If so, then maybe an email to linux-kernel would be appropriate.
Include the full config.

NeilBrown


* Re: 2.6.20: reproducible hard lockup with RAID-5 resync
  2007-02-18 22:04     ` Neil Brown
@ 2007-02-19  0:06       ` Corey Hickey
  0 siblings, 0 replies; 11+ messages in thread
From: Corey Hickey @ 2007-02-19  0:06 UTC (permalink / raw)
  To: linux-raid

Neil Brown wrote:
>> Ok, so the difference is CONFIG_SYSFS_DEPRECATED. If that is not
>> defined, the kernel locks up. There's not a lot of code under
>> #ifdef/#ifndef CONFIG_SYSFS_DEPRECATED, but since I'm not familiar with
>> any of it I don't expect trying to locate the bug on my own would be
>> very productive.
>>
>> Neil, do you have CONFIG_SYSFS_DEPRECATED enabled? If so, does disabling
>> it reproduce my problem? If you can't reproduce it, should I take the
>> problem over to linux-kernel?
> 
> # CONFIG_SYSFS_DEPRECATED is not set
> 
> No, it is not set, and yet it all still works for me.

Dang, again. :)

> It is very hard to see how this CONFIG option can make a difference.
> Have you double checked that setting it removed the problem and
> clearing it causes the problem?

Yes, it seems odd to me too, but I have double-checked. If I build a
kernel with CONFIG_SYSFS_DEPRECATED enabled, it works; if I disable that
option and rebuild the kernel, it locks up.

I just tried running 'make defconfig' and then enabling only RAID,
RAID-0, RAID-1, and RAID-4/5/6. If I then disable
CONFIG_SYSFS_DEPRECATED, there aren't any problems. ...so, I'll try to
isolate the problem some more later.
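
Isolating further would mean flipping options one at a time between the
working and failing configurations and rebuilding. As a rough sketch of the
flipping step, assuming a plain-text .config: set_config_option is a
hypothetical helper, and 'make oldconfig' should still be run afterwards so
that Kconfig can resolve dependencies.

```shell
# set_config_option FILE OPTION VALUE
# Hypothetical helper: set OPTION to VALUE (y, m, ...) in FILE, or write
# the "is not set" form when VALUE is n.
set_config_option() {
    file=$1
    opt=$2
    val=$3
    tmp=$(mktemp) || return 1
    # Drop any existing line for this option, enabled or disabled.
    grep -v -e "^${opt}=" -e "^# ${opt} is not set" "$file" > "$tmp"
    if [ "$val" = "n" ]; then
        echo "# ${opt} is not set" >> "$tmp"
    else
        echo "${opt}=${val}" >> "$tmp"
    fi
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example call (then run 'make oldconfig' and rebuild):
# set_config_option .config CONFIG_SYSFS_DEPRECATED n
```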

Thanks,
Corey

Thread overview: 11+ messages
2007-02-16  6:47 2.6.20: reproducible hard lockup with RAID-5 resync Corey Hickey
2007-02-16  7:59 ` Neil Brown
2007-02-16  8:11   ` Corey Hickey
2007-02-16 21:23   ` Corey Hickey
2007-02-17 10:58     ` Corey Hickey
2007-02-17 11:14       ` Justin Piszcz
2007-02-17 19:19         ` Corey Hickey
2007-02-18  9:51 ` Corey Hickey
2007-02-18 21:09   ` Corey Hickey
2007-02-18 22:04     ` Neil Brown
2007-02-19  0:06       ` Corey Hickey
