* New RAID causing system lockups
@ 2010-09-11 18:13 Mike Hartman
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Hartman @ 2010-09-11 18:13 UTC (permalink / raw)
  To: linux-raid

PART 2:

Eventually I realized that while I couldn't do anything with bash, I
could still run (some) commands directly via ssh (ssh odin <command>)
and they would work OK. I was able to run dmesg and cat some files. I
could ls some directories for a while, but eventually even that
stopped working. I was NOT able to cat /proc/mdstat; it would just
hang. Attached (dmesg_1.txt) is the dmesg output I got, which seems
to include everything from the start of the reshaping up to the
lockup. The RAID subsystem definitely seems to be involved.
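
For what it's worth, the one-shot ssh form doesn't need a working
interactive shell, which is presumably why it kept working a little
longer. The commands were along these lines (odin is just the
server's hostname, and the directory is a placeholder):

ssh odin dmesg
ssh odin ls /some/directory
ssh odin cat /proc/mdstat    <- the one that just hung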

After waiting a day or so with no change and nothing else working, I
gritted my teeth and did a hard reboot, hoping my array wasn't
totally hosed. Fortunately, I was able to reassemble the array using
the backup file specified as part of my conversion command, and the
reshaping picked back up where it left off. It completed without
further incident (it took about 4 days).
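
The reassembly was essentially this (a sketch: <backup file> stands
for the path I had originally passed to --grow, and sdX1/sdY1/sdZ1
for the three member partitions on the bare drives):

mdadm --assemble /dev/md0 --backup-file=<backup file> /dev/sdX1 /dev/sdY1 /dev/sdZ1 /dev/md1p1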

Once the reshaping was complete I ran fsck on its filesystem (it
came back clean even when forced), mounted it, and everything looked
OK. No files appeared to be lost. Chalking the freeze up to a
one-time problem related to the reshaping, I started copying all the
data from one of the other 1.5TB drives onto md0. (The idea is to
keep copying each drive's contents into the array, wiping that
drive, adding it as a hot spare, and then growing the RAID and its
filesystem accordingly, as sketched below.)
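
Each of those cycles looks roughly like this (sdX1 standing in for
whichever partition was just wiped, the device count being whatever
the array grows to, 6 in this example, and <backup file> a path of
your choosing):

mdadm /dev/md0 --add /dev/sdX1
mdadm --grow /dev/md0 --raid-devices=6 --backup-file=<backup file>
resize2fs /dev/md0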

When I was almost done (the 1.5TB drive only had 55GB left on it),
the system hung again. Same symptoms as before. I was able to run
dmesg again (dmesg_2.txt) and the call trace looks pretty similar.
It still mentions the RAID subsystem a good bit, even though no
high-level RAID operations were going on and I was just writing to
the array. This time I only waited an hour or two before giving up
and opting for the hard reboot. Once again the array seemed to be OK
once it was brought back up.

Whatever it is, it seems to be a fairly fundamental problem, and
anything that causes a lockup like this is a pretty big bug in a
stable kernel. The individual drives test out fine with everything
I've tried, and everything looks completely healthy until these
lockups occur. I've attached my lspci output and kernel config in
case there's something useful in there.

Any ideas?

Mike

* New RAID causing system lockups
@ 2010-09-11 18:20 Mike Hartman
  2010-09-11 18:45 ` Mike Hartman
  2010-09-11 20:43 ` Neil Brown
  0 siblings, 2 replies; 17+ messages in thread
From: Mike Hartman @ 2010-09-11 18:20 UTC (permalink / raw)
  To: linux-raid

PART 3:

Update:

I'm even more concerned about this now, because I just started the
newest reshaping to add a new drive with:

mdadm --grow -c 256 --raid-devices=5 --backup-file=/grow_md0.bak /dev/md0

And the system output:

mdadm: Need to backup 768K of critical section..

cat /proc/mdstat shows the reshaping is proceeding:

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdi1[0] sdf1[5] md1p1[4] sdj1[3] sdh1[1]
      2929691136 blocks super 1.2 level 6, 128k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (56576/1464845568) finish=2156.9min speed=11315K/sec

md1 : active raid0 sdg1[0] sdk1[1]
      1465141760 blocks super 1.2 128k chunks

unused devices: <none>

but I've checked for /grow_md0.bak and it's not there. So it looks
like my backup-file option was ignored for some reason.

This scares me, because if I hit the lockup again and am forced to
reboot, I'm afraid that without a backup file my array will be
hosed. I'm also afraid to stop the reshape cleanly right now for the
same reason.

So in addition to fixing the lockup itself, does anyone know if
there's a way to either cancel this reshaping, or to belatedly
attach a backup file some other way so that it will be recoverable?
It's only at 1% and says it will take another 2193 minutes.
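
(If it matters for any suggestions: the reshape state is also
visible through sysfs, which is handy here given that reading
/proc/mdstat is exactly what hung last time:

cat /sys/block/md0/md/sync_action      <- "reshape" while it runs
cat /sys/block/md0/md/sync_completed   <- sectors done / total)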

Mike

* New RAID causing system lockups
@ 2010-09-11 18:12 Mike Hartman
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Hartman @ 2010-09-11 18:12 UTC (permalink / raw)
  To: linux-raid

I've tried sending this twice now (it was my first post to the
list), but it never seems to make it through. I'm resending it in
multiple parts to see if the message was just too long.

PART 1:

First let me outline where I am and how I got to this point.

About a week ago I created a RAID array on my Gentoo server. I
already had a handful of full, independent drives on that server,
plus 3 new empty ones. The three new 1.5TB SATA drives are in an
external eSATA enclosure, along with 2 of the existing drives (750GB
each). The enclosure is connected to the server through a Syba
SD-SA2PEX-2E card (SiI3132 chipset), since that card supports port
multipliers. The other 4 drives (2x 1.5TB, 2x 750GB) are still
mounted in the server itself.

My goal was to end up with all 9 drives in a single RAID 6 array to
use as a storage partition (not for any system files). Since only
the 3 new drives were empty, I wanted to start with a RAID 5, use
that new space to clear off some of the other drives, and bootstrap
my way up to a RAID 6.

My first step was to update to the newest stable Gentoo kernel
(2.6.35-gentoo-r4) to be sure I had reasonably current md/RAID
support. No problems during that upgrade.

Then I created 1.5TB partitions (type 0xDA) on each of the 3 new
(empty) drives and assembled them into a RAID 5 array (md0). Once
that finished resyncing, I created an ext4 filesystem on it and
started copying over everything that was on the two 750GB drives in
the same enclosure.
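
Roughly, modulo exact device names (written here as sdX/sdY/sdZ):

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdX1 /dev/sdY1 /dev/sdZ1
mkfs.ext4 /dev/md0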

Once that was done (no problems) and the 750s were empty, I created
a RAID 0 (md1) from them. I created a 1.5TB partition on md1, just
like I had on the bare drives, and then added that partition to md0
as a hot spare. I've seen that approach in several RAID tutorials;
it seems like the only way to get these undersized drives into the
same RAID 6.
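
Concretely, with sdg1 and sdk1 being the two 750GB partitions (the
names show up in the /proc/mdstat output in PART 3), the sequence
was along these lines:

mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdg1 /dev/sdk1

then one 1.5TB partition (type 0xDA) on md1, and:

mdadm /dev/md0 --add /dev/md1p1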

Then I switched md0 over to RAID 6, using that hot spare. The
reshaping was SLOW (4MB/s), but that seems to be par for the course
for a RAID5->RAID6 transition.
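
The conversion command was of this form (<backup file> standing for
the path I gave; 4 devices = the 3 original drives plus the md1p1
spare):

mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=<backup file>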

It was during this reshaping that I saw my first lockup. I was
monitoring things via SSH, and the reshaping was about 13% complete.
The filesystem was mounted but wasn't being written to (or even read
much). I noticed my SSH session had stopped responding, so I tried
creating a new one in a fresh terminal. I was able to enter my
password, see the MOTD, and get a prompt, but I couldn't type
anything into it. I tried this several times with no luck. Then I
physically sat down at the computer (no X running) and couldn't even
get the screen to wake up. The monitor's LED made it seem awake, but
I only got a black screen, and I couldn't even Ctrl-Alt-F2 to a
fresh virtual terminal.

CONTINUED IN PART 2

