* Raid Recovery after Machine Failure
@ 2005-03-13  1:54 Can Sar
  2005-03-13  9:47 ` David Greaves
  0 siblings, 1 reply; 2+ messages in thread
From: Can Sar @ 2005-03-13  1:54 UTC (permalink / raw)
  To: linux-raid

Hi,

I am working with a research group that is building a tool to 
automatically find bugs in file systems, and we have some related 
questions. We are trying to check whether file systems really provide 
the consistency guarantees they promise, and one aspect we are looking 
at is running them on top of RAID devices. To do this we have to 
understand a few things about the Linux RAID driver/tools that I 
haven't been able to figure out from the documentation/source code, so 
maybe you can help me.
I asked this same question a few days ago, but I don't think I stated 
it clearly, so let me try to rephrase it.

For RAID 4-6 with, say, 5 disks, suppose we write a block that is 
striped across all the disks, and after 4 of the disks have written 
their part of the stripe the machine crashes, without the 5th disk 
being able to complete its write. Because of this, the parity for that 
stripe should be incorrect, right?
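
To make the scenario concrete, here is a rough sketch in Python (purely 
illustrative, nothing to do with the md driver's actual code) of XOR 
parity going stale when one chunk of a full-stripe write is lost:

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

old = [b"a1", b"a2", b"a3", b"a4"]   # data chunks before the write
new = [b"b1", b"b2", b"b3", b"b4"]   # data chunks being written

# Disks 1-3 and the parity disk complete their writes; the 4th data
# disk's write is lost in the crash, so its chunk still holds old data.
data_on_disk = [new[0], new[1], new[2], old[3]]
parity_on_disk = xor_blocks(new)     # parity computed for the new data

print(xor_blocks(data_on_disk) == parity_on_disk)  # False: stripe inconsistent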

The RAID array is a Linux software RAID array set up using mdadm, and 
none of the disks actually crashed or had any write errors during this 
operation (the machine crashed for some other reason). We then reboot 
the machine and recreate the array, then remount it and try to read 
the sector that was previously written (whose stripe parity is now 
incorrect). At what point will the RAID driver discover that something 
is wrong? Will it ever? (I feel it should discover this during the 
read, at the latest.) Will it try to perform any kind of recovery, or 
simply fail?
How would this change if only 3 of the 5 disk writes made it to disk? 
Fixing the error would be impossible, of course (at least with RAID 4 
and 5; I know little about 6), but detection should still work. Will 
the driver complain?

Thank you so very much for your help,
Can



* Re: Raid Recovery after Machine Failure
  2005-03-13  1:54 Raid Recovery after Machine Failure Can Sar
@ 2005-03-13  9:47 ` David Greaves
  0 siblings, 0 replies; 2+ messages in thread
From: David Greaves @ 2005-03-13  9:47 UTC (permalink / raw)
  To: Can Sar; +Cc: linux-raid

I *think* this is correct.
I'm a user, not a coder.
If nothing else it should help you search the archives for clarification :)

In general, I think the answer lies in md's superblocks.

Can Sar wrote:

> Hi,
>
> I am working with a research group that is building a tool to 
> automatically find bugs in file systems, and we have some related 
> questions. We are trying to check whether file systems really provide 
> the consistency guarantees they promise, and one aspect we are looking 
> at is running them on top of RAID devices. To do this we have to 
> understand a few things about the Linux RAID driver/tools that I 
> haven't been able to figure out from the documentation/source code, 
> so maybe you can help me.
> I asked this same question a few days ago, but I don't think I stated 
> it clearly, so let me try to rephrase it.
>
> For RAID 4-6 with, say, 5 disks, suppose we write a block that is 
> striped across all the disks, and after 4 of the disks have written 
> their part of the stripe the machine crashes, without the 5th disk 
> being able to complete its write. Because of this, the parity for that 
> stripe should be incorrect, right?

If I understand correctly, the superblocks are updated after each 
device sync - in this case the superblock on disk 5 will be different 
from the ones on disks 1-4. This means that disk 5 is kicked on restart 
and the array re-syncs, using disks 1-4 to verify or rewrite (not sure 
which) disk 5.
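
A toy sketch of that idea in Python (only an illustration of comparing 
per-device superblock event counters at assembly time; the names are 
made up and this is not how the md code actually looks):

def assemble(superblocks):
    # superblocks: list of (device_name, event_count) pairs read at start-up
    newest = max(events for _, events in superblocks)
    for dev, events in superblocks:
        if events < newest:
            print("kicking %s: events %d < %d" % (dev, events, newest))
    # the array starts (possibly degraded) with only the up-to-date members
    return [(dev, ev) for dev, ev in superblocks if ev == newest]

# Disk 5 missed the last superblock update before the crash:
assemble([("disk1", 42), ("disk2", 42), ("disk3", 42), ("disk4", 42),
          ("disk5", 41)])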

> The RAID array is a Linux software RAID array set up using mdadm, and 
> none of the disks actually crashed or had any write errors during this 
> operation (the machine crashed for some other reason). We then reboot 
> the machine and recreate the array,

This should be 'automatic'.
It's not so much 'recreate' (which has a special, recovery-related 
meaning in md terminology) as 'assemble'/'start' the md device.

> then remount it and try to read the sector that was previously 
> written (whose stripe parity is now incorrect). At what point will 
> the RAID driver discover that something is wrong?

As it starts, it checks the superblock sequence numbers, notices that 
one disk is out of date, and does not use it.

> Will it ever? (I feel it should discover this during the read, at the 
> latest.) Will it try to perform any kind of recovery, or simply fail?

So, since that disk's superblock is out of date, the array starts in 
'degraded' mode and resyncs.
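
For what it's worth, here is a toy Python sketch of what that resync 
can do for the stale chunk with RAID-4/5 style XOR parity: the kicked 
disk's chunk is rebuilt from the surviving chunks (again, purely 
illustrative, not md's code):

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

data = [b"b1", b"b2", b"b3", b"b4"]   # up-to-date data chunks
parity = xor_blocks(data)             # parity chunk

# Suppose the kicked disk held the 4th data chunk; rebuild it from the
# other data chunks plus parity. (If the kicked disk held parity,
# resync would simply recompute the parity instead.)
rebuilt = xor_blocks([data[0], data[1], data[2], parity])
assert rebuilt == data[3]
print("rebuilt chunk:", rebuilt)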

> How would this change if only 3 of the 5 disk writes made it to disk? 
> Fixing the error would be impossible, of course (at least with RAID 4 
> and 5; I know little about 6), but detection should still work. Will 
> the driver complain?

I don't know what happens if the superblock fails to update on, say, 3 
out of 6 disks in an array.
The driver _will_ complain.

Newer kernels have an experimental fault-injection facility that you 
may be interested in:
CONFIG_MD_FAULTY:
The "faulty" module allows for a block device that occasionally returns
read or write errors.  It is useful for testing.
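
For your testing tool, the concept is roughly this (a Python sketch of 
the idea only - the wrapper and its parameters are made up for 
illustration and have nothing to do with the kernel module itself):

def make_faulty(read_fn, fail_every=3):
    # Wrap a read function so that every `fail_every`-th call raises an
    # I/O error, mimicking a device that occasionally fails reads.
    count = [0]
    def faulty_read(offset, length):
        count[0] += 1
        if count[0] % fail_every == 0:
            raise IOError("injected read error at offset %d" % offset)
        return read_fn(offset, length)
    return faulty_read

backing = bytearray(4096)
read = make_faulty(lambda off, ln: bytes(backing[off:off + ln]))

for i in range(4):
    try:
        read(i * 512, 512)
    except IOError as e:
        print(e)                      # the 3rd read fails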

HTH

David

