* Requesting replace mode for changing a disk
@ 2009-05-08 22:15 Goswin von Brederlow
  2009-05-09 11:41 ` John Robinson
  2009-05-09 23:07 ` Bill Davidsen
  0 siblings, 2 replies; 24+ messages in thread
From: Goswin von Brederlow @ 2009-05-08 22:15 UTC (permalink / raw)
  To: linux-raid

Hi,

consider the following situation: You have a software raid that runs
fine but one disk is suspect (e.g. SMART says failure imminent or
something). How do you replace that disk?

Currently you have to fail/remove the disk from the raid, add a
fresh disk and resync. That leaves a large window in which redundancy
is compromised. With current disk sizes that can be days.
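In mdadm terms, today's procedure looks roughly like the dry-run sketch below. The device names (/dev/md0, /dev/sdX, /dev/sdY) are hypothetical, and the helper only prints each command instead of executing it:

```shell
# Dry-run sketch of the current manual replacement procedure.
# All device names are hypothetical.
run() { echo "+ $*"; }                 # print each command instead of running it

run mdadm /dev/md0 --fail   /dev/sdX   # mark the suspect disk faulty
run mdadm /dev/md0 --remove /dev/sdX   # remove it from the array
run mdadm /dev/md0 --add    /dev/sdY   # add the fresh disk; resync begins
# The array runs degraded for the whole resync, which on large disks
# can take days -- that is the redundancy window described above.
```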

It would be nice if one could tell the kernel to replace a disk in a
raid set with a spare without the need to degrade the raid.

Thoughts?

MfG
        Goswin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
  2009-05-08 22:15 Requesting replace mode for changing a disk Goswin von Brederlow
@ 2009-05-09 11:41 ` John Robinson
  2009-05-09 23:07 ` Bill Davidsen
  1 sibling, 0 replies; 24+ messages in thread
From: John Robinson @ 2009-05-09 11:41 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-raid

On 08/05/2009 23:15, Goswin von Brederlow wrote:
> Hi,
> 
> consider the following situation: You have a software raid that runs
> fine but one disk is suspect (e.g. SMART says failure imminent or
> something). How do you replace that disk?
> 
> Currently you have to fail/remove the disk from the raid, add a
> fresh disk and resync. That leaves a large window in which redundancy
> is compromised. With current disk sizes that can be days.
> 
> It would be nice if one could tell the kernel to replace a disk in a
> raid set with a spare without the need to degrade the raid.
> 
> Thoughts?

I remember this being discussed a few months ago, and I think it's 
fairly high up Neil Brown's to-do list / roadmap for the future.

Cheers,

John.



* Re: Requesting replace mode for changing a disk
  2009-05-08 22:15 Requesting replace mode for changing a disk Goswin von Brederlow
  2009-05-09 11:41 ` John Robinson
@ 2009-05-09 23:07 ` Bill Davidsen
  2009-05-10  1:22   ` Goswin von Brederlow
                     ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: Bill Davidsen @ 2009-05-09 23:07 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-raid

Goswin von Brederlow wrote:
> Hi,
>
> consider the following situation: You have a software raid that runs
> fine but one disk is suspect (e.g. SMART says failure imminent or
> something). How do you replace that disk?
>
> Currently you have to fail/remove the disk from the raid, add a
> fresh disk and resync. That leaves a large window in which redundancy
> is compromised. With current disk sizes that can be days.
>
> It would be nice if one could tell the kernel to replace a disk in a
> raid set with a spare without the need to degrade the raid.
>
> Thoughts?
>   

This is one of many things proposed occasionally here, no real 
objection, sometimes loud support, but no one actually *does* the code.

You have described the problem exactly, and the solution is still to do 
it manually. But you don't need to fail the drive long term, if you can 
stop the array for a few moments. You stop the array, remove the suspect 
drive, create a raid1 of the suspect drive marked write-mostly and the 
new spare, then add the raid1 in place of the suspect drive. For any 
chunks already present on the new drive, reads will go there, reducing 
access to the suspect drive, while data is copied from the old drive to 
the new in resync. Writes still go to the old suspect drive, so if the 
new drive fails you are no worse off. When the raid1 is clean you stop 
the main array and back the suspect drive out.
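Spelled out as commands, that sequence might look like the dry-run sketch below. Device names are hypothetical, and the write-mostly marking Bill mentions is omitted for simplicity (mdadm's --write-mostly flag applies to the devices listed after it, so marking only the suspect drive needs care):

```shell
# Dry-run sketch of the raid1-interposition trick; hypothetical names.
run() { echo "+ $*"; }                 # print each command instead of running it

run mdadm --stop /dev/md0                                       # brief downtime starts
run mdadm --build /dev/md9 --level=1 --raid-devices=2 \
    /dev/suspect /dev/new                                       # mirror: suspect -> new
run mdadm --assemble /dev/md0 /dev/md9 /dev/other1 /dev/other2  # md9 stands in for the suspect disk
# Later, when /proc/mdstat shows md9 clean: stop md0 again, back the
# suspect drive out of md9, and reassemble md0.
```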

This is complicated enough that I totally agree a hot migrate would be 
desirable. This is why people use lvm, although I make zero claims that 
this same problem is any easier to solve there; I'm just not an lvm guru (or 
even a newbie, just an occasional user).

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses  after a federal bailout.




* Re: Requesting replace mode for changing a disk
  2009-05-09 23:07 ` Bill Davidsen
@ 2009-05-10  1:22   ` Goswin von Brederlow
  2009-05-10  2:20   ` Guy Watkins
  2009-05-13  1:21   ` Leslie Rhorer
  2 siblings, 0 replies; 24+ messages in thread
From: Goswin von Brederlow @ 2009-05-10  1:22 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Goswin von Brederlow, linux-raid

Bill Davidsen <davidsen@tmr.com> writes:

> Goswin von Brederlow wrote:
>> Hi,
>>
>> consider the following situation: You have a software raid that runs
>> fine but one disk is suspect (e.g. SMART says failure imminent or
>> something). How do you replace that disk?
>>
>> Currently you have to fail/remove the disk from the raid, add a
>> fresh disk and resync. That leaves a large window in which redundancy
>> is compromised. With current disk sizes that can be days.
>>
>> It would be nice if one could tell the kernel to replace a disk in a
>> raid set with a spare without the need to degrade the raid.
>>
>> Thoughts?
>>
>
> This is one of many things proposed occasionally here, no real
> objection, sometimes loud support, but no one actually *does* the code.
>
> You have described the problem exactly, and the solution is still to
> do it manually. But you don't need to fail the drive long term, if you
> can stop the array for a few moments. You stop the array, remove the
> suspect drive, create a raid1 of the suspect drive marked write-mostly
> and the new spare, then add the raid1 in place of the suspect
> drive. For any chunks present on the new drive the reads will go
> there, reducing access, while data is copied from the old to the new
> in resync, and writes still go to the old suspect drive so if the new
> drive fails you are no worse off. When the raid1 is clean you stop the
> main array and back the suspect drive out.
>
> This is complicated enough that I totally agree a hot migrate would be
> desirable. This is why people use lvm, although I make zero claims
> that this same problem will solve more easily, I'm just not an lvm
> guru (or even a newbie, just an occasional user).

The difference, apart from simpler usage, would be that the raid does
not have to be stopped. Stopping the raid that contains / or /usr means
some downtime.

In the case of LVM you can suspend a device-mapper device and alter its
mapping any way you wish, so you can do things manually without
unmounting the filesystems. But lvm / device-mapper doesn't have all
the raid machinery, so one can't just switch.
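For comparison, the LVM-side equivalent of a live migration is pvmove, which moves extents off a physical volume while the filesystems stay mounted. A dry-run sketch with hypothetical device and volume-group names:

```shell
run() { echo "+ $*"; }                # print each command instead of running it

run pvcreate /dev/new                 # prepare the replacement disk as a PV
run vgextend vg0 /dev/new             # add it to the volume group
run pvmove /dev/suspect /dev/new      # migrate extents online; no unmount needed
run vgreduce vg0 /dev/suspect         # finally drop the suspect disk
```

Note that, unlike the md raid1 trick, plain pvmove has nothing to fall back on if /dev/suspect throws a read error mid-migration.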


* RE: Requesting replace mode for changing a disk
  2009-05-09 23:07 ` Bill Davidsen
  2009-05-10  1:22   ` Goswin von Brederlow
@ 2009-05-10  2:20   ` Guy Watkins
  2009-05-10  7:02     ` Goswin von Brederlow
  2009-05-10 14:33     ` Bill Davidsen
  2009-05-13  1:21   ` Leslie Rhorer
  2 siblings, 2 replies; 24+ messages in thread
From: Guy Watkins @ 2009-05-10  2:20 UTC (permalink / raw)
  To: 'Bill Davidsen', 'Goswin von Brederlow'; +Cc: linux-raid

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of Bill Davidsen
} Sent: Saturday, May 09, 2009 7:08 PM
} To: Goswin von Brederlow
} Cc: linux-raid@vger.kernel.org
} Subject: Re: Requesting replace mode for changing a disk
} 
} Goswin von Brederlow wrote:
} > Hi,
} >
} > consider the following situation: You have a software raid that runs
} > fine but one disk is suspect (e.g. SMART says failure imminent or
} > something). How do you replace that disk?
} >
} > Currently you have to fail/remove the disk from the raid, add a
} > fresh disk and resync. That leaves a large window in which redundancy
} > is compromised. With current disk sizes that can be days.
} >
} > It would be nice if one could tell the kernel to replace a disk in a
} > raid set with a spare without the need to degrade the raid.
} >
} > Thoughts?
} >
} 
} This is one of many things proposed occasionally here, no real
} objection, sometimes loud support, but no one actually *does* the code.
} 
} You have described the problem exactly, and the solution is still to do
} it manually. But you don't need to fail the drive long term, if you can
} stop the array for a few moments. You stop the array, remove the suspect
} drive, create a raid1 of the suspect drive marked write-mostly and the
} new spare, then add the raid1 in place of the suspect drive. For any
} chunks present on the new drive the reads will go there, reducing
} access, while data is copied from the old to the new in resync, and
} writes still go to the old suspect drive so if the new drive fails you
} are no worse off. When the raid1 is clean you stop the main array and
} back the suspect drive out.
} 
} This is complicated enough that I totally agree a hot migrate would be
} desirable. This is why people use lvm, although I make zero claims that
} this same problem will solve more easily, I'm just not an lvm guru (or
} even a newbie, just an occasional user).

If the disk is suspect, I would expect read errors!
If you have 1 bad block on the suspect disk, this process will fail.
If the logic were built into md, then any read errors while replacing could
be recovered from another disk or disks.



* Re: Requesting replace mode for changing a disk
  2009-05-10  2:20   ` Guy Watkins
@ 2009-05-10  7:02     ` Goswin von Brederlow
  2009-05-10 14:33     ` Bill Davidsen
  1 sibling, 0 replies; 24+ messages in thread
From: Goswin von Brederlow @ 2009-05-10  7:02 UTC (permalink / raw)
  To: Guy Watkins
  Cc: 'Bill Davidsen', 'Goswin von Brederlow', linux-raid

"Guy Watkins" <linux-raid@watkins-home.com> writes:

> If the disk is suspect, I would expect read errors!
> If you have 1 bad block on the suspect disk, this process will fail.
> If the logic was built-in to md, then any read errors while replacing could
> be recovered from another disk or disks.

That is actually a good point. I didn't even think of that.

It would require a true replace mode in all raid levels though and not
just internally starting a raid1.

MfG
        Goswin


* Re: Requesting replace mode for changing a disk
  2009-05-10  2:20   ` Guy Watkins
  2009-05-10  7:02     ` Goswin von Brederlow
@ 2009-05-10 14:33     ` Bill Davidsen
  2009-05-10 15:55       ` Guy Watkins
  1 sibling, 1 reply; 24+ messages in thread
From: Bill Davidsen @ 2009-05-10 14:33 UTC (permalink / raw)
  To: Guy Watkins; +Cc: 'Goswin von Brederlow', linux-raid

Guy Watkins wrote:
> } -----Original Message-----
> } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> } owner@vger.kernel.org] On Behalf Of Bill Davidsen
> } Sent: Saturday, May 09, 2009 7:08 PM
> } To: Goswin von Brederlow
> } Cc: linux-raid@vger.kernel.org
> } Subject: Re: Requesting replace mode for changing a disk
> } 
> } Goswin von Brederlow wrote:
> } > Hi,
> } >
> } > consider the following situation: You have a software raid that runs
> } > fine but one disk is suspect (e.g. SMART says failure imminent or
> } > something). How do you replace that disk?
> } >
> } > Currently you have to fail/remove the disk from the raid, add a
> } > fresh disk and resync. That leaves a large window in which redundancy
> } > is compromised. With current disk sizes that can be days.
> } >
> } > It would be nice if one could tell the kernel to replace a disk in a
> } > raid set with a spare without the need to degrade the raid.
> } >
> } > Thoughts?
> } >
> } 
> } This is one of many things proposed occasionally here, no real
> } objection, sometimes loud support, but no one actually *does* the code.
> } 
> } You have described the problem exactly, and the solution is still to do
> } it manually. But you don't need to fail the drive long term, if you can
> } stop the array for a few moments. You stop the array, remove the suspect
> } drive, create a raid1 of the suspect drive marked write-mostly and the
> } new spare, then add the raid1 in place of the suspect drive. For any
> } chunks present on the new drive the reads will go there, reducing
> } access, while data is copied from the old to the new in resync, and
> } writes still go to the old suspect drive so if the new drive fails you
> } are no worse off. When the raid1 is clean you stop the main array and
> } back the suspect drive out.
> } 
> } This is complicated enough that I totally agree a hot migrate would be
> } desirable. This is why people use lvm, although I make zero claims that
> } this same problem will solve more easily, I'm just not an lvm guru (or
> } even a newbie, just an occasional user).
>
> If the disk is suspect, I would expect read errors!
> If you have 1 bad block on the suspect disk, this process will fail.
>   

The raid1 is part of the original raid5, so the error should go to that 
level, where it will be recovered, and hopefully then rewritten. I have 
actually done this, and it has always completed, so I haven't researched 
why it worked, just noted that it did.
> If the logic was built-in to md, then any read errors while replacing could
> be recovered from another disk or disks.
>
>   


-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses  after a federal bailout.




* RE: Requesting replace mode for changing a disk
  2009-05-10 14:33     ` Bill Davidsen
@ 2009-05-10 15:55       ` Guy Watkins
  0 siblings, 0 replies; 24+ messages in thread
From: Guy Watkins @ 2009-05-10 15:55 UTC (permalink / raw)
  To: 'Bill Davidsen', 'Guy Watkins'
  Cc: 'Goswin von Brederlow', linux-raid

} -----Original Message-----
} From: Bill Davidsen [mailto:davidsen@tmr.com]
} Sent: Sunday, May 10, 2009 10:34 AM
} To: Guy Watkins
} Cc: 'Goswin von Brederlow'; linux-raid@vger.kernel.org
} Subject: Re: Requesting replace mode for changing a disk
} 
} Guy Watkins wrote:
} > } -----Original Message-----
} > } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} > } owner@vger.kernel.org] On Behalf Of Bill Davidsen
} > } Sent: Saturday, May 09, 2009 7:08 PM
} > } To: Goswin von Brederlow
} > } Cc: linux-raid@vger.kernel.org
} > } Subject: Re: Requesting replace mode for changing a disk
} > }
} > } Goswin von Brederlow wrote:
} > } > Hi,
} > } >
} > } > consider the following situation: You have a software raid that runs
} > } > fine but one disk is suspect (e.g. SMART says failure imminent or
} > } > something). How do you replace that disk?
} > } >
} > } > Currently you have to fail/remove the disk from the raid, add a
} > } > fresh disk and resync. That leaves a large window in which
} redundancy
} > } > is compromised. With current disk sizes that can be days.
} > } >
} > } > It would be nice if one could tell the kernel to replace a disk in a
} > } > raid set with a spare without the need to degrade the raid.
} > } >
} > } > Thoughts?
} > } >
} > }
} > } This is one of many things proposed occasionally here, no real
} > } objection, sometimes loud support, but no one actually *does* the
} code.
} > }
} > } You have described the problem exactly, and the solution is still to
} do
} > } it manually. But you don't need to fail the drive long term, if you
} can
} > } stop the array for a few moments. You stop the array, remove the
} suspect
} > } drive, create a raid1 of the suspect drive marked write-mostly and the
} > } new spare, then add the raid1 in place of the suspect drive. For any
} > } chunks present on the new drive the reads will go there, reducing
} > } access, while data is copied from the old to the new in resync, and
} > } writes still go to the old suspect drive so if the new drive fails you
} > } are no worse off. When the raid1 is clean you stop the main array and
} > } back the suspect drive out.
} > }
} > } This is complicated enough that I totally agree a hot migrate would be
} > } desirable. This is why people use lvm, although I make zero claims
} that
} > } this same problem will solve more easily, I'm just not an lvm guru (or
} > } even a newbie, just an occasional user).
} >
} > If the disk is suspect, I would expect read errors!
} > If you have 1 bad block on the suspect disk, this process will fail.
} >
} 
} The raid1 is part of the original raid5, so the error should go to that
} level, where it will be recovered, and hopefully then rewritten. I have
} actually done this, and it has always completed, so I haven't researched
} why it worked, just noted that it did.

It depends on who sees the error.  If the parent array is trying to read,
then yes.  But if the RAID1 is reading to sync, then no.  The RAID1 layer
does not know about the RAID5 (or whatever) just above.

} > If the logic was built-in to md, then any read errors while replacing
} could
} > be recovered from another disk or disks.
} >
} >



* RE: Requesting replace mode for changing a disk
  2009-05-09 23:07 ` Bill Davidsen
  2009-05-10  1:22   ` Goswin von Brederlow
  2009-05-10  2:20   ` Guy Watkins
@ 2009-05-13  1:21   ` Leslie Rhorer
  2009-05-13  3:27     ` Goswin von Brederlow
  2009-05-13  4:31     ` Neil Brown
  2 siblings, 2 replies; 24+ messages in thread
From: Leslie Rhorer @ 2009-05-13  1:21 UTC (permalink / raw)
  To: 'Linux RAID'

> This is one of many things proposed occasionally here, no real
> objection, sometimes loud support, but no one actually *does* the code.

	At the risk of being a me-too, I would really love this feature.

> You have described the problem exactly, and the solution is still to do
> it manually. But you don't need to fail the drive long term, if you can
> stop the array for a few moments. You stop the array, remove the suspect
> drive,

Um, how, exactly?  That is to say, after stopping the array, how does one
remove the drive?  From the next step in your suggestion, it doesn't seem
to me you are talking about physically removing the drive, so how does one
remove a drive from a stopped array for this purpose?  I didn't think that
either

	mdadm -r <drive> <array>
or

	mdadm -f <drive> <array>

could be used on a stopped array.  Am I mistaken?

> create a raid1 of the suspect drive marked write-mostly and the
> new spare,

But doesn't creating the array with the drive wipe the contents?  If so, it
doesn't seem to me this provides much redundancy.

> then add the raid1 in place of the suspect drive.

Before starting the array?  If so, how?  Or should one do an assemble
including the newly minted RAID1?  I thought mdadm would take the newly
added drive to be blank, even if it isn't.

> For any
> chunks present on the new drive the reads will go there, reducing

Huh?  Are you saying any read which finds one chunk missing will
automatically write back the missing data (doing a spot rebuild), or
something else?

> access, while data is copied from the old to the new in resync, and

See my query above.  It seems to me you are saying the RAID1 can be created
without wiping the drive.

> writes still go to the old suspect drive so if the new drive fails you
> are no worse off.

I think I would expect the old drive to be more likely to fail than the new.

> When the raid1 is clean you stop the main array and
> back the suspect drive out.

OK, basically the same question.  How does one disassemble the RAID1 array
without wiping the data on the new drive?
 



* Re: Requesting replace mode for changing a disk
  2009-05-13  1:21   ` Leslie Rhorer
@ 2009-05-13  3:27     ` Goswin von Brederlow
  2009-05-13  4:36       ` Neil Brown
  2009-05-13  4:31     ` Neil Brown
  1 sibling, 1 reply; 24+ messages in thread
From: Goswin von Brederlow @ 2009-05-13  3:27 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

"Leslie Rhorer" <lrhorer@satx.rr.com> writes:

>> This is one of many things proposed occasionally here, no real
>> objection, sometimes loud support, but no one actually *does* the code.
>
> 	At the risk of being a me-too, I would really love this feature.
>
>> You have described the problem exactly, and the solution is still to do
>> it manually. But you don't need to fail the drive long term, if you can
>> stop the array for a few moments. You stop the array, remove the suspect
>> drive,
>
> Um, how, exactly?  That is to say, after stopping the array, how does one
> remove the drive?  From the next step in your suggestion, it doesn't seem
> to me you are talking about physically removing the drive, so how does one
> remove a drive from a stopped array for this purpose?  I didn't think that
> either
>
> 	mdadm -r <drive> <array>
> or
>
> 	mdadm -f <drive> <array>
>
> could be used on a stopped array.  Am I mistaken?
>
>> create a raid1 of the suspect drive marked write-mostly and the
>> new spare,
>
> But doesn't creating the array with the drive wipe the contents?  If so, it
> doesn't seem to me this provides much redundancy.
>
>> then add the raid1 in place of the suspect drive.
>
> Before starting the array?  If so, how?  Or should one do an assemble
> including the newly minted RAID1?  I thought mdadm would take the newly
> added drive to be blank, even if it isn't.
>
>> For any
>> chunks present on the new drive the reads will go there, reducing
>
> Huh?  Are you saying any read which finds one chunk missing will
> automatically write back the missing data (doing a spot rebuild), or
> something else?
>
>> access, while data is copied from the old to the new in resync, and
>
> See my query above.  It seems to me you are saying the RAID1 can be created
> without wiping the drive.
>
>> writes still go to the old suspect drive so if the new drive fails you
>> are no worse off.
>
> I think I would expect the old drive to be more likely to fail than the new.
>
>> When the raid1 is clean you stop the main array and
>> back the suspect drive out.
>
> OK, basically the same question.  How does one disassemble the RAID1 array
> without wiping the data on the new drive?

I think he meant this:

mdadm --stop /dev/md0
mdadm --build /dev/md9 --chunk=64k --level=1 --raid-devices=2 /dev/suspect /dev/new
mdadm --assemble /dev/md0 /dev/md9 /dev/other ...

MfG
        Goswin


* RE: Requesting replace mode for changing a disk
  2009-05-13  1:21   ` Leslie Rhorer
  2009-05-13  3:27     ` Goswin von Brederlow
@ 2009-05-13  4:31     ` Neil Brown
  2009-05-13  4:37       ` SandeepKsinha
  2009-05-13  7:28       ` Goswin von Brederlow
  1 sibling, 2 replies; 24+ messages in thread
From: Neil Brown @ 2009-05-13  4:31 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

On Tuesday May 12, lrhorer@satx.rr.com wrote:
> 
> But doesn't creating the array with the drive wipe the contents?  If so, it
> doesn't seem to me this provides much redundancy.

No.  Creating an array does not wipe the contents.
It might cause a resync which will copy contents from one drive to the
other and I don't promise which one.
However if you:

   mdadm -C /dev/md0 --level 1 -n 2 /dev/foo missing
   mdadm /dev/md0 --add /dev/bar

then the contents on /dev/foo will not be changed (except for a few K
at the end for the metadata) and then all of foo will be copied to
bar.

NeilBrown


* Re: Requesting replace mode for changing a disk
  2009-05-13  3:27     ` Goswin von Brederlow
@ 2009-05-13  4:36       ` Neil Brown
  2009-05-13  7:37         ` Goswin von Brederlow
  2009-05-14 10:44         ` David Greaves
  0 siblings, 2 replies; 24+ messages in thread
From: Neil Brown @ 2009-05-13  4:36 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: lrhorer, 'Linux RAID'

On Wednesday May 13, goswin-v-b@web.de wrote:
> > OK, basically the same question.  How does one disassemble the RAID1 array
> > without wiping the data on the new drive?
> 
> I think he meant this:
> 
> mdadm --stop /dev/md0
> mdadm --build /dev/md9 --chunk=64k --level=1 --raid-devices=2 /dev/suspect /dev/new
> mdadm --assemble /dev/md0 /dev/md9 /dev/other ...

or better still:

  mdadm --grow /dev/md0 --bitmap internal
  mdadm /dev/md0 --fail /dev/suspect --remove /dev/suspect
  mdadm --build /dev/md9 --level 1 --raid-devices 2 /dev/suspect missing
  mdadm /dev/md0 --add /dev/md9
  mdadm /dev/md9 --add /dev/new

no down time at all.  The bitmap ensures that /dev/md9 will be
recovered almost immediately once it is added back in to the array.

The one problem with this approach is that if there is a read error on
/dev/suspect while data is being copied to /dev/new, you lose.

Hence the requested functionality which I do hope to implement for
raid456 and raid10 (it adds no value to raid1).
Maybe by the end of this year... it is on the roadmap.

NeilBrown


* Re: Requesting replace mode for changing a disk
  2009-05-13  4:31     ` Neil Brown
@ 2009-05-13  4:37       ` SandeepKsinha
  2009-05-13  4:54         ` Neil Brown
  2009-05-13  7:28       ` Goswin von Brederlow
  1 sibling, 1 reply; 24+ messages in thread
From: SandeepKsinha @ 2009-05-13  4:37 UTC (permalink / raw)
  To: Neil Brown; +Cc: lrhorer, Linux RAID

Hi,

On Wed, May 13, 2009 at 10:01 AM, Neil Brown <neilb@suse.de> wrote:
> On Tuesday May 12, lrhorer@satx.rr.com wrote:
>>
>> But doesn't creating the array with the drive wipe the contents?  If so, it
>> doesn't seem to me this provides much redundancy.
>
> No.  Creating an array does not wipe the contents.
> It might cause a resync which will copy contents from one drive to the
> other and I don't promise which one.
> However if you:
>
Now, my question is: what if I create a RAID1 with 100 disks on each side?
Do you mean to say that there will be an unnecessary resync happening
there as well, even for unallocated/unwritten data?

If that's the case, we surely need to handle these two situations
differently: (1) the one Neil mentioned, and (2) the one I mentioned above.

Remember, I am referring to the case of creation.

>   mdadm -C /dev/md0 --level 1 -n 2 /dev/foo missing
>   mdadm /dev/md0 --add /dev/bar
>
> then the contents on /dev/foo will not be changed (except for a few K
> at the end for the metadata) and then all of foo will be copied to
> bar.
>

Will the create happen in the first place?
> NeilBrown
>



-- 
Regards,
Sandeep.

“To learn is to change. Education is a process that changes the learner.”
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Requesting replace mode for changing a disk
  2009-05-13  4:37       ` SandeepKsinha
@ 2009-05-13  4:54         ` Neil Brown
  2009-05-13  5:07           ` SandeepKsinha
  0 siblings, 1 reply; 24+ messages in thread
From: Neil Brown @ 2009-05-13  4:54 UTC (permalink / raw)
  To: SandeepKsinha; +Cc: lrhorer, Linux RAID

On Wednesday May 13, sandeepksinha@gmail.com wrote:
> Hi,
> 
> On Wed, May 13, 2009 at 10:01 AM, Neil Brown <neilb@suse.de> wrote:
> > On Tuesday May 12, lrhorer@satx.rr.com wrote:
> >>
> >> But doesn't creating the array with the drive wipe the contents?  If so, it
> >> doesn't seem to me this provides much redundancy.
> >
> > No.  Creating an array does not wipe the contents.
> > It might cause a resync which will copy contents from one drive to the
> > other and I don't promise which one.
> > However if you:
> >
> Now, my question is that what if I create a RAID1 with 100 disks on each side.
> Do you mean to say that there will be unnecessary resync happening
> there as well, that too for unallocated/written data.

I'm not sure what "100 disks on each side" means.
Do you mean a raid1 across 100 devices?  i.e. 100 copies of each
block?

In any case, md has no concept of unallocated/written data.  Every
block is potentially meaningful and needs to be copied for resync.

I have had thoughts about keeping track of which blocks have been used
so that 'TRIM' can be passed down.  But it is a long way from being a
reality.


> 
> If thats the case, we surely need to handle these two situations
> differently (1) which neil mentioned (2) the one I mentioned above.
> 
> Remember I referring to the case of creation.
> 
> >   mdadm -C /dev/md0 --level 1 -n 2 /dev/foo missing
> >   mdadm /dev/md0 --add /dev/bar
> >
> > then the contents on /dev/foo will not be changed (except for a few K
> > at the end for the metadata) and then all of foo will be copied to
> > bar.
> >
> 
> Will the create happen in the first place?

I don't understand this question, sorry.

NeilBrown


* Re: Requesting replace mode for changing a disk
  2009-05-13  4:54         ` Neil Brown
@ 2009-05-13  5:07           ` SandeepKsinha
  2009-05-13  5:21             ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: SandeepKsinha @ 2009-05-13  5:07 UTC (permalink / raw)
  To: Neil Brown; +Cc: lrhorer, Linux RAID

On Wed, May 13, 2009 at 10:24 AM, Neil Brown <neilb@suse.de> wrote:
> On Wednesday May 13, sandeepksinha@gmail.com wrote:
>> Hi,
>>
>> On Wed, May 13, 2009 at 10:01 AM, Neil Brown <neilb@suse.de> wrote:
>> > On Tuesday May 12, lrhorer@satx.rr.com wrote:
>> >>
>> >> But doesn't creating the array with the drive wipe the contents?  If so, it
>> >> doesn't seem to me this provides much redundancy.
>> >
>> > No.  Creating an array does not wipe the contents.
>> > It might cause a resync which will copy contents from one drive to the
>> > other and I don't promise which one.
>> > However if you:
>> >
>> Now, my question is that what if I create a RAID1 with 100 disks on each side.
>> Do you mean to say that there will be unnecessary resync happening
>> there as well, that too for unallocated/written data.
>
> I'm not sure what "100 disks on each side" means.
> Do you mean a raid1 across 100 devices?  i.e. 100 copies of each
> block?
>

I meant a mirror with 100 disks on each side. Sorry, I am not very
sure how md would handle it, but say I created two logical volumes of
100 disks each and tried to make a raid1 out of them.

> In any case, md has no concept of unallocated/written data.  Every
> block is potentially meaningful and needs to be copied for resync.
>

So at creation time it is always guaranteed that a resync will
happen. I believe this could be avoided by just adding a flag with
which the user can state that intention.
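mdadm does in fact expose a flag along these lines: --assume-clean skips the initial resync at creation, trusting that the members already match. A dry-run sketch with hypothetical device names (only safe when the contents really are identical or throwaway):

```shell
run() { echo "+ $*"; }                 # print each command instead of running it

# --assume-clean suppresses the initial resync; md then assumes the
# mirror halves already match, which is only true for identical or
# don't-care content.
run mdadm --create /dev/md1 --level=1 --raid-devices=2 \
    --assume-clean /dev/sda1 /dev/sdb1
```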

> I have had thoughts about keeping track of which blocks have been used
> so that 'TRIM' can be passed down.  But it is a long way from being a
> reality.
>
>
>>
>> If thats the case, we surely need to handle these two situations
>> differently (1) which neil mentioned (2) the one I mentioned above.
>>
>> Remember I referring to the case of creation.
>>
>> >   mdadm -C /dev/md0 --level 1 -n 2 /dev/foo missing
>> >   mdadm /dev/md0 --add /dev/bar
>> >
>> > then the contents on /dev/foo will not be changed (except for a few K
>> > at the end for the metadata) and then all of foo will be copied to
>> > bar.
>> >
>>
>> Will the create happen at the first place?
>
> I don't understand this question, sorry.
>
Actually I could not understand what you meant by "missing" in the
line above, which creates the array.

> NeilBrown
>



-- 
Regards,
Sandeep.

“To learn is to change. Education is a process that changes the learner.”
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
  2009-05-13  5:07           ` SandeepKsinha
@ 2009-05-13  5:21             ` NeilBrown
  2009-05-13  5:31               ` SandeepKsinha
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2009-05-13  5:21 UTC (permalink / raw)
  To: SandeepKsinha; +Cc: lrhorer, Linux RAID

On Wed, May 13, 2009 3:07 pm, SandeepKsinha wrote:
> On Wed, May 13, 2009 at 10:24 AM, Neil Brown <neilb@suse.de> wrote:
>> On Wednesday May 13, sandeepksinha@gmail.com wrote:
>>> Hi,
>>>
>>> On Wed, May 13, 2009 at 10:01 AM, Neil Brown <neilb@suse.de> wrote:
>>> > On Tuesday May 12, lrhorer@satx.rr.com wrote:
>>> >>
>>> >> But doesn't creating the array with the drive wipe the contents?  If
>>> so, it
>>> >> doesn't seem to me this provides much redundancy.
>>> >
>>> > No.  Creating an array does not wipe the contents.
>>> > It might cause a resync which will copy contents from one drive to
>>> the
>>> > other and I don't promise which one.
>>> > However if you:
>>> >
>>> Now, my question is that what if I create a RAID1 with 100 disks on
>>> each side.
>>> Do you mean to say that there will be unnecessary resync happening
>>> there as well, that too for unallocated/written data.
>>
>> I'm not sure what "100 disks on each side" means.
>> Do you mean a raid1 across 100 devices?  i.e. 100 copies of each
>> block?
>>
>
> I meant 100 disks on each side of the mirror.  Sorry, I am not sure
> how md would handle it, but say I created two logical volumes of 100
> disks each and tried to make a raid1 out of them.

For example, two raid0 arrays, each made of 100 drives, then a raid1
joining them.
Yes, you could do that (though it would generally be better to create
100 raid1 pairs and make a raid0 of them, but that is beside the point
I think).

>
>> In any case, md has no concept of unallocated/written data.  Every
>> block is potentially meaningful and needs to be copied for resync.
>>
>
> So, during creation it is guaranteed that a resync will always
> happen.  I believe this could be avoided by adding a flag so the
> user can specify their intention.

You can get the resync not to happen by using the "--assume-clean"
flag when you create the array.
However that doesn't really save a lot and it will still have to do
a complete copy if you ever fail a drive and replace it.
So it is a small optimisation.


>>>
>>> If thats the case, we surely need to handle these two situations
>>> differently (1) which neil mentioned (2) the one I mentioned above.
>>>
>>> Remember I referring to the case of creation.
>>>
>>> >   mdadm -C /dev/md0 --level 1 -n 2 /dev/foo missing
>>> >   mdadm /dev/md0 --add /dev/bar
>>> >
>>> > then the contents on /dev/foo will not be changed (except for a few K
>>> > at the end for the metadata) and then all of foo will be copied to
>>> > bar.
>>> >
>>>
>>> Will the create happen at the first place?
>>
>> I don't understand this question, sorry.
>>
> Actually I could not understand what you meant by "missing" in the
> line above, which creates the array.

Read the manpage for mdadm.

The word "missing" means create the array without any device in this slot.
It will be as though the device in that slot had failed and been removed.

NeilBrown


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
  2009-05-13  5:21             ` NeilBrown
@ 2009-05-13  5:31               ` SandeepKsinha
  2009-05-13 10:51                 ` Neil Brown
  0 siblings, 1 reply; 24+ messages in thread
From: SandeepKsinha @ 2009-05-13  5:31 UTC (permalink / raw)
  To: NeilBrown; +Cc: lrhorer, Linux RAID

On Wed, May 13, 2009 at 10:51 AM, NeilBrown <neilb@suse.de> wrote:
> On Wed, May 13, 2009 3:07 pm, SandeepKsinha wrote:
>> On Wed, May 13, 2009 at 10:24 AM, Neil Brown <neilb@suse.de> wrote:
>>> On Wednesday May 13, sandeepksinha@gmail.com wrote:
>>>> Hi,
>>>>
>>>> On Wed, May 13, 2009 at 10:01 AM, Neil Brown <neilb@suse.de> wrote:
>>>> > On Tuesday May 12, lrhorer@satx.rr.com wrote:
>>>> >>
>>>> >> But doesn't creating the array with the drive wipe the contents?  If
>>>> so, it
>>>> >> doesn't seem to me this provides much redundancy.
>>>> >
>>>> > No.  Creating an array does not wipe the contents.
>>>> > It might cause a resync which will copy contents from one drive to
>>>> the
>>>> > other and I don't promise which one.
>>>> > However if you:
>>>> >
>>>> Now, my question is that what if I create a RAID1 with 100 disks on
>>>> each side.
>>>> Do you mean to say that there will be unnecessary resync happening
>>>> there as well, that too for unallocated/written data.
>>>
>>> I'm not sure what "100 disks on each side" means.
>>> Do you mean a raid1 across 100 devices?  i.e. 100 copies of each
>>> block?
>>>
>>
>> I meant 100 disks on each side of the mirror.  Sorry, I am not sure
>> how md would handle it, but say I created two logical volumes of 100
>> disks each and tried to make a raid1 out of them.
>
> For example, two raid0 arrays, each made of 100 drives, then a raid1
> joining them.
> Yes, you could do that (though it would generally be better to create
> 100 raid1 pairs and make a raid0 of them, but that is beside the point
> I think).
>
>>
>>> In any case, md has no concept of unallocated/written data.  Every
>>> block is potentially meaningful and needs to be copied for resync.
>>>
>>
>> So, during creation it is guaranteed that a resync will always
>> happen.  I believe this could be avoided by adding a flag so the
>> user can specify their intention.
>
> You can get the resync not to happen by using the "--assume-clean"
> flag when you create the array.
> However that doesn't really save a lot and it will still have to do
> a complete copy if you ever fail a drive and replace it.
> So it is a small optimisation.
>
>
>>>>
>>>> If thats the case, we surely need to handle these two situations
>>>> differently (1) which neil mentioned (2) the one I mentioned above.
>>>>
>>>> Remember I referring to the case of creation.
>>>>
>>>> >   mdadm -C /dev/md0 --level 1 -n 2 /dev/foo missing
>>>> >   mdadm /dev/md0 --add /dev/bar
>>>> >
>>>> > then the contents on /dev/foo will not be changed (except for a few K
>>>> > at the end for the metadata) and then all of foo will be copied to
>>>> > bar.
>>>> >
>>>>
>>>> Will the create happen at the first place?
>>>
>>> I don't understand this question, sorry.
>>>
>> Actually I could not understand what you meant by "missing" in the
>> line above, which creates the array.
>
> Read the manpage for mdadm.
>
> The word "missing" means create the array without any device in this slot.
> It will be as though the device in that slot had failed and been removed.
>

Thanks but I am still confused. Here is what I see on the console.


[10:55:16 sinhas]$ sudo mdadm -C /dev/md0 --level 1 -n 2 /dev/sda5 missing
mdadm: /dev/sda5 appears to contain an ext2fs file system
    size=19535008K  mtime=Sun Apr  5 21:34:00 2009
mdadm: /dev/sda5 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed May 13 10:06:33 2009
Continue creating array? y
mdadm: RUN_ARRAY failed: Invalid argument
mdadm: stopped /dev/md0

> NeilBrown
>
>



-- 
Regards,
Sandeep.

“To learn is to change. Education is a process that changes the learner.”

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
  2009-05-13  4:31     ` Neil Brown
  2009-05-13  4:37       ` SandeepKsinha
@ 2009-05-13  7:28       ` Goswin von Brederlow
  1 sibling, 0 replies; 24+ messages in thread
From: Goswin von Brederlow @ 2009-05-13  7:28 UTC (permalink / raw)
  To: Neil Brown; +Cc: lrhorer, 'Linux RAID'

Neil Brown <neilb@suse.de> writes:

> On Tuesday May 12, lrhorer@satx.rr.com wrote:
>> 
>> But doesn't creating the array with the drive wipe the contents?  If so, it
>> doesn't seem to me this provides much redundancy.
>
> No.  Creating an array does not wipe the contents.
> It might cause a resync which will copy contents from one drive to the
> other and I don't promise which one.
> However if you:
>
>    mdadm -C /dev/md0 --level 1 -n 2 /dev/foo missing
>    mdadm /dev/md0 --add /dev/bar
>
> then the contents on /dev/foo will not be changed (except for a few K
> at the end for the metadata) and then all of foo will be copied to
> bar.
>
> NeilBrown

But as the disk is already part of a raid, those few K at the end
would be the critical metadata of the original raid.

This only works with a raid without metadata.

MfG
        Goswin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
  2009-05-13  4:36       ` Neil Brown
@ 2009-05-13  7:37         ` Goswin von Brederlow
  2009-05-13 11:02           ` Neil Brown
  2009-05-14 10:44         ` David Greaves
  1 sibling, 1 reply; 24+ messages in thread
From: Goswin von Brederlow @ 2009-05-13  7:37 UTC (permalink / raw)
  To: Neil Brown; +Cc: Goswin von Brederlow, lrhorer, 'Linux RAID'

Neil Brown <neilb@suse.de> writes:

> On Wednesday May 13, goswin-v-b@web.de wrote:
>> > OK, basically the same question.  How does one disassemble the RAID1 array
>> > without wiping the data on the new drive?
>> 
>> I think he ment this:
>> 
>> mdadm --stop /dev/md0
>> mdadm --build /dev/md9 --chunk=64k --level=1 --raid-devices=2 /dev/suspect /dev/new
>> mdadm --assemble /dev/md0 /dev/md9 /dev/other ...
>
> or better still:
>
>   mdadm --grow /dev/md0 --bitmap internal
>   mdadm /dev/md0 --fail /dev/suspect --remove /dev/suspect
>   mdadm --build /dev/md9 --level 1 --raid-devices 2 /dev/suspect missing
>   mdadm /dev/md0 --add /dev/md9
>   mdadm /dev/md9 --add /dev/new
>
> no down time at all.  The bitmap ensures that /dev/md9 will be
> recovered almost immediately once it is added back in to the array.

I keep forgetting bitmaps. :)

> The one problem with this approach is that if there is a read error on
> /dev/suspect while data is being copied to /dev/new, you lose.
>
> Hence the requested functionality which I do hope to implement for
> raid456 and raid10 (it adds no value to raid1).
> Maybe by the end of this year... it is on the roadmap.
>
> NeilBrown

What about raid0? You can't use your bitmap trick there.

MfG
        Goswin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
  2009-05-13  5:31               ` SandeepKsinha
@ 2009-05-13 10:51                 ` Neil Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Neil Brown @ 2009-05-13 10:51 UTC (permalink / raw)
  To: SandeepKsinha; +Cc: lrhorer, Linux RAID

On Wednesday May 13, sandeepksinha@gmail.com wrote:
> 
> Thanks but I am still confused. Here is what I see on the console.
> 
> 
> [10:55:16 sinhas]$ sudo mdadm -C /dev/md0 --level 1 -n 2 /dev/sda5 missing
> mdadm: /dev/sda5 appears to contain an ext2fs file system
>     size=19535008K  mtime=Sun Apr  5 21:34:00 2009
> mdadm: /dev/sda5 appears to be part of a raid array:
>     level=raid1 devices=2 ctime=Wed May 13 10:06:33 2009
> Continue creating array? y
> mdadm: RUN_ARRAY failed: Invalid argument
> mdadm: stopped /dev/md0

Odd.  Look for an explanatory message in 'dmesg'.

NeilBrown

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
  2009-05-13  7:37         ` Goswin von Brederlow
@ 2009-05-13 11:02           ` Neil Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Neil Brown @ 2009-05-13 11:02 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: lrhorer, 'Linux RAID'

On Wednesday May 13, goswin-v-b@web.de wrote:
> Neil Brown <neilb@suse.de> writes:
> 
> > On Wednesday May 13, goswin-v-b@web.de wrote:
> >> > OK, basically the same question.  How does one disassemble the RAID1 array
> >> > without wiping the data on the new drive?
> >> 
> >> I think he ment this:
> >> 
> >> mdadm --stop /dev/md0
> >> mdadm --build /dev/md9 --chunk=64k --level=1 --raid-devices=2 /dev/suspect /dev/new
> >> mdadm --assemble /dev/md0 /dev/md9 /dev/other ...
> >
> > or better still:
> >
> >   mdadm --grow /dev/md0 --bitmap internal
> >   mdadm /dev/md0 --fail /dev/suspect --remove /dev/suspect
> >   mdadm --build /dev/md9 --level 1 --raid-devices 2 /dev/suspect missing
> >   mdadm /dev/md0 --add /dev/md9
> >   mdadm /dev/md9 --add /dev/new
> >
> > no down time at all.  The bitmap ensures that /dev/md9 will be
> > recovered almost immediately once it is added back in to the array.
> 
> I keep forgetting bitmaps. :)
> 
> > The one problem with this approach is that if there is a read error on
> > /dev/suspect while data is being copied to /dev/new, you lose.
> >
> > Hence the requested functionality which I do hope to implement for
> > raid456 and raid10 (it adds no value to raid1).
> > Maybe by the end of this year... it is on the roadmap.
> >
> > NeilBrown
> 
> What about raid0? You can't use your bitmap trick there.

I seriously had not considered raid0 for this functionality at all.
I guess I assume that people who use raid0 directly on normal drives
don't really value their data, so if a device starts failing, they
will just give up the data as lost (i.e. use the raid0 as a cache for
something).

Maybe I need to come up with a way to atomically swap a device in any
array....  maybe.

I actually would really like to provide this hot-replace functionality
without explicitly implementing it for each level.

The first part of that is to implement support for maintaining a
bad-block-list.  This is a per-device list that identifies sectors
that should fail when read.

Then if you resync a raid1 and you get a read failure, you don't have
to reject the whole drive, you just record the bad block (on both
drives) and move on.

Then we can use "swap the drive for a raid1" to mostly implement
hot-replace.
Once the recovery finishes, mdadm can check out the bad block list,
and trigger a resync in the top-level array for just those sectors.
That will cause the bad block to be over-written by good data from the
top-level.   This removes the bad block from the list.
Once the list is empty (for the new drive), we swap out the raid1 and
put the new drive back in and all is happy.
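The repair loop described above can be sketched as a toy simulation. This is not md or mdadm code; the `Drive` class, the function name, and all other identifiers here are invented purely to illustrate the two steps (record read errors in a bad-block list instead of rejecting the drive, then overwrite each listed sector with good data from the top-level array):

```python
# Toy model of the proposed hot-replace repair loop.  Nothing here is
# real md/mdadm code; classes and names are invented for illustration.

class Drive:
    def __init__(self, name, data, bad=()):
        self.name = name
        self.data = dict(data)     # sector -> payload
        self.bad = set(bad)        # sectors that fail on read

    def read(self, sector):
        if sector in self.bad:
            raise IOError(f"{self.name}: read error at sector {sector}")
        return self.data[sector]

    def write(self, sector, payload):
        self.data[sector] = payload
        self.bad.discard(sector)   # a successful write clears the bad block


def hot_replace(top_level_good, suspect, new):
    """Copy suspect -> new; read failures go on a bad-block list (BBL)
    instead of failing the drive, then each BBL entry is repaired with
    good data from the top-level array, which empties the list."""
    bbl = set()
    for sector in suspect.data:
        try:
            new.write(sector, suspect.read(sector))
        except IOError:
            bbl.add(sector)            # record the bad block, keep going
    for sector in sorted(bbl):         # targeted top-level "resync"
        payload = top_level_good[sector]
        suspect.write(sector, payload) # over-written by good data
        new.write(sector, payload)
    bbl.clear()
    return bbl                         # empty => new drive can be swapped in


good = {0: b"a", 1: b"b", 2: b"c"}
suspect = Drive("suspect", good, bad={1})
new = Drive("new", {})
remaining = hot_replace(good, suspect, new)
print(remaining, new.data == good)     # set() True
```

In the real design the second loop would be mdadm writing a sector range and 'repair' into the array's sysfs files rather than a direct memory copy.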

To be able to use this as a real solution, I think we want that
atomic-swap function.  Using the bitmap trick is OK, but not ideal
(and, as you say, doesn't work on raid0).

My other unresolved issue about this approach is correct handling of
the metadata.  If we crash in the middle of a hot-recovery I want to
be sure that the new drive isn't mistakenly assumed to be fully
recovered.  When the metadata is at the end, that should "just work".
But when it is at the start it becomes more awkward.  This is probably
solvable, but I haven't solved it yet.

NeilBrown


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
  2009-05-13  4:36       ` Neil Brown
  2009-05-13  7:37         ` Goswin von Brederlow
@ 2009-05-14 10:44         ` David Greaves
  2009-05-14 12:00           ` Neil Brown
  1 sibling, 1 reply; 24+ messages in thread
From: David Greaves @ 2009-05-14 10:44 UTC (permalink / raw)
  To: Neil Brown; +Cc: Goswin von Brederlow, lrhorer, 'Linux RAID'

Neil Brown wrote:
> The one problem with this approach is that if there is a read error on
> /dev/suspect while data is being copied to /dev/new, you lose.
> 
> Hence the requested functionality which I do hope to implement for
> raid456 and raid10 (it adds no value to raid1).
> Maybe by the end of this year... it is on the roadmap.

Neil,
If you have ideas about how this should be accomplished, then outlining
them may provide a reasonable starting point for those new to the code,
especially if there are any steps you can clearly see that would help
others make a start.

I've posted this request a few times but the md code is sufficiently
overwhelming that I haven't attempted a solution.

David

-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
  2009-05-14 10:44         ` David Greaves
@ 2009-05-14 12:00           ` Neil Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Neil Brown @ 2009-05-14 12:00 UTC (permalink / raw)
  To: David Greaves; +Cc: Goswin von Brederlow, lrhorer, 'Linux RAID'

On Thursday May 14, david@dgreaves.com wrote:
> Neil Brown wrote:
> > The one problem with this approach is that if there is a read error on
> > /dev/suspect while data is being copied to /dev/new, you lose.
> > 
> > Hence the requested functionality which I do hope to implement for
> > raid456 and raid10 (it adds no value to raid1).
> > Maybe by the end of this year... it is on the roadmap.
> 
> Neil,
> If you have ideas about how this should be accomplished then outlining them may
> provide a reasonable starting point for those new to the code; especially  if
> there are any steps that you may clearly see that would help others to make a start.

As I said in some other email recently, I think an important precursor
to this hot-replace functionality is to support a per-device bad-block
list.  This allows a device to remain in an array even if a few blocks
have failed - only individual stripes will be degraded.
Then the hot-replace function can be used not only on drives that are
threatening bad blocks, but also on drives that have actually
delivered bad blocks.

The procedure for effecting a hot-replace would then be:
 - swap the suspect device for a no-metadata raid1 containing just
   the suspect device (it's not clear to me yet exactly how this
   will be managed but I have some ideas)
 - add the new device to the raid1
 - enable an in-memory bad-block list for the raid1
 - allow a recovery that just recovers the data part of the
   suspect device, not the metadata.  Any read errors will simply add
   to the bad block list
 - For each entry in this suspect drive's bad-block-list, trigger
   a resync of just that block in the top-level array.  This involves
   setting up 'low' and 'high' values via sysfs and writing 'repair'
   to sync_action.
   This should clear the entry from the bad block list.
 - once the bad block list is clear ... sort out the metadata somehow,
   and swap the new device in place of the raid1.

Getting the metadata right is the awkward bit.  When the main array
writes metadata to the raid1, I don't want it to go to the new drive
until the new drive actually has fully up-to-date data.
The only way I can think of at the moment to make it work is to build a
raid1 from just the data parts of the two devices, and use a linear
array to combine that with the metadata part of the suspect device
and give the linear array to the main array.  That would work, but it
seems rather ugly, so I'm not convinced.

Anyway, the first step is getting a bad-block-list working.

Below are some notes I wrote a while ago when someone else was showing
interest in a bad block list.  Nothing has come of that yet.
It envisages the BBL being associated with an 'externally managed
metadata' array.  For this purpose, I would want it also to work for
a "no metadata" array, and possibly for 1.x arrays with the kernel
writing the BBL to the device (maybe).

-------------------
I envisage these changes to the kernel:
 1/ store a BBL with each rdev, and make it available for read/write
    through a sysfs file (or two).
    It would probably be stored as an RB-tree or similar.  The
    assumption is that the list would normally be very small and
    sparse.

 2/ any READ request against a block that is listed in the BBL returns
    a failure (or is detected by read-balancing and causes a different
    device to be chosen).

 3/ any WRITE request against a block in the BBL is attempted and if
    it succeeds, the block is removed from the BBL.

 4/ When recovery gets a read failure, it adds the block to the BBL
    rather than trying to write it.
    Adding a block to the BBL causes the sysfs file to report as
    'urgent-readable' via 'poll' (POLLPRI), thus allowing userspace to
    find the new bad blocks and add them to the list on stable storage.

 5/ When a write error causes a drive to be marked as
    'failed/blocked', userspace can either unblock and remove it (as
    currently) or update the BBL with the offending blocks and
    re-enable the drive.

One difficulty is how to present the BBL through sysfs.
A sysfs file is limited to 4096 characters and we may want the BBL to
be large enough to exceed that.
I have an idea that entries in the BBL can be either 'acknowledged' or
'unacknowledged'.  Then the sysfs file lists the unacknowledged blocks
first.  userspace can write to the sysfs file to acknowledge blocks,
which then allows other blocks to appear in the file.

To read all the entries in the BBL, we could write a message that
means "mark all entries as unacknowledged", then read and acknowledge
until everything has been read.

Alternately we could have a second file into which we can write the
address of the smallest block that we want to read from the main file.
 
I'm assuming that the BBL would allow a granularity of 512 byte sectors.  
-----------------------------------------------

The 'bbl' would be a library of code that each raid personality can
choose to make use of, much like the bitmap.c code.

I think that implementing bbl.c should be a reasonably manageable
project for someone with reasonable coding skills but minimal
knowledge of md.  It would involve
  - creating and maintaining the in-memory bbl
  - providing access to it via sysfs
  - providing appropriate interface routines for md/raidX to call.

We would then need to define a way to enable a bbl on a given device.
I imagine the one sysfs file would serve.
  The file '/sys/block/mdX/md/dev-foo/bbl'
  initially reads as 'none'
  If you write 'clear' to it, an empty bbl is created
  If you write "+sector address", that address is added to it.
    If it was already present, it gets 'acknowledged'.
  If you write "-sector address", that address is removed
  If you write "flush" (??) all entries get un-acknowledged
  If you read, you get all the un-acknowledged addresses, in order, then
   all the acknowledged addresses.
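The proposed file semantics above can be mocked up in a few lines. This is purely illustrative userspace Python, not kernel code; the class name and the exact command handling (including the still-tentative "flush" keyword) are assumptions mirroring the sketch, not a real interface:

```python
# Illustrative model of the proposed /sys/block/mdX/md/dev-foo/bbl
# semantics sketched above: 'clear', '+sector', '-sector', 'flush',
# and reads that list un-acknowledged addresses before acknowledged ones.

class BBLFile:
    def __init__(self):
        self.entries = None              # None <=> no bbl; reads as 'none'

    def write(self, cmd):
        if cmd == "clear":
            self.entries = {}            # sector -> acknowledged?
        elif cmd == "flush":
            for s in self.entries:
                self.entries[s] = False  # mark everything un-acknowledged
        elif cmd.startswith("+"):
            s = int(cmd[1:])
            # adding an already-present entry acknowledges it
            self.entries[s] = s in self.entries
        elif cmd.startswith("-"):
            self.entries.pop(int(cmd[1:]), None)

    def read(self):
        if self.entries is None:
            return "none"
        unacked = sorted(s for s, a in self.entries.items() if not a)
        acked = sorted(s for s, a in self.entries.items() if a)
        return " ".join(map(str, unacked + acked))


bbl = BBLFile()
bbl.write("clear")
bbl.write("+100")
bbl.write("+40")
bbl.write("+40")     # second add of 40 acknowledges it
print(bbl.read())    # un-acknowledged 100 first, then acknowledged 40
```

With this ordering, userspace can drain an over-full list by repeatedly reading and acknowledging the entries at the front, exactly as the 4096-byte sysfs limit discussion below requires.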

It would be important that this does not slow IO down.  So lookups
should be fast. 
In most cases the list will be empty.  In that case, the lookup must be
extremely fast (definitely no locking)

Is that enough to get you started :-)

NeilBrown

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Requesting replace mode for changing a disk
@ 2009-05-13  4:08 Sandeep K Sinha
  0 siblings, 0 replies; 24+ messages in thread
From: Sandeep K Sinha @ 2009-05-13  4:08 UTC (permalink / raw)
  To: linux-raid

>"Leslie Rhorer" <lrhorer@satx.rr.com> writes:

>> This is one of many things proposed occasionally here, no real
>> objection, sometimes loud support, but no one actually *does* the code.
>
> 	At the risk of being a metoo, I would really love this feature.
>
>> You have described the problem exactly, and the solution is still to do
>> it manually. But you don't need to fail the drive long term, if you can
>> stop the array for a few moments. You stop the array, remove the suspect
>> drive,
>
> Um, how, exactly?  That is to say, after stopping the array, how does one
> remove the drive?  From the next step in your suggestion, it doesn't seem
> tome you are talking about physically removing the drive, so how does one
> remove a drive from a stopped array for this purpose?  I didn't think that
> either
>
> 	mdadm -r <drive> <array>
> or
>
> 	mdadm -f <drive> <array>
>
> could be used on a stopped array.  Am I mistaken?
>
>> create a raid1 of the suspect drive marked write-mostly and the
>> new spare,
>
> But doesn't creating the array with the drive wipe the contents?  If so, it
> doesn't seem to me this provides much redundancy.
>
>> then add the raid1 in place of the suspect drive.
>
> Before starting the array?  If so, how?  Or should one do an assemble
> including the newly minted RAID1?  I thought mdadm would take the newly
> added drive to be blank, even if it isn't.
>
>> For any
>> chunks present on the new drive the reads will go there, reducing
>
> Huh?  Are you saying any read which finds one chunk missing will
> automatically write back the missing data (doing a spot rebuild), or
> something else?
>

Yes, this is one of the concepts used in mirrored configurations,
though I am not sure whether md has it.  If a read fails on one side
of the mirror, it is served from the other side, and a write is then
issued to bring the mirror back in sync.  This should ideally happen:
if a read fails on the device being mirrored, you know that something
is wrong and should try to recover from it silently.


>> access, while data is copied from the old to the new in resync, and
>
> See my query above.  It seems to me you are saying the RAID1 can be created
> without wiping the drive.
>
>> writes still go to the old suspect drive so if the new drive fails you
>> are no worse off.
>
> I think I would expect the old drive to be more likely to fail than the new.
>
>> When the raid1 is clean you stop the main array and
>> back the suspect drive out.
>
> OK, basically the same question.  How does one disassemble the RAID1 array
> without wiping the data on the new drive?

> I think he ment this:
>
> mdadm --stop /dev/md0
> mdadm --build /dev/md9 --chunk=64k --level=1 --raid-devices=2 /dev/suspect /dev/new
> mdadm --assemble /dev/md0 /dev/md9 /dev/other ...

> MfG
>        Goswin

Now the point here is: is it really so simple to take an array
offline and just assemble it back with a different raid level?
IMO, that may be acceptable in a desktop environment, but not when
deployed on a larger scale.

What I believe is that there should be an option to fail a disk
gradually.  This could be invoked the moment you learn that a disk is
faulty and will soon give out.  The first step is to start copying the
contents of that disk to a spare disk; once that completes, the faulty
disk is replaced with the newly constructed one.

During the copy, writes must go to both devices and reads must be
served from the original (faulty) disk.  I am not sure whether md
already implements a similar mechanism, but this sounds better as a
long-term goal.

If not, I believe failing a disk should offer this extra choice of
"immediate" or "gradual" failure, as both should ideally start a
reconstruction.

Comments please.


--
Regards,
Sandeep.

“To learn is to change. Education is a process that changes the learner.”

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2009-05-14 12:00 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-08 22:15 Requesting replace mode for changing a disk Goswin von Brederlow
2009-05-09 11:41 ` John Robinson
2009-05-09 23:07 ` Bill Davidsen
2009-05-10  1:22   ` Goswin von Brederlow
2009-05-10  2:20   ` Guy Watkins
2009-05-10  7:02     ` Goswin von Brederlow
2009-05-10 14:33     ` Bill Davidsen
2009-05-10 15:55       ` Guy Watkins
2009-05-13  1:21   ` Leslie Rhorer
2009-05-13  3:27     ` Goswin von Brederlow
2009-05-13  4:36       ` Neil Brown
2009-05-13  7:37         ` Goswin von Brederlow
2009-05-13 11:02           ` Neil Brown
2009-05-14 10:44         ` David Greaves
2009-05-14 12:00           ` Neil Brown
2009-05-13  4:31     ` Neil Brown
2009-05-13  4:37       ` SandeepKsinha
2009-05-13  4:54         ` Neil Brown
2009-05-13  5:07           ` SandeepKsinha
2009-05-13  5:21             ` NeilBrown
2009-05-13  5:31               ` SandeepKsinha
2009-05-13 10:51                 ` Neil Brown
2009-05-13  7:28       ` Goswin von Brederlow
2009-05-13  4:08 Sandeep K Sinha
