* safe segmenting of conflicting changes (was: Two degraded mirror segments recombined out of sync for massive data loss)
@ 2010-04-23 13:42 Christian Gatzemeier
  2010-04-23 15:08 ` Phillip Susi
  2010-05-05 11:28 ` detecting segmentation / conflicting changes Christian Gatzemeier
  0 siblings, 2 replies; 20+ messages in thread
From: Christian Gatzemeier @ 2010-04-23 13:42 UTC (permalink / raw)
  To: linux-raid

Hi all,

On 4/7/2010 7:49 PM, Neil Brown wrote:
>
> mdadm --incremental should only include both disks in the array if
> 1/ their event counts are the same, or +/- 1, or
> 2/ there is a write-intent bitmap and the older event count is within
>    the range recorded in the write-intent bitmap.
>
> I don't think there is anything practical that could be changed in md
> or
> mdadm to make it possible to catch this behaviour and refuse to
> assemble the array... 

Maybe the superblocks of members containing conflicting
changes already provide that information. I.e. won't they claim each other
to have failed, while a real failed superblock does not claim itself or
others to have failed?


> Given that you have changed both halves, you have implicitly said that
> both halves are still "good".  If they are different, you need to
> explicitly tell md which one you want and which one you don't.

Before doing a dist-upgrade of your system (or larger refactoring changes
to data arrays), it is very handy to pull a member from a raid1 so that
you can revert (without much downtime) if something goes wrong, switch
between versions, and have both versions available for comparison and
repair.
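
For illustration, splitting off one half before an upgrade can be done
with current mdadm roughly like this (a sketch; the array and device
names are examples):

# keep the pre-upgrade state on sdb1 by dropping it from the mirror
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
# ... run the dist-upgrade on the degraded array ...
# happy with the result: re-add sdb1 and let it resync to the new state
mdadm /dev/md0 --add /dev/sdb1
# not happy: stop md0 and assemble the old state from sdb1 alone
mdadm --stop /dev/md0
mdadm --assemble --run /dev/md0 /dev/sdb1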

 
> If you break a mirror, change both halves, then put it together again
> there is no clearly "right" answer as to what will appear.

If the members are hot-plugged via --incremental, I think the first part
(segment) should appear. Any further segments with conflicting changes
should not be re-added automatically (because re-syncing is not an
update action in this case, but implies changes will get lost).

The same goes for --assemble: it should be OK to assemble the parts in
the order given on the command line, but as soon as any conflicting
changes are detected it should return:

"mdadm: not re-adding /dev/<member> to /dev/<array> because it
constitutes an alternative version containing conflicting changes"


> The easiest way to do this is to use --zero-superblock on the "bad"
> device.

This actually isn't so good in a udev/hot-plugging environment, since
blkid will then detect that device as containing the filesystem of the
md device (same UUID), and it gets mounted directly instead of the md
device.
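
For example, after --zero-superblock on the pulled member you can end up
with something like this (a sketch; device names and UUID are made up):

blkid /dev/md0 /dev/sdb1
# /dev/md0:  UUID="6bdc23a4-..." TYPE="ext3"
# /dev/sdb1: UUID="6bdc23a4-..." TYPE="ext3"
# same filesystem UUID on both nodes, so mount-by-UUID may grab /dev/sdb1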

Here is an idea to realize safe segmenting of conflicting changes with
the resulting alternative versions manageable in a hot-plugging manner:

* When assembling, check for conflicting "failed" states in the
  superblocks to detect conflicting changes. On conflicts, i.e. if an
  additional member claims an already running member has failed:
   + that member should not be added to the array
   + report (console and --monitor event) that an alternative
     version with conflicting changes has been detected: "mdadm: not
     re-adding /dev/<member> to /dev/<array> because it constitutes an
     alternative version containing conflicting changes"
   + require and support --force with --add for manual re-syncing of
     alternative versions (unlike re-syncing merely outdated
     devices/versions, in this case changes will get lost).
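
For illustration, a hot-plug session under this proposal might look as
follows (hypothetical: the rejection message and the --force requirement
are the proposed behaviour, not current mdadm output):

mdadm --incremental /dev/sda1   # first segment appears, md0 runs degraded
mdadm --incremental /dev/sdb1   # conflicting segment is plugged in later
# mdadm: not re-adding /dev/sdb1 to /dev/md0 because it constitutes an
# alternative version containing conflicting changes
mdadm /dev/md0 --add --force /dev/sdb1   # explicit override, discards sdb1's changes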

Enhancement 1)
  To facilitate easy inspection of alternative versions (i.e. for safe and
  easy diffing, merging, etc.) --incremental could assemble array
  components that contain alternative versions into temporary
  auxiliary devices. 
  (would require temporarily mangling the fs UUID to ensure there are no
  duplicates in the system)

Enhancement 2)
  Those who want to be able to disable hot-plugging of
  segments with conflicting changes/alternative versions, after an
  incident with multiple versions connected at the same time occurred,
  will need some additional enhancements:
   + A way to mark some raid members (segments) as containing
     known alternative versions, and to mark them as such when an
     incident occurs in which they come up after another
     segment of the array is already running degraded.
   + An option like
     "AUTO -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS"
     to disable hotplug support for alternative versions once they have
     come up after some other version.
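
In mdadm.conf that could look like the line below. The AUTO keyword and
its +/- token syntax exist in current mdadm; the long policy token itself
is the proposed, not yet existing, part:

# keep normal auto-assembly, but don't hot-plug known alternative versions
AUTO +1.x -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS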


Kind Regards,
Christian





* Re: safe segmenting of conflicting changes (was: Two degraded mirror segments recombined out of sync for massive data loss)
  2010-04-23 13:42 safe segmenting of conflicting changes (was: Two degraded mirror segments recombined out of sync for massive data loss) Christian Gatzemeier
@ 2010-04-23 15:08 ` Phillip Susi
  2010-04-23 18:18   ` Phillip Susi
  2010-04-23 21:04   ` safe segmenting of conflicting changes, and hot-plugging between alternative versions Christian Gatzemeier
  2010-05-05 11:28 ` detecting segmentation / conflicting changes Christian Gatzemeier
  1 sibling, 2 replies; 20+ messages in thread
From: Phillip Susi @ 2010-04-23 15:08 UTC (permalink / raw)
  To: Christian Gatzemeier; +Cc: linux-raid

On 4/23/2010 9:42 AM, Christian Gatzemeier wrote:
> Maybe the superblocks of members containing conflicting
> changes already provide that information. I.e. won't they claim each other
> to have failed, while a real failed superblock does not claim itself or
> others to have failed?

Indeed, they should both say the other is failed, so when mdadm
--incremental sees the second disk claims the first disk is failed, but
it is active and working fine in the running array, it should realize
that the superblock on the second disk is wrong, and correct it, which
would leave the second disk as failed, removed, and neither use the out
of sync data on the disk, nor overwrite it with a copy from the first.

In the process of correcting the wrong superblock on the second disk,
the write intent bitmap should be reset as well to force a complete
resync if you do add it back to the array.

> Before doing a dist-upgrade of your system (or larger refactoring changes
> to data arrays), it is very handy to pull a member from a raid1 so that
> you can revert (without much downtime) if something goes wrong, switch
> between versions, and have both versions available for comparison and
> repair.

If you intend to do that, you /should/ explicitly split the array first.
If you instead cause that by plugging one disk in alone and activating it
degraded, then doing the same to the other, then when you plug in both
this will be detected by the above corrective action, giving you the
opportunity to move the rejected disk to a new array for inspection, or
to force-add it back to the old array to discard its contents and resync.

>> If you break a mirror, change both halves, then put it together again
>> there is no clearly "right" answer as to what will appear.
> 
> If the members are hot-plugged via --incremental, I think the first part
> (segment) should appear. Any further segments with conflicting changes
> should not be re-added automatically (because re-syncing is not an
> update action in this case, but implies changes will get lost).

Exactly.

> * When assembling, check for conflicting "failed" states in the
>   superblocks to detect conflicting changes. On conflicts, i.e. if an
>   additional member claims an already running member has failed:
>    + that member should not be added to the array
>    + report (console and --monitor event) that an alternative
>      version with conflicting changes has been detected: "mdadm: not
>      re-adding /dev/<member> to /dev/<array> because it constitutes an
>      alternative version containing conflicting changes"
>    + require and support --force with --add for manual re-syncing of
>      alternative versions (unlike re-syncing merely outdated
>      devices/versions, in this case changes will get lost).

Yep, that's pretty much what I've been suggesting in the bug report,
except that I see the detailed message about conflicting changes as an
optional nicety.  Simply saying that the second disk is failed and
removed would be sufficient.

> Enhancement 1)
>   To facilitate easy inspection of alternative versions (i.e. for safe and
>   easy diffing, merging, etc.) --incremental could assemble array
>   components that contain alternative versions into temporary
>   auxiliary devices. 
>   (would require temporarily mangling the fs UUID to ensure there are no
>   duplicates in the system)

This part seems like it is outside the scope of mdadm, and should be
handled elsewhere.  Maybe in udisks.


* Re: safe segmenting of conflicting changes (was: Two degraded mirror segments recombined out of sync for massive data loss)
  2010-04-23 15:08 ` Phillip Susi
@ 2010-04-23 18:18   ` Phillip Susi
  2010-04-26 16:59     ` safe segmenting of conflicting changes Doug Ledford
  2010-04-23 21:04   ` safe segmenting of conflicting changes, and hot-plugging between alternative versions Christian Gatzemeier
  1 sibling, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2010-04-23 18:18 UTC (permalink / raw)
  To: linux-raid

After some more testing it seems the problems with --incremental are
deeper and more general.  I have found two steps that both appear to do
the wrong thing:

mdadm /dev/md0 --fail /dev/sdb
mdadm /dev/md0 --remove /dev/sdb

At this point mdadm -E /dev/sda shows that sdb has been removed, but
mdadm -E /dev/sdb shows both disks are still active and in sync.  The
metadata of sdb should be updated when failed or removed if possible.

mdadm --incremental /dev/sdb

This goes ahead and adds the disk back to the array, despite the fact
that it has been explicitly removed.

Whether or not the superblock on sdb is updated when it is removed,
--incremental should NOT use it as long as mdadm -D /dev/md0 says that
disk is removed, at least not use it in /dev/md0.
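
For reference, the whole sequence as one reproducible sketch (assumes a
running two-disk raid1 /dev/md0 and udev calling mdadm --incremental on
hotplug; device names are examples):

mdadm /dev/md0 --fail /dev/sdb      # mark the member faulty
mdadm /dev/md0 --remove /dev/sdb    # detach it from the array
mdadm -E /dev/sda                   # records sdb as removed
mdadm -E /dev/sdb                   # still claims both disks active/in sync
mdadm --incremental /dev/sdb        # and the disk goes right back in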


* Re: safe segmenting of conflicting changes, and hot-plugging between alternative versions
  2010-04-23 15:08 ` Phillip Susi
  2010-04-23 18:18   ` Phillip Susi
@ 2010-04-23 21:04   ` Christian Gatzemeier
  2010-04-24  8:10     ` Christian Gatzemeier
  2010-04-26 17:11     ` Doug Ledford
  1 sibling, 2 replies; 20+ messages in thread
From: Christian Gatzemeier @ 2010-04-23 21:04 UTC (permalink / raw)
  To: linux-raid

Phillip Susi <psusi <at> cfl.rr.com> writes:

> when mdadm
> --incremental sees the second disk claims the first disk is failed, but
> it is active and working fine in the running array, it should realize
> that the superblock on the second disk is wrong, and correct it, which
> would leave the second disk as failed, removed, and neither use the out
> of sync data on the disk, nor overwrite it with a copy from the first.

"Correcting the superblocks" of conflicting members would translate into
having a defined way to mark those members as composing a segment that
contains a known alternative version of the array. The earliest an
alternative version can be detected, and thus be known and marked as such,
is on an incident when a conflicting segment comes up while another
segment of the array is already running degraded. (To support segments
consisting of single raid member devices, it may be enough if a superblock
marking itself as failed would mean it contains conflicting changes.
Multi-member segments would require segment IDs.)

IMHO all segments with alternative versions can be marked as known on such
incidents. However, whether the segments containing alternative versions
continue to be assembled normally when they come up after the incident as
before, or whether they get ignored in favor of the (arbitrary) first
segment of the incident, should be configurable.

For users who don't need or want to switch between versions of an array
simply by switching disks in a hot-pluggable manner, and for those
concerned about a possible failure mode that makes disks available in an
alternating manner without them noticing it until an incident, I
suggested "AUTO -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS".

In order to manage segments with alternative versions in a hot-plug
manner, however, all segments need to continue to show up under their real
array ID if they are connected first or one at a time.
(KNOWN_ALTERNATIVE_VERSIONS need to be assembled if they come up.) If the
segments were transformed into separate arrays, the system would no longer
recognize them as segments of the array and would not boot or open them
correctly any more. And you wouldn't be able to switch between versions by
switching which disks are connected.
 
Kind regards,
Christian





* Re: safe segmenting of conflicting changes, and hot-plugging between alternative versions
  2010-04-23 21:04   ` safe segmenting of conflicting changes, and hot-plugging between alternative versions Christian Gatzemeier
@ 2010-04-24  8:10     ` Christian Gatzemeier
  2010-04-26 17:11     ` Doug Ledford
  1 sibling, 0 replies; 20+ messages in thread
From: Christian Gatzemeier @ 2010-04-24  8:10 UTC (permalink / raw)
  To: linux-raid


> (To support segments consisting of single raid member devices, it may
> be enough if a superblock marking itself as failed would mean it
> contains conflicting changes. Multi-member segments would require
> segment IDs.)

Don't know why I thought that extra IDs would be needed, as UUID + device
number is enough, and degrading into multi-device segments is and has been
working all along.

Aside from not catching a conflict in the case of two segments with the
same event count (both segments booted with only a few changes), i.e. the
main issue addressed in my original post, plugging in different versions
also just works.



Concerning the scheme to mark segments as containing conflicting changes for
Enhancement 2) from my original post:

I think saying that a superblock marking itself as failed (.) indicates a
known conflicting segment (alternative version) may just as well work for
multi-member segments.
In the example below, both dev1+dev2 can be considered to contain known
conflicting segments (alternative versions) because they mark themselves
as failed. But because they mark each other as up (U), they together
constitute one two-member segment. dev3 was the first segment in the
incident in which both segments came up together, because it does not mark
itself as failed.

dev1 [.U.]
dev2 [U..]
dev3 [..U]


Kind regards,
Christian






* Re: safe segmenting of conflicting changes
  2010-04-23 18:18   ` Phillip Susi
@ 2010-04-26 16:59     ` Doug Ledford
  2010-04-26 17:48       ` Phillip Susi
  0 siblings, 1 reply; 20+ messages in thread
From: Doug Ledford @ 2010-04-26 16:59 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid


On 04/23/2010 02:18 PM, Phillip Susi wrote:
> After some more testing it seems the problems with --incremental are
> deeper and more general.  I have found two steps that both appear to do
> the wrong thing:
> 
> mdadm /dev/md0 --fail /dev/sdb
> mdadm /dev/md0 --remove /dev/sdb
> 
> At this point mdadm -E /dev/sda shows that sdb has been removed, but
> mdadm -E /dev/sdb shows both disks are still active and in sync.  The
> metadata of sdb should be updated when failed or removed if possible.
> 
> mdadm --incremental /dev/sdb
> 
> This goes ahead and adds the disk back to the array, despite the fact
> that it has been explicitly removed.

Of course it does.  You've just explicitly readded it, which is no
different than your explicit removal.  Mdadm honors both.

> Whether or not the superblock on sdb is updated when it is removed,
> --incremental should NOT use it as long as mdadm -D /dev/md0 says that
> disk is removed, at least not use it in /dev/md0.

Why not?  It's not like it uses it without correcting the missing bits
first.  My guess is that you've either A) got a write intent bitmap or
B) did the test in such a way that the raid stack knows the drive is
actually still in sync, in which case it was readded without resyncing
or with only the resyncing necessary to satisfy the bitmap, either of
which was probably so fast you didn't notice it.  To test this to your
satisfaction and make sure that the raid stack is doing what you want,
fail then remove a drive, then do a bunch of writes to the array, *then*
readd the drive using incremental.  It will either get kicked out as
stale, or it will get resynced.  Can't remember which off the top of my
head.
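
A sketch of that test (assumes /dev/md0 is a raid1 of sda1/sdb1 mounted
at /mnt; names are examples):

mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
# dirty a meaningful amount of the array while sdb1 is out
dd if=/dev/urandom of=/mnt/scratch bs=1M count=512 && sync
mdadm --incremental /dev/sdb1       # simulate the hotplug path
cat /proc/mdstat                    # either a resync runs, or sdb1 was rejected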

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband




* Re: safe segmenting of conflicting changes, and hot-plugging between alternative versions
  2010-04-23 21:04   ` safe segmenting of conflicting changes, and hot-plugging between alternative versions Christian Gatzemeier
  2010-04-24  8:10     ` Christian Gatzemeier
@ 2010-04-26 17:11     ` Doug Ledford
  2010-04-26 21:10       ` Christian Gatzemeier
  1 sibling, 1 reply; 20+ messages in thread
From: Doug Ledford @ 2010-04-26 17:11 UTC (permalink / raw)
  To: Christian Gatzemeier; +Cc: linux-raid


On 04/23/2010 05:04 PM, Christian Gatzemeier wrote:
> Phillip Susi <psusi <at> cfl.rr.com> writes:
> 
>> when mdadm
>> --incremental sees the second disk claims the first disk is failed, but
>> it is active and working fine in the running array, it should realize
>> that the superblock on the second disk is wrong, and correct it, which
>> would leave the second disk as failed, removed, and neither use the out
>> of sync data on the disk, nor overwrite it with a copy from the first.
> 
> "Correcting the superblocks" of conflicting members would translate into
> having a defined way to mark those members as composing a segment that
> contains a known alternative version of the array. The earliest an
> alternative version can be detected, and thus be known and marked as such,
> is on an incident when a conflicting segment comes up while another
> segment of the array is already running degraded. (To support segments
> consisting of single raid member devices, it may be enough if a superblock
> marking itself as failed would mean it contains conflicting changes.
> Multi-member segments would require segment IDs.)
> 
> IMHO all segments with alternative versions can be marked as known on such
> incidents. However, whether the segments containing alternative versions
> continue to be assembled normally when they come up after the incident as
> before, or whether they get ignored in favor of the (arbitrary) first
> segment of the incident, should be configurable.
> 
> For users who don't need or want to switch between versions of an array
> simply by switching disks in a hot-pluggable manner, and for those
> concerned about a possible failure mode that makes disks available in an
> alternating manner without them noticing it until an incident, I
> suggested "AUTO -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS".
> 
> In order to manage segments with alternative versions in a hot-plug
> manner, however, all segments need to continue to show up under their real
> array ID if they are connected first or one at a time.
> (KNOWN_ALTERNATIVE_VERSIONS need to be assembled if they come up.) If the
> segments were transformed into separate arrays, the system would no longer
> recognize them as segments of the array and would not boot or open them
> correctly any more. And you wouldn't be able to switch between versions by
> switching which disks are connected.

Actually, I have a feature request that I haven't gotten around to yet
for something similar to this.  It's the ability to pause a raid1 array,
causing a member of the array to stop all updates while the rest of the
array operates as normal.  You then do your system updates, do your
testing, and if you decide it was a bad update, then you revert the
paused state of the array and you are back to the state you had prior to
the update.  The basic guidelines that I've worked out for how this must
be done are as follows:

1) Use mdadm to mark a constituent device of an array as a paused member
(add an internal write intent bitmap if no bitmap currently exists and
use bitmap to track changed areas of array).
2) Reboot, pause becomes effective on next assembly (this is because you
want to make sure the pause takes effect at a point in time when the
filesystem is clean; pausing it while the system is live would be bad).
3) Perform updates, do testing.
4) Either unpause the array, keeping current setup (in which case the
unpause is immediate and you start syncing the current array data to the
paused array member), or unpause --revert, in which case the unpause
does just like the pause did and waits until the next reboot to become
effective for the obvious reason that we can't revert filesystem state
on a live filesystem.
5) If we added a bitmap where none existed before, remove it.

Done.
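
As a purely hypothetical sketch of such an interface (none of these
options exist; the names are invented here to illustrate the workflow):

mdadm /dev/md0 --pause /dev/sdb1    # step 1: mark sdb1 paused, add bitmap
reboot                              # step 2: pause takes effect on assembly
# step 3: perform updates and testing
mdadm /dev/md0 --unpause /dev/sdb1            # step 4a: keep update, resync sdb1
mdadm /dev/md0 --unpause --revert /dev/sdb1   # step 4b: or revert at next reboot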

However, this is fairly orthogonal to the original problem you
mentioned, specifically that mounting members of a raid1 array
independently can trick them into thinking they are in sync when they
aren't.  The simplest solution to that problem would be to add a
generation count to each device's data in each superblock such that if
device B is failed from the array, then the subsequent update to the
superblock on device A would record not only that device B was failed,
but what the generation count was when device B was failed.  On
subsequent reassembly, if device B reappears, and the generation count
on device B does not match the recorded generation count for device B's
failure incident, then refuse to reassemble the devices into the same
array as this would indicate that the arrays have changed independent of
each other.  But that would probably require a superblock version update
to start storing that for each failed device.  Unless Neil could find
some place to stash the data in the current superblock layouts.
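
A rough shell sketch of the shape of that check (hypothetical: no current
superblock format stores a per-failure generation count, so the mdadm -E
output parsed below is invented for illustration):

# generation count A recorded when it failed B (proposed new field,
# invented output format -- does not exist in mdadm -E today)
gen_at_fail=$(mdadm -E /dev/sda1 | awk '/Failure generation/ {print $4}')
# the generation count B itself carries now (also invented)
gen_b=$(mdadm -E /dev/sdb1 | awk '/Generation/ {print $3}')
if [ "$gen_b" != "$gen_at_fail" ]; then
    echo "sdb1 changed independently since it was failed: refusing reassembly"
fi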

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband




* Re: safe segmenting of conflicting changes
  2010-04-26 16:59     ` safe segmenting of conflicting changes Doug Ledford
@ 2010-04-26 17:48       ` Phillip Susi
  2010-04-26 18:05         ` Doug Ledford
  0 siblings, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2010-04-26 17:48 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-raid

On 4/26/2010 12:59 PM, Doug Ledford wrote:
>> This goes ahead and adds the disk back to the array, despite the fact
>> that it has been explicitly removed.
> 
> Of course it does.  You've just explicitly readded it, which is no
> different than your explicit removal.  Mdadm honors both.

No, --incremental is automatically invoked by udev to scan disks as they
are detected and try to assemble them.  It isn't an explicit --add
operation.

>> Whether or not the superblock on sdb is updated when it is removed,
>> --incremental should NOT use it as long as mdadm -D /dev/md0 says that
>> disk is removed, at least not use it in /dev/md0.
> 
> Why not?  It's not like it uses it without correcting the missing bits
> first.  My guess is that you've either A) got a write intent bitmap or

Actually under the right circumstances it DOES use the second disk's
incorrect data without correcting it first, and if it does overwrite it,
that causes data loss, so it should not be done without an explicit --add
--force.  The fact that this happens is the entire reason for this thread.

Whether or not it can be added safely, the disk has been explicitly
removed so automatically adding it back is not acceptable.



* Re: safe segmenting of conflicting changes
  2010-04-26 17:48       ` Phillip Susi
@ 2010-04-26 18:05         ` Doug Ledford
  2010-04-26 18:43           ` Phillip Susi
  0 siblings, 1 reply; 20+ messages in thread
From: Doug Ledford @ 2010-04-26 18:05 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid


On 04/26/2010 01:48 PM, Phillip Susi wrote:
> On 4/26/2010 12:59 PM, Doug Ledford wrote:
>>> This goes ahead and adds the disk back to the array, despite the fact
>>> that it has been explicitly removed.
>>
>> Of course it does.  You've just explicitly readded it, which is no
>> different than your explicit removal.  Mdadm honors both.
> 
> No, --incremental is automatically invoked by udev to scan disks as they
> are detected and try to assemble them.  It isn't an explicit --add
> operation.

So, the point of raid is to be as reliable as possible: if the disk that
was once gone is now back, we want to use it if possible.

>>> Whether or not the superblock on sdb is updated when it is removed,
>>> --incremental should NOT use it as long as mdadm -D /dev/md0 says that
>>> disk is removed, at least not use it in /dev/md0.
>>
>> Why not?  It's not like it uses it without correcting the missing bits
>> first.  My guess is that you've either A) got a write intent bitmap or
> 
> Actually under the right circumstances it DOES use the second disk's
> incorrect data without correcting it first,

A problem for which I suggested a specific fix in another email.

> and if it does overwrite it,
> that causes data loss, so it should not be done without an explicit --add
> --force.  The fact that this happens is the entire reason for this thread.

The problem is the cause of this thread, and it's a bug that should be
fixed; it should not cause us to require things to have an explicit
--add --force to use a previously failed drive.  This is a case of
reacting to a bug by disabling a useful aspect of the stack instead of
simply fixing the bug IMO.  When the raid stack thinks that things are
out of sync, it doesn't automatically do anything bad.  The real bug
here is that there is a way to get things out of sync behind the raid
stack's back.

> Whether or not it can be added safely, the disk has been explicitly
> removed so automatically adding it back is not acceptable.

The md raid stack makes no distinction between explicit removal and a
device that disappeared because of a glitch in a USB cable or some such.
 In both cases the drive is failed and removed.  So the fact that you
draw a distinction is irrelevant until such time as the raid superblocks
are changed to be able to encode the cause of a device being removed in
addition to the fact that it simply was removed.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband




* Re: safe segmenting of conflicting changes
  2010-04-26 18:05         ` Doug Ledford
@ 2010-04-26 18:43           ` Phillip Susi
  2010-04-26 19:07             ` Doug Ledford
  0 siblings, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2010-04-26 18:43 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-raid

On 4/26/2010 2:05 PM, Doug Ledford wrote:
> So, the point of raid is to be as reliable as possible: if the disk that
> was once gone is now back, we want to use it if possible.

No, we don't.  I explicitly removed that disk from the array because I
have no wish for it to be there any more.  Maybe I plan on using it in
another array, or maybe I plan on shredding its contents.  Whatever I'm
planning for that disk, it does not involve it being used in a raid
array any more.

> The problem is the cause of this thread, and it's a bug that should be
> fixed; it should not cause us to require things to have an explicit
> --add --force to use a previously failed drive.  This is a case of

Then when the drive fails it should only be marked as failed, not also
removed.  If I manually remove it, then it should stay removed until I
decide to do something else with it.

> The md raid stack makes no distinction between explicit removal and a
> device that disappeared because of a glitch in a USB cable or some such.
>  In both cases the drive is failed and removed.  So the fact that you

Then that's the problem.  If it fails, it should be marked as failed.
If it is removed, it should be marked as removed.  They are two
different actions, that should have different results.  Why on earth the
two flags seem to always be used together is beyond me.


* Re: safe segmenting of conflicting changes
  2010-04-26 18:43           ` Phillip Susi
@ 2010-04-26 19:07             ` Doug Ledford
  2010-04-26 19:38               ` Phillip Susi
  0 siblings, 1 reply; 20+ messages in thread
From: Doug Ledford @ 2010-04-26 19:07 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid


On 04/26/2010 02:43 PM, Phillip Susi wrote:
> On 4/26/2010 2:05 PM, Doug Ledford wrote:
>> So, the point of raid is to be as reliable as possible, if the disk that
>> was once gone is now back, we want to use it if possible.
> 
> No, we don't.  I explicitly removed that disk from the array because I
> have no wish for it to be there any more.

Then you need to remove the superblock from the device.

>  Maybe I plan on using it in
> another array, or maybe I plan on shredding its contents.  Whatever I'm
> planning for that disk, it does not involve it being used in a raid
> array any more.

The problem here seems to be an issue of expectations.  You think that
"removed" is used as a flag to record intent, whereas it actually is
nothing more than a matter of state.

>> The problem is the cause of this thread, and it's a bug that should be
> fixed; it should not cause us to require things to have an explicit
>> --add --force to use a previously failed drive.  This is a case of
> 
> Then when the drive fails it should only be marked as failed, not also
> removed.  If I manually remove it, then it should stay removed until I
> decide to do something else with it.

As above, removed is a matter of state and simply means that the device
is no longer being held open by the raid device (aka, the device is no
longer a slave to the raid array).

>> The md raid stack makes no distinction between explicit removal and a
>> device that disappeared because of a glitch in a USB cable or some such.
>>  In both cases the drive is failed and removed.  So the fact that you
> 
> Then that's the problem.  If it fails, it should be marked as failed.
> If it is removed, it should be marked as removed.  They are two
> different actions, that should have different results.  Why on earth the
> two flags seem to always be used together is beyond me.

Failed is also a matter of state.  It means the device has encountered
some sort of error and we should no longer attempt to send any
read/write commands to the device.  It is not a statement of *why* it's
in that state.  The removed state indicates that the device has been
removed from the array and is no longer a slave to the array.  Again, no
indication of intent or cause, purely an issue of state.

And therein lies the reason that you must go through both states in order
to remove a drive.  The first state, Failed, is what stops commands from
going to the drive, and the second state, Removed, is what actually
removes the dead drive from the array.  We don't allow you to remove a
live drive from the array.

Now, as those are both states, and not causes or intents, it is the
removing of the superblock that signifies intent that a drive should no
longer be part of the array.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband




* Re: safe segmenting of conflicting changes
  2010-04-26 19:07             ` Doug Ledford
@ 2010-04-26 19:38               ` Phillip Susi
  2010-04-26 23:33                 ` Doug Ledford
  0 siblings, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2010-04-26 19:38 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-raid

On 4/26/2010 3:07 PM, Doug Ledford wrote:
> Then you need to remove the superblock from the device.

Why?  It has been removed.  In English removed means it is no longer
part of the array.  Elements which are not part of the array should not
be MADE part of the array just because they happen to be there.

Having to zero the superblock after failing and removing the drive is a
race condition with detecting the drive and automatically adding it back
to the array.  To properly remove the disk from the array the superblock
needs to be updated before the kernel releases the underlying device.

> The problem here seems to be an issue of expectations.  You think that
> "removed" is used as a flag to record intent, whereas it actually is
> nothing more than a matter of state.

No, I don't think it has anything to do with intent.  I think that the
state of being removed means it is no longer part of the array.  It
sounds like your understanding of the state should be described in
English as detached or disconnected, rather than removed.

> Failed is also a matter of state.  It means the device has encountered
> some sort of error and we should no longer attempt to send any
> read/write commands to the device.  It is not a statement of *why* it's
> in that state.  The removed state indicates that the device has been
> removed from the array and is no longer a slave to the array.  Again, no
> indication of intent or cause, purely an issue of state.

Yes, it does not indicate why, nor do we care.  What we care about is
that the drive failed or was removed, so we should not be using it.  Why
bother recording that fact in the superblock if you're just going to
ignore it the next time you start the array?



* Re: safe segmenting of conflicting changes, and hot-plugging between alternative versions
  2010-04-26 17:11     ` Doug Ledford
@ 2010-04-26 21:10       ` Christian Gatzemeier
  0 siblings, 0 replies; 20+ messages in thread
From: Christian Gatzemeier @ 2010-04-26 21:10 UTC (permalink / raw)
  To: linux-raid

Doug Ledford <dledford <at> redhat.com> writes:
> 
> Actually, I have a feature request that I haven't gotten around to yet
> for something similar to this.  It's the ability to pause a raid1 array,
> causing a member of the array to stop all updates while the rest of the
> array operates as normal.

Indeed that is quite similar. Related terms would be "paused segment" and
"alternative version/segment", the latter probably "locked-out".

The main differences are that cleanly pausing a segment would be done by
issuing a command, while segmenting can also happen due to failure modes or
intentional hot/cold-plugging, and that a segment containing an alternative
version would not necessarily have to be static. Though, by making use of
some new "locking-out" functionality, the pause command could make sure the
alternative version is never auto-assembled and stays static from the
start, while the proposed enhancement 2) envisioned this only after
incidents where conflicting versions appeared together.

So it looks as if intentional "pausing" could be implemented as
("alternative version" + "lock-out") and could at the same time allow safe
segmenting in other circumstances.

A mark for "locked-out" members alone may be enough to implement all this.
So I'd suggest that "a superblock marking itself as removed" may be the
mark for "locked out" rather than for "alternative version", and be exempt
from auto-re-adding.

If we can reliably detect alternative versions by checking for conflicts in
the failed claims of superblocks, we probably don't need another extra
measure to mark superblocks as containing an alternative version. And
pausing a segment would (on shutdown) make the paused segment claim that
the rest of the array failed and the paused segments were removed, while
the rest claims that the paused segment failed and was removed.

Can someone find a flaw in the "superblock marking itself as removed" approach?




> However, this is fairly orthogonal to the original problem you
> mentioned, specifically that mounting members of a raid1 array
> independently can trick them into thinking they are in sync when they
> aren't.

Hm, more or less. In the case at hand, detection of the conflicting changes
failed, and with it auto-segmenting, or more explicitly, keeping apart the
alternative versions that were created by degrading different segments on
different boots. I was seeing it as a test case for safe segmenting, in
which the versions have not diverged much (same event count +/-1, or within
the bitmap range).

> The simplest solution to that problem would be to add a
> generation count to each device's data in each superblock

Ah ok, I understand that may be easier to implement.

Can you see some flaw in checking for superblocks that mark running
superblocks as faulty, as a conflict-detection algorithm? That approach may
not be limited only to new superblocks.







* Re: safe segmenting of conflicting changes
  2010-04-26 19:38               ` Phillip Susi
@ 2010-04-26 23:33                 ` Doug Ledford
  2010-04-27 16:20                   ` Phillip Susi
  0 siblings, 1 reply; 20+ messages in thread
From: Doug Ledford @ 2010-04-26 23:33 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid


On 04/26/2010 03:38 PM, Phillip Susi wrote:
> On 4/26/2010 3:07 PM, Doug Ledford wrote:
>> Then you need to remove the superblock from the device.
> 
> Why?  It has been removed.  In English removed means it is no longer
> part of the array.

And in English raid means a hostile or predatory incursion, it has
nothing to do with disc drives.  And in English cat is an animal you
pet.  So technical jargon and regular English don't always agree, what's
your point?

>  Elements which are not part of the array should not
> be MADE part of the array just because they happen to be there.

Sorry, but that's just not going to happen, ever.  There are any number of
valid reasons why someone might want to temporarily remove a drive from
an array and then readd it back later, and when they readd it back they
want it to come back, and they want it to know that it used to be part
of the array and only resync the necessary bits (if you have a write
intent bitmap, otherwise it resyncs the whole thing).
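
For instance, with a write-intent bitmap the temporary-removal case looks
like this with current mdadm (a sketch; device names are examples):

mdadm --grow /dev/md0 --bitmap=internal          # track dirty regions
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
# ... drive is away for a while, array keeps running degraded ...
mdadm /dev/md0 --re-add /dev/sdb1                # resyncs only the dirty bits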

> Having to zero the superblock after failing and removing the drive is a
> race condition with detecting the drive and automatically adding it back
> to the array.

No, it's not.  The udev rules that add the drive don't race with
manually removing it because they don't act on change events, only add
events.

>  To properly remove the disk from the array the superblock
> needs to be updated before the kernel releases the underlying device.

Not going to happen.  Doing what you request would undo a number of very
useful features in the raid stack.  So you might as well save your
breath; we aren't going to make a remove event equivalent to a zero
superblock event because then the entire --readd option would be
rendered useless.

>> The problem here seems to be an issue of expectations.  You think that
>> "removed" is used as a flag to record intent, whereas it actually is
>> nothing more than a matter of state.
> 
> No, I don't think it has anything to do with intent.  I think that the
> state of being removed means it is no longer part of the array.  It
> sounds like your understanding of the state should be described in
> English as detached or disconnected, rather than removed.

Depends on context.  Removed makes perfect sense from the point of view
that the device has been removed from the list of devices currently held
with an exclusive open by the md raid stack.

>> Failed is also a matter of state.  It means the device has encountered
>> some sort of error and we should no longer attempt to send any
>> read/write commands to the device.  It is not a statement of *why* it's
>> in that state.  The removed state indicates that the device has been
>> removed from the array and is no longer a slave to the array.  Again, no
>> indication of intent or cause, purely an issue of state.
> 
> Yes, it does not indicate why, nor do we care.  What we care about is
> that the drive failed or was removed, so we should not be using it.  Why
> bother recording that fact in the superblock if you're just going to
> ignore it the next time you start the array?

Because there are both transient and permanent failures.  Experience
caused us to switch from treating all failures as permanent to treating
failures as transient and picking up where we left off if at all
possible because too many people were having a single transient failure
render their array degraded, only to have a real issue come up sometime
later that then meant the array was no longer degraded, but entirely
dead.  The job of the raid stack is to survive as much failure as
possible before dying itself.  We can't do that if we allow a single,
transient event to cause us to stop using something entirely.

Besides, what you seem to be forgetting is that those events that make
us genuinely not want to use a device also make it so that at the next
reboot the device generally isn't available or seen by the OS
(controller failure, massive failure of the platter, etc).  Simply
failing and removing a device using mdadm mimics a transient failure.
If you fail, remove, then zero-superblock then you mimic a permanent
failure.  There you go, you have a choice.
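
Concretely (a sketch; sdc1 is an example):

# transient failure: the member may be re-added or picked up by hotplug
mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1
# permanent failure: the member will never be considered again
mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1
mdadm --zero-superblock /dev/sdc1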

If we were to do as you wish, then users would no longer have a choice;
they would be forced into mimicking a hard failure only.  I prefer to
give users a choice on how they want to do things.  So, just because you
happen to think that the only way it *should* be done is like a hard
failure doesn't mean we are going to change it to be that way.  Things
are the way they are for a reason, best to just learn to use
--zero-superblock if that's what you want.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband




* Re: safe segmenting of conflicting changes
  2010-04-26 23:33                 ` Doug Ledford
@ 2010-04-27 16:20                   ` Phillip Susi
  2010-04-27 17:27                     ` Doug Ledford
  0 siblings, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2010-04-27 16:20 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-raid

On 4/26/2010 7:33 PM, Doug Ledford wrote:
> And in English raid means a hostile or predatory incursion; it has
> nothing to do with disc drives.  And in English cat is an animal you
> pet.  So technical jargon and regular English don't always agree; what's
> your point?

RAID is an acronym that just happens to also spell an English word.
Removed and Failed are not, and your cat example is a complete non
sequitur.  My point is that when naming technical things you should do
so sanely.  You wouldn't label the state a disk goes into when it keeps
failing IO requests as "Iceland" would you?  Of course not.  The state
is named failed because the disk has failed.

>>  Elements which are not part of the array should not
>> be MADE part of the array just because they happen to be there.
> 
> Sorry, but that's just not going to happen, ever.  There are any number of
> valid reasons why someone might want to temporarily remove a drive from
> an array and then readd it back later, and when they readd it back they
> want it to come back, and they want it to know that it used to be part
> of the array and only resync the necessary bits (if you have a write
> intent bitmap, otherwise it resyncs the whole thing).

I didn't say never allow it to be added back; I said don't go doing it
automatically.  An explicit add should, of course, work as it does now,
but it should not be added just because udev decided it has appeared and
called mdadm --incremental on it.

> No, it's not.  The udev rules that add the drive don't race with
> manually removing it because they don't act on change events, only add
> events.

And who is to say that you won't get one of those?  A power failure
happens after --remove, and when the system comes back up, voila, the
disk gets put back into the array.  Or maybe your hotplug environment has
a loose cable that slips out and you put it back.  This clearly violates
the principle of least surprise.

> Not going to happen.  Doing what you request would undo a number of very
> useful features in the raid stack.  So you might as well save your
> breath; we aren't going to make a remove event equivalent to a zero
> superblock event because then the entire --readd option would be
> rendered useless.

I didn't say that.  I said that a remove event 1) should actually bother
recording the removed state on the disk being removed (right now it
only records it on the other disks), and 2) the fact that the disk is
in the removed state should prevent --incremental from automatically
re-adding it.

> Because there are both transient and permanent failures.  Experience
> caused us to switch from treating all failures as permanent to treating
> failures as transient and picking up where we left off if at all
> possible because too many people were having a single transient failure
> render their array degraded, only to have a real issue come up sometime
> later that then meant the array was no longer degraded, but entirely
> dead.  The job of the raid stack is to survive as much failure as
> possible before dying itself.  We can't do that if we allow a single,
> transient event to cause us to stop using something entirely.

That's a good thing and is why it is fine for --incremental to activate
a disk in the failed state if it appears to have returned to being
operational and it is safe to do so (meaning it hasn't also been
activated degraded).  It should not do this for the removed state, however.

> Besides, what you seem to be forgetting is that those events that make
> us genuinely not want to use a device also make it so that at the next
> reboot the device generally isn't available or seen by the OS
> (controller failure, massive failure of the platter, etc).  Simply
> failing and removing a device using mdadm mimics a transient failure.
> If you fail, remove, then zero-superblock then you mimic a permanent
> failure.  There you go, you have a choice.

Failed and removed are two different states; they should have different
behaviors.  Failed = temporary, removed = more permanent.
zero-superblock is completely permanent.  Removed should be a good
middle ground where you still CAN re-add the device, but it should not
be done automatically.


* Re: safe segmenting of conflicting changes
  2010-04-27 16:20                   ` Phillip Susi
@ 2010-04-27 17:27                     ` Doug Ledford
  2010-04-27 18:04                       ` Phillip Susi
  0 siblings, 1 reply; 20+ messages in thread
From: Doug Ledford @ 2010-04-27 17:27 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid


On 04/27/2010 12:20 PM, Phillip Susi wrote:
> RAID is an acronym that just happens to also spell an English word.
> Removed and Failed are not, and your cat example is a complete non
> sequitur.  My point is that when naming technical things you should do
> so sanely.  You wouldn't label the state a disk goes into when it keeps
> failing IO requests as "Iceland" would you?  Of course not.  The state
> is named failed because the disk has failed.

And the state "removed" is labeled as such because the device has been
removed from the list of slave devices that the kernel keeps.  Nothing
more.  You are reading into it things that weren't intended.  What you
are reading into it might even be a reasonable interpretation, but it's
not the actual interpretation.

> I didn't say never allow it to be added back; I said don't go doing it
> automatically.  An explicit add should, of course, work as it does now,
> but it should not be added just because udev decided it has appeared and
> called mdadm --incremental on it.

This is, in fact, completely contrary to where we are heading with
things.  We in fact *do* want udev-invoked incremental rules to readd
the device after it has been removed.  The entire hotunplug/hotplug
support I'm working on does *exactly* that.  On device removal it does
both a fail and a remove action, and on device insertion it does a readd
or add as needed.

So, as I said, you are reading more into "removed" than we intend, and
we *will* be automatically removing devices when they go away, so it's
entirely appropriate that if we automatically remove them then we don't
consider "removed" to be a manual-intervention-only state; it is a valid
automatic state, and recovery from it should be equally automatic.

>> No, it's not.  The udev rules that add the drive don't race with
>> manually removing it because they don't act on change events, only add
>> events.
> 
> And who is to say that you won't get one of those?  A power failure
> happens after --remove, and when the system comes back up, voila, the
> disk gets put back into the array.  Or maybe your hotplug environment
> has a loose cable that slips out and you put it back.  This clearly
> violates the principle of least surprise.

No, it doesn't.  This is exactly what people expect in a hotplug
environment.  A device shows up, you use it.  If you don't want the
device to be used, then remove the superblock.  This whole argument
centers around the fact that, to you, --remove means "don't use this
device again".  That's a very reasonable thing to think, but it's not
actually what it means.  It simply means "remove this device from the
slaves held by this array".  Only under certain circumstances will it
get readded back into the array automatically (you reboot the machine,
power failure, cable unplug/plug, etc.)  This is because of the
interaction between hotplug discovery and the fact that we merely
removed the drive from the list of slaves to the array, we did not mark
the drive as "not to be used".  That's what zero-superblock is for.  And
this whole argument that the drive being readded is a big deal is bogus
too.  You can always just re-remove the device if it got added.  If you
were wanting to preserve the data on the drive (say you were splitting a
raid1 array and wanting it to remain as it was for possible revert
capability) then you could issue this command:

mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1; mdadm --zero-superblock /dev/sdc1

and that should be sufficient to satisfy your needs.  If we race between
the remove and the zero-superblock with something like a power failure
then obviously so little will have changed that you can simply repeat
the procedure until you successfully complete it without a power failure.

>> Not going to happen.  Doing what you request would undo a number of very
>> useful features in the raid stack.  So you might as well save your
>> breath, we aren't going to make a remove event equivalent to a zero
>> superblock event because then the entire --readd option would be
>> rendered useless.
> 
> I didn't say that.  I said that a remove event 1) should actually bother
> recording the removed state on the disk being removed (right now it
> only records it on the other disks),

This is intentional.  A remove event merely triggers a kernel error
cycle on the target device.  We don't differentiate between a user
initiated remove and one that's the result of catastrophic disc failure.
 However, trying to access a dead disc causes all sorts of bad behavior
on a real running system with a real disc failure, so once we know a
disc is bad and we are kicking it from the array, we only try to write
that data to the good discs so we aren't hosing the system.

> and 2) the fact that the disk is
> in the removed state should prevent --incremental from automatically
> re-adding it.

We are specifically going in the opposite direction here.  We *want* to
automatically readd removed devices because we are implementing
automatic removal on hot unplug, which means we want automatic addition
on hot plug.

>> Because there are both transient and permanent failures.  Experience
>> caused us to switch from treating all failures as permanent to treating
>> failures as transient and picking up where we left off if at all
>> possible because too many people were having a single transient failure
>> render their array degraded, only to have a real issue come up sometime
>> later that then meant the array was no longer degraded, but entirely
>> dead.  The job of the raid stack is to survive as much failure as
>> possible before dying itself.  We can't do that if we allow a single,
>> transient event to cause us to stop using something entirely.
> 
> That's a good thing and is why it is fine for --incremental to activate
> a disk in the failed state if it appears to have returned to being
> operational and it is safe to do so (meaning it hasn't also been
> activated degraded).  It should not do this for the removed state, however.

Again, we are back to the fact that you are interpreting removed to be
something it isn't.  We can argue about this all day long, but that
option has had a specific meaning for long enough, and has been around
long enough, that it can't be changed now without breaking all sorts of
backward compatibility.

>> Besides, what you seem to be forgetting is that those events that make
>> us genuinely not want to use a device also make it so that at the next
>> reboot the device generally isn't available or seen by the OS
>> (controller failure, massive failure of the platter, etc).  Simply
>> failing and removing a device using mdadm mimics a transient failure.
>> If you fail, remove, then zero-superblock then you mimic a permanent
>> failure.  There you go, you have a choice.
> 
> Failed and removed are two different states; they should have different
> behaviors.  Failed = temporary, removed = more permanent.

There is *no* such distinction between failed and removed.  Only *you*
are inferring that distinction.  The real distinction is failed == no
longer allowed to process read/write requests from the block layer but
still present as a slave to the array, removed == no longer present as a
slave to the array.

> zero-superblock is completely permanent.  Removed should be a good
> middle ground where you still CAN re-add the device, but it should not
> be done automatically.

A semantic change such as this would require huge amounts of pain in
terms of fixing up scripts to do as you expect.  It would be far easier
on the entire mdadm-using world to add a new option that implements what
you want instead of changing existing behavior.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband




* Re: safe segmenting of conflicting changes
  2010-04-27 17:27                     ` Doug Ledford
@ 2010-04-27 18:04                       ` Phillip Susi
  2010-04-27 19:29                         ` Doug Ledford
  0 siblings, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2010-04-27 18:04 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-raid

On 4/27/2010 1:27 PM, Doug Ledford wrote:
> And the state "removed" is labeled as such because the device has been
> removed from the list of slave devices that the kernel keeps.  Nothing
> more.  You are reading into it things that weren't intended.  What you
> are reading into it might even be a reasonable interpretation, but it's
> not the actual interpretation.

Then it is a useless and nonexistent state.  Whether the kernel
currently has the disk open or not is purely ephemeral; there is no
reason to bother recording it in the superblock.  Why bother recording
that bit in the superblock if you are just going to ignore it?  It is
there, so it should not be ignored.

> This is, in fact, completely contrary to where we are heading with
> things.  We in fact *do* want udev-invoked incremental rules to readd
> the device after it has been removed.  The entire hotunplug/hotplug
> support I'm working on does *exactly* that.  On device removal it does
> both a fail and a remove action, and on device insertion it does a
> readd or add as needed.

Only because failed and removed are currently entangled states rather
than different states.  If they were disentangled so that removed
actually meant something different from failed, then failed would be
the state that you want to automatically pick back up in --incremental,
and removed should only be set when you explicitly --remove.

> So, as I said, you are reading more into "removed" than we intend, and
> we *will* be automatically removing devices when they go away, so it's
> entirely appropriate that if we automatically remove them then we don't
> consider "removed" to be a manual-intervention-only state; it is a valid
> automatic state, and recovery from it should be equally automatic.

Again, failed and removed should be two different states.  If the disk
fails, it should be marked as failed, not removed.

> This is because of the interaction between hotplug discovery and the
> fact that we merely removed the drive from the list of slaves to the
> array; we did not mark the drive as "not to be used".

That's what removed /should/ be, since under the current definition the
state does not actually exist in a persistent way.  You may as well
remove the bit from the superblock entirely, since it is always ignored.

> mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1; mdadm --zero-superblock /dev/sdc1
> 
> and that should be sufficient to satisfy your needs.  If we race between
> the remove and the zero-superblock with something like a power failure
> then obviously so little will have changed that you can simply repeat
> the procedure until you successfully complete it without a power failure.

Except that now I can't manually re-add the disk to the array.

> This is intentional.  A remove event merely triggers a kernel error
> cycle on the target device.  We don't differentiate between a user
> initiated remove and one that's the result of catastrophic disc failure.

Why not?  Seems to be because of a broken definition of the removed state.

>  However, trying to access a dead disc causes all sorts of bad behavior
> on a real running system with a real disc failure, so once we know a
> disc is bad and we are kicking it from the array, we only try to write
> that data to the good discs so we aren't hosing the system.

That is what failed is for.  Once failed, you can't write to the disk
anymore.  If it isn't failed, you should be able to remove it, and the
superblock should be updated prior to detaching.

> We are specifically going in the opposite direction here.  We *want* to
> automatically readd removed devices because we are implementing
> automatic removal on hot unplug, which means we want automatic addition
> on hot plug.

Again, this is because you are conflating removed and failed.  If you
treat them separately, then you don't have this problem.  Failed happens
automatically, so it can be undone automatically; removed does not.

> Again, we are back to the fact that you are interpreting removed to be
> something it isn't.  We can argue about this all day long, but that
> option has had a specific meaning for long enough, and has been around
> long enough, that it can't be changed now without breaking all sorts of
> backward compatibility.

This may be true for the --remove switch, but as it stands right now,
the removed flag in the superblock is utterly meaningless.  If you want
--remove to continue to behave the way it does, then you could add a
--really-remove command to set the removed flag, though this seems a
little silly.


* Re: safe segmenting of conflicting changes
  2010-04-27 18:04                       ` Phillip Susi
@ 2010-04-27 19:29                         ` Doug Ledford
  2010-04-28 13:22                           ` Phillip Susi
  0 siblings, 1 reply; 20+ messages in thread
From: Doug Ledford @ 2010-04-27 19:29 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid

On 04/27/2010 02:04 PM, Phillip Susi wrote:
> On 4/27/2010 1:27 PM, Doug Ledford wrote:
>> And the state "removed" is labeled as such because the device has been
>> removed from the list of slave devices that the kernel keeps.  Nothing
>> more.  You are reading into it things that weren't intended.  What you
>> are reading into it might even be a reasonable interpretation, but it's
>> not the actual interpretation.
> 
> Then it is a useless and nonexistent state.

Actually, that's more true than you realize.  Technically, there is no
"removed" state (at least not one that's stored on the removed drive).
There are working, failed, and removed discs as far as the slot array
for the md device is concerned.  The usage of REMOVED in the device
states is limited to marking slots in the constituent device list that
have neither a working disc nor a failed disc associated with them.  It
has nothing to do with whether or not mdadm --remove was ever called.

>  Whether the kernel
> currently has the disk open or not is purely ephemeral; there is no
> reason to bother recording it in the superblock.

We do record it, but mainly in the form of our disk count.  We keep
track of active discs, failed discs, and spare discs.  Whether we have
the disc open and in our device list changes what our failed disc count
would be.  On removal, we decrement that failed disc count.  The slot
that the disc used to occupy gets changed from being failed to removed
simply to indicate there is no rdev associated with that slot.  That is
the information we record.
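
Approximately what that looks like in an --examine dump (0.90-format
field names; abbreviated, and the exact layout varies by superblock
version):

  $ mdadm --examine /dev/sda1    # one disc failed, not yet removed
  ...
     Active Devices : 1
    Working Devices : 1
     Failed Devices : 1
      Spare Devices : 0

  # After the failed disc is removed, Failed Devices drops back to 0
  # and the old slot simply shows up as "removed".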

>  Why bother recording
> that bit in the superblock if you are just going to ignore it?  It is
> there, so it should not be ignored.

You are making assumptions about the use of that bit.  We don't ignore
it, we just don't use it the way you think we should.

>> mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1; mdadm --zero-superblock /dev/sdc1
>>
>> and that should be sufficient to satisfy your needs.  If we race between
>> the remove and the zero-superblock with something like a power failure
>> then obviously so little will have changed that you can simply repeat
>> the procedure until you successfully complete it without a power failure.
> 
> Except that now I can't manually re-add the disk to the array.

Sure you can: mdadm /dev/md0 -a /dev/sdc1

>> This is intentional.  A remove event merely triggers a kernel error
>> cycle on the target device.  We don't differentiate between a user
>> initiated remove and one that's the result of catastrophic disc failure.
> 
> Why not?  Seems to be because of a broken definition of the removed state.

Not at all.  You make lots of assumptions, some of them rather silly.
We don't differentiate between the two because the best way to ensure
you don't have errors in your device failure code path is to actually
use it at times other than just when a device fails.  It was an
intentional design decision that, when the set-faulty and hot-remove
ioctl calls were added to the raid stack, they would actually call into
the low-level command handling routines and activate the actual error
paths that would be followed if a command came back from a device with
some sort of error.  By doing that, we are actually testing the command
failure path every time we instigate a hot fail using mdadm.  That was
intentional, and it had nothing to do with the definition of the
removed state.
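
Concretely (the ioctl names are the ones from linux/raid/md_u.h; the
mapping shown is a simplification of what mdadm actually does):

  # Both halves of a manual hot-fail go through the same ioctls mdadm
  # always uses, so they exercise the real failure code path:
  mdadm /dev/md0 --fail /dev/sdc1     # SET_DISK_FAULTY: error path runs
  mdadm /dev/md0 --remove /dev/sdc1   # HOT_REMOVE_DISK: slot is emptied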

>>  However, trying to access a dead disc causes all sorts of bad behavior
>> on a real running system with a real disc failure, so once we know a
>> disc is bad and we are kicking it from the array, we only try to write
>> that data to the good discs so we aren't hosing the system.
> 
> That is what failed is for.  Once failed, you can't write to the disk
> anymore.  If it isn't failed, you should be able to remove it, and the
> superblock should be updated prior to detaching.

That's all fine and dandy for the MD raid stack, version 2.  Feel free
to start writing said stack with your preferred semantic definitions in
place.

>> We are specifically going in the opposite direction here.  We *want* to
>> automatically readd removed devices because we are implementing
>> automatic removal on hot unplug, which means we want automatic addition
>> on hot plug.
> 
> Again, this is because you are conflating removed and failed.

No, because we only have one failed state: failed.  Removed, as far as
mdadm is concerned, is an action, and removed, as far as the kernel is
concerned, just means "I don't have any device opened for this slot".
It's not a valid state for a device.

>  If you
> treat them separately, then you don't have this problem.  Failed
> happens automatically, so it can be undone automatically; removed does
> not.
> 
>> Again, we are back to the fact that you are interpreting removed to be
>> something it isn't.  We can argue about this all day long, but that
>> option has had a specific meaning for long enough, and has been around
>> long enough, that it can't be changed now without breaking all sorts of
>> backward compatibility.
> 
> This may be true for the --remove switch, but as it stands right now,
> the removed flag in the superblock is utterly meaningless.  If you want
> --remove to continue to behave the way it does, then you could add a
> --really-remove command to set the removed flag, though this seems a
> little silly.

I'm sure Neil will take patches.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband



* Re: safe segmenting of conflicting changes
  2010-04-27 19:29                         ` Doug Ledford
@ 2010-04-28 13:22                           ` Phillip Susi
  0 siblings, 0 replies; 20+ messages in thread
From: Phillip Susi @ 2010-04-28 13:22 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-raid

On 4/27/2010 3:29 PM, Doug Ledford wrote:
> We do record it, but mainly in the form of our disk count.  We keep
> track of active discs, failed discs, and spare discs.  Whether we have
> the disc open and in our device list changes what our failed disc count
> would be.  On removal, we decrement that failed disc count.  The slot
> that the disc used to occupy gets changed from being failed to removed
> simply to indicate there is no rdev associated with that slot.  That is
> the information we record.

Why bother recording that?  Whether or not the kernel last had an rdev
open is useless information to have in the superblock.  Why not instead
leave it recorded as failed until an explicit --remove?

> You are making assumptions about the use of that bit.  We don't ignore
> it, we just don't use it the way you think we should.

Then how is it used?  I see no way in which the kernel cares whether it
had a given device open prior to a crash.

> Sure you can: mdadm /dev/md0 -a /dev/sdc1

That's add, not re-add.


* detecting segmentation / conflicting changes
  2010-04-23 13:42 safe segmenting of conflicting changes (was: Two degraded mirror segments recombined out of sync for massive data loss) Christian Gatzemeier
  2010-04-23 15:08 ` Phillip Susi
@ 2010-05-05 11:28 ` Christian Gatzemeier
  1 sibling, 0 replies; 20+ messages in thread
From: Christian Gatzemeier @ 2010-05-05 11:28 UTC (permalink / raw)
  To: linux-raid


Just curious, as we're seeing cases where corrupt arrays are assembled.

Do you see anything wrong with the idea of checking for superblocks
that claim each other as failed?  Do you think it is not a sound way to
detect whether segments of an array have been run degraded separately?
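
To make the pattern concrete, here is a sketch of what the two halves
of a raid1 would record after being run degraded separately (0.90-style
--examine device tables, heavily abbreviated and approximate):

  $ mdadm --examine /dev/sda1         # was run degraded without sdb1
  ...
        Number   Major   Minor   RaidDevice State
  this     0       8        1        0      active sync   /dev/sda1
     0     0       8        1        0      active sync   /dev/sda1
     1     1       0        0        1      faulty removed

  $ mdadm --examine /dev/sdb1         # was run degraded without sda1
  ...
        Number   Major   Minor   RaidDevice State
  this     1       8       17        1      active sync   /dev/sdb1
     0     0       0        0        0      faulty removed
     1     1       8       17        1      active sync   /dev/sdb1

  # Each superblock still thinks itself active and records the *other*
  # member as failed: the halves have diverged, and neither should be
  # silently resynced over the other.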






