* Two degraded mirror segments recombined out of sync for massive data loss
@ 2010-04-07 20:45 Phillip Susi
  2010-04-07 21:21 ` Michael Evans
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Phillip Susi @ 2010-04-07 20:45 UTC (permalink / raw)
  To: linux-raid

The gist of the problem is this: after booting a mirror in degraded mode
with only the first disk, then doing the same with only the second disk,
then booting with both disks again, mdadm happily recombines the two
disks out of sync, causing two divergent filesystems to become munged
together.

The problem was initially discovered while testing the upcoming Lucid
release of Ubuntu, doing clean installs in a virtualization environment,
and I have reproduced it by manually activating and deactivating an
array built out of two LVM logical volumes under Karmic.  What seems to
be happening is that when you activate in degraded mode (mdadm
--assemble --run), the metadata on the first disk is changed to indicate
that the second disk was faulty and removed.  When you activate with
only the second disk, you would think it would say the first disk was
faulty, removed, but for some reason it ends up marking it only as
removed, not faulty.  Now each disk's metadata describes a degraded array.
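
The full sequence to reproduce is roughly the following (the device
names here are made up; substitute your own LVM volumes):

    # build a two-leg mirror out of two LVs, then stop it
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/vg/leg0 /dev/vg/leg1
    mdadm --stop /dev/md0

    # activate degraded with only the first leg, then stop again
    mdadm --assemble --run /dev/md0 /dev/vg/leg0
    mdadm --stop /dev/md0

    # now the same with only the second leg
    mdadm --assemble --run /dev/md0 /dev/vg/leg1
    mdadm --stop /dev/md0

    # compare what each leg now believes about the array
    mdadm --examine /dev/vg/leg0
    mdadm --examine /dev/vg/leg1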

When mdadm --incremental is run by udev on the first disk, it happily
activates it, since the array is degraded but has its one remaining
active member present, with the second member faulty,removed.  When
mdadm --incremental is run by udev on the second disk, it happily slips
the disk into the active array, WITHOUT SYNCING.
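
The udev path is equivalent to running something like the following by
hand (again with made-up device names):

    # first leg appears: the array is started with its single
    # "active" member
    mdadm --incremental /dev/vg/leg0

    # second leg appears: it is slipped straight into the running
    # array with no resync
    mdadm --incremental /dev/vg/leg1

    # both legs now show as active, with no rebuild in progress
    mdadm --detail /dev/md0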

My two questions are:

1) When doing mdadm --assemble --run with only the second disk present,
shouldn't it mark the first disk as faulty, removed instead of only removed?

2) When mdadm --incremental is run on the second disk, shouldn't it
refuse to use it since the array says the second disk is faulty, removed?

The bug report related to this can be found at:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/557429


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Two degraded mirror segments recombined out of sync for massive data loss
  2010-04-07 20:45 Two degraded mirror segments recombined out of sync for massive data loss Phillip Susi
@ 2010-04-07 21:21 ` Michael Evans
  2010-04-07 22:58   ` Jools Wills
  2010-04-08 14:58   ` Billy Crook
  2010-04-07 23:49 ` Neil Brown
  2010-04-14 20:56 ` Bill Davidsen
  2 siblings, 2 replies; 7+ messages in thread
From: Michael Evans @ 2010-04-07 21:21 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid

On Wed, Apr 7, 2010 at 1:45 PM, Phillip Susi <psusi@cfl.rr.com> wrote:
> The gist of the problem is this: after booting a mirror in degraded mode
> with only the first disk, then doing the same with only the second disk,
> then booting with both disks again, mdadm happily recombines the two
> disks out of sync, causing two divergent filesystems to become munged
> together.
>
> The problem was initially discovered while testing the upcoming Lucid
> release of Ubuntu, doing clean installs in a virtualization environment,
> and I have reproduced it by manually activating and deactivating an
> array built out of two LVM logical volumes under Karmic.  What seems to
> be happening is that when you activate in degraded mode (mdadm
> --assemble --run), the metadata on the first disk is changed to indicate
> that the second disk was faulty and removed.  When you activate with
> only the second disk, you would think it would say the first disk was
> faulty, removed, but for some reason it ends up marking it only as
> removed, not faulty.  Now each disk's metadata describes a degraded array.
>
> When mdadm --incremental is run by udev on the first disk, it happily
> activates it, since the array is degraded but has its one remaining
> active member present, with the second member faulty,removed.  When
> mdadm --incremental is run by udev on the second disk, it happily slips
> the disk into the active array, WITHOUT SYNCING.
>
> My two questions are:
>
> 1) When doing mdadm --assemble --run with only the second disk present,
> shouldn't it mark the first disk as faulty, removed instead of only removed?
>
> 2) When mdadm --incremental is run on the second disk, shouldn't it
> refuse to use it since the array says the second disk is faulty, removed?
>
> The bug report related to this can be found at:
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/557429

It sounds like the last 'synced' time should be tracked, as well as
the last modification time.  If the two differ, then it can be known
that the contents have diverged since the last sync.
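
In the meantime you can at least eyeball whether two legs have diverged
before recombining them, with something like this (hypothetical leg
names):

    for leg in /dev/vg/leg0 /dev/vg/leg1; do
        echo "== $leg =="
        mdadm --examine "$leg" | grep -E 'Update Time|Events|State'
    done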

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Two degraded mirror segments recombined out of sync for massive data loss
  2010-04-07 21:21 ` Michael Evans
@ 2010-04-07 22:58   ` Jools Wills
  2010-04-08 14:58   ` Billy Crook
  1 sibling, 0 replies; 7+ messages in thread
From: Jools Wills @ 2010-04-07 22:58 UTC (permalink / raw)
  To: Michael Evans; +Cc: Phillip Susi, linux-raid

Just to note a relevant point about mdadm on Ubuntu concerning the
upcoming release: Ubuntu is going to ship Lucid with mdadm 2.6.7, which
is rather old now.

Best Regards

Jools

Jools Wills
-- 
IT Consultant
Oxford Inspire - http://www.oxfordinspire.co.uk - be inspired
t: 01235 519446 m: 07966 577498
jools@oxfordinspire.co.uk


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Two degraded mirror segments recombined out of sync for massive data loss
  2010-04-07 20:45 Two degraded mirror segments recombined out of sync for massive data loss Phillip Susi
  2010-04-07 21:21 ` Michael Evans
@ 2010-04-07 23:49 ` Neil Brown
  2010-04-08 13:56   ` Phillip Susi
  2010-04-14 20:56 ` Bill Davidsen
  2 siblings, 1 reply; 7+ messages in thread
From: Neil Brown @ 2010-04-07 23:49 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid

On Wed, 7 Apr 2010 16:45:32 -0400
Phillip Susi <psusi@cfl.rr.com> wrote:

> The gist of the problem is this: after booting a mirror in degraded mode
> with only the first disk, then doing the same with only the second disk,
> then booting with both disks again, mdadm happily recombines the two
> disks out of sync, causing two divergent filesystems to become munged
> together.

I can only imagine two circumstances in which this could happen.
1/ You have a write-intent-bitmap configured.
2/ The event count on the two devices incremented by exactly the same
   amount while they were in use separately.

The second seems very improbable, but is certainly possible.

Please confirm whether or not you had a bitmap configured.

> 
> The problem was initially discovered while testing the upcoming Lucid
> release of Ubuntu, doing clean installs in a virtualization environment,
> and I have reproduced it by manually activating and deactivating an
> array built out of two LVM logical volumes under Karmic.  What seems to
> be happening is that when you activate in degraded mode (mdadm
> --assemble --run), the metadata on the first disk is changed to indicate
> that the second disk was faulty and removed.  When you activate with
> only the second disk, you would think it would say the first disk was
> faulty, removed, but for some reason it ends up marking it only as
> removed, not faulty.  Now each disk's metadata describes a degraded array.
> 
> When mdadm --incremental is run by udev on the first disk, it happily
> activates it, since the array is degraded but has its one remaining
> active member present, with the second member faulty,removed.  When
> mdadm --incremental is run by udev on the second disk, it happily slips
> the disk into the active array, WITHOUT SYNCING.
> 
> My two questions are:
> 
> 1) When doing mdadm --assemble --run with only the second disk present,
> shouldn't it mark the first disk as faulty, removed instead of only removed?

There is no important difference between "missing" and "faulty".  If md
cannot access a device, there is no way for it to know whether you, the
admin, consider that device to have failed or to have simply been removed
temporarily (e.g. as part of some backup regime).

> 
> 2) When mdadm --incremental is run on the second disk, shouldn't it
> refuse to use it since the array says the second disk is faulty, removed?
> 

No.  Just because the device was removed from the array doesn't mean you
don't want it to be part of the array any more.  And seeing that the
device is still plugged in...

mdadm --incremental should only include both disks in the array if
1/ their event counts are the same, or +/- 1, or
2/ there is a write-intent bitmap and the older event count is within
   the range recorded in the write-intent bitmap.
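
As a rough illustration, that decision can be previewed from the
superblocks with something like this (bash, hypothetical device names;
the Events field is the count being compared):

    e0=$(mdadm --examine /dev/vg/leg0 | awk '/Events/ {print $3}')
    e1=$(mdadm --examine /dev/vg/leg1 | awk '/Events/ {print $3}')
    # combinable only if the counts differ by at most one
    # (or the gap is covered by a write-intent bitmap)
    if [ $(( e0 > e1 ? e0 - e1 : e1 - e0 )) -le 1 ]; then
        echo "legs look combinable"
    else
        echo "legs have diverged; decide which one wins"
    fi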

You should understand that what you have done is at least undefined.
If you break a mirror, change both halves, then put it together again there
is no clearly "right" answer as to what will appear.

Given that you have changed both halves, you have implicitly said that both
halves are still "good".  If they are different, you need to explicitly tell
md which one you want and which one you don't.
The easiest way to do this is to use --zero-superblock on the "bad" device.
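
Concretely, that recovery looks something like this (device names made
up; it destroys whatever was written to the discarded half):

    mdadm --stop /dev/md0
    mdadm --zero-superblock /dev/vg/leg1    # the half being discarded
    mdadm --assemble --run /dev/md0 /dev/vg/leg0
    mdadm /dev/md0 --add /dev/vg/leg1       # triggers a full resync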

I don't think there is anything practical that could be changed in md or
mdadm to make it possible to catch this behaviour and refuse to assemble
the array...  Maybe mdadm could check that the bitmap on the 'old' device
is a subset of the bitmap on the 'new' device - that might be enough.
But if the devices just happen to have the same event count then as far as md
is concerned, they do contain the same data.

NeilBrown


> The bug report related to this can be found at:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/557429


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Two degraded mirror segments recombined out of sync for massive data loss
  2010-04-07 23:49 ` Neil Brown
@ 2010-04-08 13:56   ` Phillip Susi
  0 siblings, 0 replies; 7+ messages in thread
From: Phillip Susi @ 2010-04-08 13:56 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On 4/7/2010 7:49 PM, Neil Brown wrote:
> I can only imagine two circumstances in which this could happen.
> 1/ You have a write-intent-bitmap configured.
> 2/ The event count on the two devices incremented by exactly the same
>    amount while they were in use separately.
> 
> The second seems very improbable, but is certainly possible.
> 
> Please confirm whether or not you had a bitmap configured.

No write intent bitmap configured, and yes, the event count appears to
be the same on both legs.

> There is no important difference between "missing" and "faulty".  If md
> cannot access a device there is no way for it to know whether you, the admin,
> considers that device to have failed or to simply have been removed
> temporarily (e.g. as part of some backup regime).

Yes, but if a disk is faulty,removed, then either you explicitly told
mdadm to fail the disk and remove it, or it was failed and removed
during degraded activation.  In this case, shouldn't it require an mdadm
--add to re-insert it into the array?

If you had manually failed and removed the disk, then the metadata on
both disks would agree that the second disk was removed, and it would
require an explicit --add to return it.  This problem seems to stem from
the fact that their metadata disagree about which disk is removed.  In
this case, shouldn't the metadata in the already active array, taken
from the first disk, override the metadata on the second disk when it is
incrementally added?

In other words, mdadm --incremental should update the metadata on the
second disk to agree with the first, showing the second disk is the one
that is removed, and not activate the disk without an mdadm --add.

> No.  Just because the device was removed from the array doesn't mean you
> don't want it to be part of the array any more.  And seeing that the
> device is still plugged in...

What?  Of course it does.  If you explicitly remove the device it means
you don't want it being part of the array any more.

> mdadm --incremental should only include both disks in the array if
> 1/ their event counts are the same, or +/- 1, or
> 2/ there is a write-intent bitmap and the older event count is within
>    the range recorded in the write-intent bitmap.

I'm not familiar with the meaning of the event count.  Why should it
matter?  And shouldn't the only effect of the write-intent bitmap be to
speed up resyncing when you manually re-add the disk?

> You should understand that what you have done is at least undefined.
> If you break a mirror, change both halves, then put it together again there
> is no clearly "right" answer as to what will appear.

Yes, which version you get is undefined, and I would think it would come
down to which disk was discovered first, but you certainly should get
one version or the other, not a mishmash of both.  If the second disk
were left as removed and required manual intervention to use, then the
administrator could examine it and recover any data written to that disk
but not the first, before manually re-inserting it into the array,
causing a resync.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Two degraded mirror segments recombined out of sync for massive data loss
  2010-04-07 21:21 ` Michael Evans
  2010-04-07 22:58   ` Jools Wills
@ 2010-04-08 14:58   ` Billy Crook
  1 sibling, 0 replies; 7+ messages in thread
From: Billy Crook @ 2010-04-08 14:58 UTC (permalink / raw)
  To: Michael Evans; +Cc: Phillip Susi, linux-raid

On Wed, Apr 7, 2010 at 16:21, Michael Evans <mjevans1983@gmail.com> wrote:
> It sounds like the last 'synced' time should be tracked, as well as
> the last modification time. If the two differ then it can be known
> that the contents has diverged since last sync.

I have perhaps a better solution:
Every time an event happens that could affect the coherency of the
components of an array (e.g. started, stopped, disk failed), a counter
is incremented on all of the components.  Then a random number is
written next to it (the same number to all disks).

On assembling an array:
For all components in the array, find the highest counter.

If enough disks with this counter are present and contain the same
random number, then it starts the array, and if a rebuild is necessary
to regain parity or the specified number of mirrors, the remaining
components of the same array are consumed for this purpose.

If there are multiple random numbers present for this highest counter,
then the last modification time can be used to choose the most up-to-date
one.  It comes online and overwrites the components with the older
modification time, or maybe it just prompts the admin and starts
degraded.

This would catch the problem originally reported by Phillip because
the random numbers written to the components' headers would have been
written at different times, and so they would be different.  mdadm
would know by this that they had diverged regardless of any timestamp
or counter.
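
A rough sketch of the bookkeeping (entirely hypothetical; nothing like
this exists in mdadm today, and the file paths are invented):

    # at every array state change: bump the generation counter and
    # stamp all components with the same freshly drawn random cookie
    gen=$((gen + 1))
    cookie=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')
    for leg in /dev/vg/leg0 /dev/vg/leg1; do
        echo "gen=$gen cookie=$cookie" > "/var/lib/md-cookie.$(basename "$leg")"
    done

    # on assemble: among the legs carrying the highest gen, identical
    # cookies mean the legs were last written together; different
    # cookies mean they diverged and must not be blindly recombined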

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Two degraded mirror segments recombined out of sync for massive data loss
  2010-04-07 20:45 Two degraded mirror segments recombined out of sync for massive data loss Phillip Susi
  2010-04-07 21:21 ` Michael Evans
  2010-04-07 23:49 ` Neil Brown
@ 2010-04-14 20:56 ` Bill Davidsen
  2 siblings, 0 replies; 7+ messages in thread
From: Bill Davidsen @ 2010-04-14 20:56 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid

Phillip Susi wrote:
> The gist of the problem is this: after booting a mirror in degraded mode
> with only the first disk, then doing the same with only the second disk,
> then booting with both disks again, mdadm happily recombines the two
> disks out of sync, causing two divergent filesystems to become munged
> together.
>
> The problem was initially discovered while testing the upcoming Lucid
> release of Ubuntu, doing clean installs in a virtualization environment,
> and I have reproduced it by manually activating and deactivating an
> array built out of two LVM logical volumes under Karmic.  What seems to
> be happening is that when you activate in degraded mode (mdadm
> --assemble --run), the metadata on the first disk is changed to indicate
> that the second disk was faulty and removed.  When you activate with
> only the second disk, you would think it would say the first disk was
> faulty, removed, but for some reason it ends up marking it only as
> removed, not faulty.  Now each disk's metadata describes a degraded array.
>
> When mdadm --incremental is run by udev on the first disk, it happily
> activates it, since the array is degraded but has its one remaining
> active member present, with the second member faulty,removed.  When
> mdadm --incremental is run by udev on the second disk, it happily slips
> the disk into the active array, WITHOUT SYNCING.
>
> My two questions are:
>
> 1) When doing mdadm --assemble --run with only the second disk present,
> shouldn't it mark the first disk as faulty, removed instead of only removed?
>
> 2) When mdadm --incremental is run on the second disk, shouldn't it
> refuse to use it since the array says the second disk is faulty, removed?
>
> The bug report related to this can be found at:
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/557429
>
>   

Is any of this due to the rather elderly versions of the kernel and 
mdadm which Ubuntu was running?

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2010-04-14 20:56 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-07 20:45 Two degraded mirror segments recombined out of sync for massive data loss Phillip Susi
2010-04-07 21:21 ` Michael Evans
2010-04-07 22:58   ` Jools Wills
2010-04-08 14:58   ` Billy Crook
2010-04-07 23:49 ` Neil Brown
2010-04-08 13:56   ` Phillip Susi
2010-04-14 20:56 ` Bill Davidsen
