* DM-RAID1 data corruption
@ 2009-04-14 20:46 Mikulas Patocka
  2009-04-14 21:07 ` Takahiro Yasui
  2009-04-15  3:12 ` malahal
  0 siblings, 2 replies; 16+ messages in thread
From: Mikulas Patocka @ 2009-04-14 20:46 UTC (permalink / raw)
  To: Alasdair G Kergon; +Cc: Heinz Mauelshagen, dm-devel

Hi

This is the scenario of data corruption that I was talking about:

The mirror has two legs, 0 and 1, and a log. Disk 0 is the default.

A write is propagated to both legs. The write fails on leg 0 and succeeds 
on leg 1.

The function "write_callback" puts the bio on the "failure" list (if 
errors_handled is true). It also wakes userspace.

do_failures pops the bios from ms->log_failure and calls dm_rh_mark_nosync 
on them to mark the region nosync. dm_rh_mark_nosync completes the bio 
with success.

*the computer crashes* (before the userspace daemon had a chance to run)

On the next reboot, disk 0 is revived (suppose that it temporarily failed 
because of a loose cable, overheating, insufficient power or the like, and 
the condition is repaired). raid1 sees the set bit in the dirty bitmap and 
starts copying data from disk 0 to disk 1.

The result: the write bio was ended with success, but the data was lost. 
For databases, this might have bad consequences - committed transactions 
being forgotten.

-

If the above scenario can't happen, please describe why.

What would be a possible way to fix this?

Delay all bios until the userspace code removes the failed mirror?
Or store the number of the default mirror in the log?

Mikulas


* Re: DM-RAID1 data corruption
  2009-04-14 20:46 DM-RAID1 data corruption Mikulas Patocka
@ 2009-04-14 21:07 ` Takahiro Yasui
  2009-04-15  3:12 ` malahal
  1 sibling, 0 replies; 16+ messages in thread
From: Takahiro Yasui @ 2009-04-14 21:07 UTC (permalink / raw)
  To: device-mapper development; +Cc: Heinz Mauelshagen, Alasdair G Kergon

Hi Mikulas,

I know this data corruption issue can happen. To reproduce this
condition easily, I stopped dmeventd and injected an error into
leg 0, and the issue then happened in my environment.

The problem is that leg 0 is always the default mirror, without any
information being checked. Storing the information about which leg
is the default mirror might solve this issue.

Thanks,
Taka


> Hi
> 
> This is the scenario of data corruption that I was talking about:
> 
> The mirror has two legs, 0 and 1, and a log. Disk 0 is the default.
> 
> A write is propagated to both legs. The write fails on leg 0 and succeeds 
> on leg 1.
> 
> The function "write_callback" puts the bio on the "failure" list (if 
> errors_handled is true). It also wakes userspace.
> 
> do_failures pops the bios from ms->log_failure and calls dm_rh_mark_nosync 
> on them to mark the region nosync. dm_rh_mark_nosync completes the bio 
> with success.
> 
> *the computer crashes* (before the userspace daemon had a chance to run)
> 
> On the next reboot, disk 0 is revived (suppose that it temporarily failed 
> because of a loose cable, overheating, insufficient power or the like, and 
> the condition is repaired). raid1 sees the set bit in the dirty bitmap and 
> starts copying data from disk 0 to disk 1.
> 
> The result: the write bio was ended with success, but the data was lost. 
> For databases, this might have bad consequences - committed transactions 
> being forgotten.
> 
> -
> 
> If the above scenario can't happen, please describe why.
> 
> What would be a possible way to fix this?
> 
> Delay all bios until the userspace code removes the failed mirror?
> Or store the number of the default mirror in the log?
> 
> Mikulas
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel


* Re: DM-RAID1 data corruption
  2009-04-14 20:46 DM-RAID1 data corruption Mikulas Patocka
  2009-04-14 21:07 ` Takahiro Yasui
@ 2009-04-15  3:12 ` malahal
  2009-04-15 20:38   ` Takahiro Yasui
  1 sibling, 1 reply; 16+ messages in thread
From: malahal @ 2009-04-15  3:12 UTC (permalink / raw)
  To: dm-devel

Mikulas Patocka [mpatocka@redhat.com] wrote:
> Hi
> 
> because of a loose cable, overheating, insufficient power or the like, and 
> the condition is repaired). raid1 sees the set bit in the dirty bitmap and 
> starts copying data from disk 0 to disk 1.
> 
> The result: the write bio was ended with success, but the data was lost. 
> For databases, this might have bad consequences - committed transactions 
> being forgotten.
> 
> -
> 
> If the above scenario can't happen, please describe why.
 
IIRC, this is a known problem, always attributed to a "rare/small
window" of chance. :-(

> Delay all bios until the userspace code removes the failed mirror?

That is what the code does when a log device fails. We can use the same
approach.

> Or store the number of the default mirror in the log?

This is one way to do it, but what about "corelog" mirrors?

Look at this patch
http://permalink.gmane.org/gmane.linux.kernel.device-mapper.devel/4973

It essentially generates a uevent and waits for the user-level code to
act on it and send a message to unblock it.


* Re: DM-RAID1 data corruption
  2009-04-15  3:12 ` malahal
@ 2009-04-15 20:38   ` Takahiro Yasui
  2009-04-16  2:49     ` malahal
  0 siblings, 1 reply; 16+ messages in thread
From: Takahiro Yasui @ 2009-04-15 20:38 UTC (permalink / raw)
  To: dm-devel

malahal@us.ibm.com wrote:
> Look at this patch
> http://permalink.gmane.org/gmane.linux.kernel.device-mapper.devel/4973
>
> It essentially generates a uevent and waits for the user-level code to
> act on it and send a message to unblock it.

This patch was posted more than a year ago, and I could not find
any discussion on this issue/patch in the mailing list archive.
What was the conclusion of the discussion about this patch?
Are there any discussions outside this mailing list?

I think data corruption is really a serious problem even if it
has a very small chance of happening. If it is a known problem, we
need to fix it. Don't you agree?

I took a rough look at your patch, and I understand it is one approach
to fixing this issue. I have just one concern, about delay: when the
'unblock' message is delayed for some reason (by dmeventd?), all write
I/Os from applications have to wait, and many I/Os might accumulate in
the write queue and remain blocked while in the 'block' state.

Thanks,
Taka


* Re: DM-RAID1 data corruption
  2009-04-15 20:38   ` Takahiro Yasui
@ 2009-04-16  2:49     ` malahal
  2009-04-16 22:24       ` Takahiro Yasui
  0 siblings, 1 reply; 16+ messages in thread
From: malahal @ 2009-04-16  2:49 UTC (permalink / raw)
  To: dm-devel

Takahiro Yasui [tyasui@redhat.com] wrote:
> malahal@us.ibm.com wrote:
> > Look at this patch
> > http://permalink.gmane.org/gmane.linux.kernel.device-mapper.devel/4973
> >
> > It essentially generates a uevent and waits for the user-level code to
> > act on it and send a message to unblock it.
> 
> This patch was posted more than a year ago, and I could not find
> any discussion on this issue/patch in the mailing list archive.
> What was the conclusion of the discussion about this patch?
> Are there any discussions outside this mailing list?

The patch alone can't fix the issue. It needed LVM changes. We had some
discussions on how to implement the LVM related changes. Finally I was
told to look at the remote-replication target code to see how that handles
selecting the right "MASTER" device. That code is not published yet.

> I think data corruption is really a serious problem even if it
> has a very small chance of happening. If it is a known problem, we
> need to fix it. Don't you agree?

Of course, yes!

> I took a rough look at your patch, and I understand it is one approach
> to fixing this issue. I have just one concern, about delay: when the
> 'unblock' message is delayed for some reason (by dmeventd?), all write
> I/Os from applications have to wait, and many I/Os might accumulate in
> the write queue and remain blocked while in the 'block' state.

That is how the "log device" failure is handled today. Alasdair also
thought we needed to change LVM to handle events as soon as possible
using a single thread and not block behind an LVM scan, etc.

Another method is to have dm-mirror target metadata on the disk itself.
This metadata is internal to the kernel module, and userspace would NOT
touch it. This would avoid any user-level interaction and delays.

Of course, we can do something in the log itself, but it will not fix
"corelog" mirrors; moreover, the system can't auto-recover from a
missing log alone.

Thanks, Malahal.


* Re: DM-RAID1 data corruption
  2009-04-16  2:49     ` malahal
@ 2009-04-16 22:24       ` Takahiro Yasui
  2009-04-20  9:56         ` Mikulas Patocka
  0 siblings, 1 reply; 16+ messages in thread
From: Takahiro Yasui @ 2009-04-16 22:24 UTC (permalink / raw)
  To: dm-devel

malahal@us.ibm.com wrote:
> Takahiro Yasui [tyasui@redhat.com] wrote:
>> malahal@us.ibm.com wrote:
>>> Look at this patch
>>> http://permalink.gmane.org/gmane.linux.kernel.device-mapper.devel/4973
>>>
>>> It essentially generates a uevent and waits for the user-level code to
>>> act on it and send a message to unblock it.
>> This patch was posted more than a year ago, and I could not find
>> any discussion on this issue/patch in the mailing list archive.
>> What was the conclusion of the discussion about this patch?
>> Are there any discussions outside this mailing list?
> 
> The patch alone can't fix the issue. It needed LVM changes. We had some
> discussions on how to implement the LVM related changes. Finally I was
> told to look at the remote-replication target code to see how that handles
> selecting the right "MASTER" device. That code is not published yet.

Who is working on this?

> That is how the "log device" failure is handled today. Alasdair also
> thought we needed to change LVM to handle events as soon as possible
> using a single thread and not block behind an LVM scan, etc.

I agree. I also described this point in the background section of
"Introduce metadata cache".
https://www.redhat.com/archives/lvm-devel/2009-April/msg00014.html

> Another method is to have dm-mirror target metadata on the disk itself.
> This metadata is internal to the kernel module, and userspace would NOT
> touch it. This would avoid any user-level interaction and delays.

I'm interested in this approach, in which dm-mirror manages its own data
to keep track of status, such as the number of the default mirror and the
valid legs. When an error is detected, dm-mirror handles the error and
disables the failed disk as soon as possible in kernel space; the lvm
metadata is then updated in user space later.

Some transaction systems are sensitive to delay, so approaches that
don't add much delay even when an error is detected are desirable.

> Of course, we can do something in the log itself, but it will not fix
> "corelog" mirrors; moreover, the system can't auto-recover from a
> missing log alone.

Yes, storing information on the log device does not save "corelog"
mirrors, so we might need some area to keep information on mirror
legs.

Thanks,
Taka


* Re: DM-RAID1 data corruption
  2009-04-16 22:24       ` Takahiro Yasui
@ 2009-04-20  9:56         ` Mikulas Patocka
  2009-04-20 17:08           ` Takahiro Yasui
                             ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Mikulas Patocka @ 2009-04-20  9:56 UTC (permalink / raw)
  To: Takahiro Yasui; +Cc: dm-devel

> > Of course, we can do something in the log itself, but it will not fix
> > "corelog" mirrors; moreover, the system can't auto-recover from a
> > missing log alone.
> 
> Yes, storing information on the log device does not save "corelog"
> mirrors, so we might need some area to keep information on mirror
> legs.
> 
> Thanks,
> Taka

MD-RAID1 solves this problem by having counters in superblocks on both 
legs. If some leg dies, the counters on the other devices are increased. If 
the dead disk comes online again, it is found that it has an old counter 
and cannot be trusted.
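
As a toy illustration only (the field names here are invented and this is
not MD's actual on-disk superblock format), the counter comparison could
look like this in C:

#include <stdio.h>
#include <stdint.h>

struct leg_superblock {
	uint64_t events;	/* bumped on the surviving leg when a peer leg fails */
};

/* pick the leg whose data can be trusted as the resync source */
static int choose_sync_source(const struct leg_superblock *leg0,
			      const struct leg_superblock *leg1)
{
	return leg1->events > leg0->events ? 1 : 0;
}

int main(void)
{
	/* leg 0 failed while writes continued to leg 1, so leg 1's
	   counter was incremented before the crash */
	struct leg_superblock leg0 = { .events = 41 };
	struct leg_superblock leg1 = { .events = 42 };

	printf("resync from leg %d\n", choose_sync_source(&leg0, &leg1));
	return 0;
}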

Would it be possible to extend a logical volume when converting it to a 
raid1 and use the last area of the volume as a superblock?

Mikulas


* Re: DM-RAID1 data corruption
  2009-04-20  9:56         ` Mikulas Patocka
@ 2009-04-20 17:08           ` Takahiro Yasui
  2009-05-27  1:33           ` malahal
  2009-06-23  1:09           ` malahal
  2 siblings, 0 replies; 16+ messages in thread
From: Takahiro Yasui @ 2009-04-20 17:08 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel

Mikulas Patocka wrote:
>>> Of course, we can do something in the log itself, but it will not fix
>>> "corelog" mirrors; moreover, the system can't auto-recover from a
>>> missing log alone.
>> Yes, storing information on the log device does not save "corelog"
>> mirrors, so we might need some area to keep information on mirror
>> legs.
>>
>> Thanks,
>> Taka
> 
> MD-RAID1 solves this problem by having counters in superblocks on both 
> legs. If some leg dies, the counters on the other devices are increased. If 
> the dead disk comes online again, it is found that it has an old counter 
> and cannot be trusted.
> 
> Would it be possible to extend a logical volume when converting it to a 
> raid1 and use the last area of the volume as a superblock?

I agree with this idea. lvm metadata is managed by counters and I think
this method would work for this issue as well.

In addition, I think this sort of superblock is necessary not only
for fixing the data corruption issue but also for introducing a faster
blockage algorithm which disables failed mirror legs in the kernel.

Thanks,
Taka


* Re: DM-RAID1 data corruption
  2009-04-20  9:56         ` Mikulas Patocka
  2009-04-20 17:08           ` Takahiro Yasui
@ 2009-05-27  1:33           ` malahal
  2009-06-23  1:09           ` malahal
  2 siblings, 0 replies; 16+ messages in thread
From: malahal @ 2009-05-27  1:33 UTC (permalink / raw)
  To: dm-devel; +Cc: agk

Mikulas, does RedHat have a bugzilla bug opened for this problem?

Mikulas Patocka [mpatocka@redhat.com] wrote:
> MD-RAID1 solves this problem by having counters in superblocks on both 
> legs. If some leg dies, the counters on the other devices are increased. If 
> the dead disk comes online again, it is found that it has an old counter 
> and cannot be trusted.
> 
> Would it be possible to extend a logical volume when converting it to a 
> raid1 and use the last area of the volume as a superblock?

Interesting idea. Any more thoughts on this front?


* Re: DM-RAID1 data corruption
  2009-04-20  9:56         ` Mikulas Patocka
  2009-04-20 17:08           ` Takahiro Yasui
  2009-05-27  1:33           ` malahal
@ 2009-06-23  1:09           ` malahal
  2009-06-23 16:44             ` Takahiro Yasui
  2 siblings, 1 reply; 16+ messages in thread
From: malahal @ 2009-06-23  1:09 UTC (permalink / raw)
  To: dm-devel; +Cc: mpatocka

Mikulas Patocka [mpatocka@redhat.com] wrote:
> > > Of course, we can do something in the log itself, but it will not fix
> > > "corelog" mirrors; moreover, the system can't auto-recover from a
> > > missing log alone.
> > 
> > Yes, storing information on the log device does not save "corelog"
> > mirrors, so we might need some area to keep information on mirror
> > legs.
> > 
> > Thanks,
> > Taka
> 
> MD-RAID1 solves this problem by having counters in superblocks on both 
> legs. If some leg dies, the counters on the other devices are increased. If 
> the dead disk comes online again, it is found that it has an old counter 
> and cannot be trusted.
> 
> Would it be possible to extend a logical volume when converting it to a 
> raid1 and use the last area of the volume as a superblock?
> 
> Mikulas

This is an old thread, I am just trying to revitalize it! :-) How about
dm-raid1 taking superblock storage as arguments in the command line,
just like the log device? The superblock storage is entirely managed by
the kernel, LVM just allocates it. Error handling can be instant this
way. LVM can auto-convert the existing mirrors to this kind of mirror if
space is available.

Thanks, Malahal.


* Re: DM-RAID1 data corruption
  2009-06-23  1:09           ` malahal
@ 2009-06-23 16:44             ` Takahiro Yasui
  2009-06-23 18:22               ` malahal
  2009-06-24  3:03               ` Neil Brown
  0 siblings, 2 replies; 16+ messages in thread
From: Takahiro Yasui @ 2009-06-23 16:44 UTC (permalink / raw)
  To: malahal; +Cc: dm-devel, mpatocka

>> MD-RAID1 solves this problem by having counters in superblocks on both 
>> legs. If some leg dies, the counters on the other devices are increased. If 
>> the dead disk comes online again, it is found that it has an old counter 
>> and cannot be trusted.
>>
>> Would it be possible to extend a logical volume when converting it to a 
>> raid1 and use the last area of the volume as a superblock?
>>
>> Mikulas
> 
> This is an old thread, I am just trying to revitalize it! :-) How about
> dm-raid1 taking superblock storage as arguments in the command line,
> just like the log device? The superblock storage is entirely managed by
> the kernel, LVM just allocates it. Error handling can be instant this
> way. LVM can auto-convert the existing mirrors to this kind of mirror if
> space is available.

Interesting idea. The superblock storage managed by the kernel is really
important for handling an error quickly inside the kernel.

According to Mikulas's comment, superblocks are located on each leg
and contain a counter. Do you have in mind a separate superblock
device, like the log device? Or do you mean to add a parameter to
enable a superblock, like "--superblock"? It might be helpful if you
could describe some command-line examples.

Thanks,
Taka


* Re: DM-RAID1 data corruption
  2009-06-23 16:44             ` Takahiro Yasui
@ 2009-06-23 18:22               ` malahal
  2009-06-24  3:03               ` Neil Brown
  1 sibling, 0 replies; 16+ messages in thread
From: malahal @ 2009-06-23 18:22 UTC (permalink / raw)
  To: dm-devel

Takahiro Yasui [tyasui@redhat.com] wrote:
> >> MD-RAID1 solves this problem by having counters in superblocks on both 
> >> legs. If some leg dies, the counters on the other devices are increased. If 
> >> the dead disk comes online again, it is found that it has an old counter 
> >> and cannot be trusted.
> >>
> >> Would it be possible to extend a logical volume when converting it to a 
> >> raid1 and use the last area of the volume as a superblock?
> >>
> >> Mikulas
> > 
> > This is an old thread, I am just trying to revitalize it! :-) How about
> > dm-raid1 taking superblock storage as arguments in the command line,
> > just like the log device? The superblock storage is entirely managed by
> > the kernel, LVM just allocates it. Error handling can be instant this
> > way. LVM can auto-convert the existing mirrors to this kind of mirror if
> > space is available.
> 
> Interesting idea. The superblock storage managed by the kernel is really
> important for handling an error quickly inside the kernel.
> 
> According to Mikulas's comment, superblocks are located on each leg
> and contain a counter. Do you have in mind a separate superblock
> device, like the log device? Or do you mean to add a parameter to
> enable a superblock, like "--superblock"? It might be helpful if you
> could describe some command-line examples.

I was thinking of something like adding {metadata-size #metadata-areas
<dev offset> ...} to the mirror table line, for example:

echo 0 1000 mirror core 1 64 2 /dev/sda 0 /dev/sdb 0 512 2 /dev/sda 2000 /dev/sdb 2000

The above will create a mirror with sda and sdb, with the mirror metadata
on the same disks at offset 2000 and size 512. The kernel target can use
those 512 bytes in whatever way it wants.
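
Reading the example field by field (this is only one interpretation of the
proposed syntax, which is hypothetical and not an existing dm interface):

  0 1000 mirror                  start, length and target type
  core 1 64                      "core" log, 1 log argument: region size 64
  2 /dev/sda 0 /dev/sdb 0        two mirror legs with their offsets
  512 2                          metadata-size 512, two metadata areas
  /dev/sda 2000 /dev/sdb 2000    one metadata area per leg at offset 2000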


* Re: DM-RAID1 data corruption
  2009-06-23 16:44             ` Takahiro Yasui
  2009-06-23 18:22               ` malahal
@ 2009-06-24  3:03               ` Neil Brown
  2009-06-24 16:09                 ` Takahiro Yasui
  1 sibling, 1 reply; 16+ messages in thread
From: Neil Brown @ 2009-06-24  3:03 UTC (permalink / raw)
  To: device-mapper development; +Cc: mpatocka

On Tuesday June 23, tyasui@redhat.com wrote:
> >> MD-RAID1 solves this problem by having counters in superblocks on both 
> >> legs. If some leg dies, the counters on the other devices are increased. If 
> >> the dead disk comes online again, it is found that it has an old counter 
> >> and cannot be trusted.
> >>
> >> Would it be possible to extend a logical volume when converting it to a 
> >> raid1 and use the last area of the volume as a superblock?
> >>
> >> Mikulas
> > 
> > This is an old thread, I am just trying to revitalize it! :-) How about
> > dm-raid1 taking superblock storage as arguments in the command line,
> > just like the log device? The superblock storage is entirely managed by
> > the kernel, LVM just allocates it. Error handling can be instant this
> > way. LVM can auto-convert the existing mirrors to this kind of mirror if
> > space is available.
> 
> Interesting idea. The superblock storage managed by the kernel is really
> important for handling an error quickly inside the kernel.

I don't think that it is important to handle errors quickly - they
really shouldn't happen often enough that speed is an issue.  All you
need to do is handle errors correctly.

I would suggest that you simply get raid1 to block any write requests
until all drive failures have been acknowledged by userspace.
So you would need to differentiate between an acknowledged drive
failure and an unacknowledged failure.  Writes block whenever there
are unacknowledged failures.
Then you need a message that can be sent to the raid1 to acknowledge
the failure of a particular device.
'suspend' would need to fail if there are any unacknowledged failures
as otherwise it would block.
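
For illustration, the bookkeeping could be sketched like this (names such
as submit_write() and release_blocked_writes() are placeholders; locking
and the dm plumbing are omitted):

struct mirror_state {
	unsigned int unacked_failures;	/* failed legs not yet acked by userspace */
	struct bio_list blocked_writes;	/* writes held while a failure is unacked */
};

/* called when a write to a leg fails */
static void leg_failed(struct mirror_state *m)
{
	m->unacked_failures++;
	/* notify userspace here (uevent / dm event) */
}

static void handle_write(struct mirror_state *m, struct bio *bio)
{
	if (m->unacked_failures) {
		bio_list_add(&m->blocked_writes, bio);	/* hold the write */
		return;
	}
	submit_write(m, bio);
}

/* "acknowledge <dev>" message from userspace */
static void ack_failure(struct mirror_state *m)
{
	if (m->unacked_failures && --m->unacked_failures == 0)
		release_blocked_writes(m);	/* resubmit the held writes */
}

/* a suspend request would be refused while unacked_failures != 0,
   since it would otherwise block behind the held writes */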

NeilBrown


* Re: DM-RAID1 data corruption
  2009-06-24  3:03               ` Neil Brown
@ 2009-06-24 16:09                 ` Takahiro Yasui
  2009-06-25 14:47                   ` Mikulas Patocka
  0 siblings, 1 reply; 16+ messages in thread
From: Takahiro Yasui @ 2009-06-24 16:09 UTC (permalink / raw)
  To: neilb; +Cc: device-mapper development, mpatocka

Neil Brown wrote:
> On Tuesday June 23, tyasui@redhat.com wrote:
>>>> MD-RAID1 solves this problem by having counters in superblocks on both 
>>>> legs. If some leg dies, the counters on the other devices are increased. If 
>>>> the dead disk comes online again, it is found that it has an old counter 
>>>> and cannot be trusted.
>>>>
>>>> Would it be possible to extend a logical volume when converting it to a 
>>>> raid1 and use the last area of the volume as a superblock?
>>>>
>>>> Mikulas
>>> This is an old thread, I am just trying to revitalize it! :-) How about
>>> dm-raid1 taking superblock storage as arguments in the command line,
>>> just like the log device? The superblock storage is entirely managed by
>>> the kernel, LVM just allocates it. Error handling can be instant this
>>> way. LVM can auto-convert the existing mirrors to this kind of mirror if
>>> space is available.
>> Interesting idea. The superblock storage managed by the kernel is really
>> important for handling an error quickly inside the kernel.
> 
> I don't think that it is important to handle errors quickly - they
> really shouldn't happen often enough that speed is an issue.  All you
> need to do is handle errors correctly.

Not really. Quick error handling is not for preventing this issue
but for shortening system downtime. Fixing this issue is most important,
but it is better to discuss other approaches, too.

> I would suggest that you simply get raid1 to block any write requests
> until all drive failures have been acknowledged by userspace.
> So you would need to differentiate between an acknowledged drive
> failure and an unacknowledged failure.  Writes block whenever there
> are unacknowledged failures.
> Then you need a message that can be sent to the raid1 to acknowledge
> the failure of a particular device.
> 'suspend' would need to fail if there are any unacknowledged failures
> as otherwise it would block.

As we discussed in this thread, your suggestion is quite similar to what
malahal proposed more than one year ago.

malahal@us.ibm.com wrote:
> > Look at this patch
> > http://permalink.gmane.org/gmane.linux.kernel.device-mapper.devel/4973
> >
> > It essentially generates a uevent and waits for the user-level code to
> > act on it and send a message to unblock it.

This is a simple approach, but all write I/Os are blocked until write
errors are processed by userspace (dmeventd). Depending on the error,
such as a timeout, the recovery procedure in userspace may take a long
time, and applications sensitive to delay will have another problem.

The superblock approach may solve this data corruption issue without
additional delay. When dm-raid1 detects a write error, it can disable
the mirror leg quickly and ask userspace to handle the aftermath.

I would like to continue the discussion of how to fix this issue.

Thanks,
Taka


* Re: DM-RAID1 data corruption
  2009-06-24 16:09                 ` Takahiro Yasui
@ 2009-06-25 14:47                   ` Mikulas Patocka
  2009-06-25 16:16                     ` Takahiro Yasui
  0 siblings, 1 reply; 16+ messages in thread
From: Mikulas Patocka @ 2009-06-25 14:47 UTC (permalink / raw)
  To: Takahiro Yasui; +Cc: device-mapper development

> > I would suggest that you simply get raid1 to block any write requests
> > until all drive failures have been acknowledged by userspace.
> > So you would need to differentiate between an acknowledged drive
> > failure and an unacknowledged failure.  Writes block whenever there
> > are unacknowledged failures.
> > Then you need a message that can be sent to the raid1 to acknowledge
> > the failure of a particular device.
> > 'suspend' would need to fail if there are any unacknowledged failures
> > as otherwise it would block.
> 
> As we discussed in this thread, your suggestion is quite similar to what
> malahal proposed more than one year ago.
> 
> malahal@us.ibm.com wrote:
> > > Look at this patch
> > > http://permalink.gmane.org/gmane.linux.kernel.device-mapper.devel/4973
> > >
> > > It essentially generates a uevent and waits for the user-level code to
> > > act on it and send a message to unblock it.
> 
> This is a simple approach, but all write I/Os are blocked until write
> errors are processed by userspace (dmeventd). Depending on the error,
> such as a timeout, the recovery procedure in userspace may take a long
> time, and applications sensitive to delay will have another problem.
> 
> The superblock approach may solve this data corruption issue without
> additional delay. When dm-raid1 detects a write error, it can disable
> the mirror leg quickly and ask userspace to handle the aftermath.
> 
> I would like to continue the discussion of how to fix this issue.
> 
> Thanks,
> Taka

The current code already blocks all i/os if the log fails. So if your 
application is sensitive to it, you have to test for it anyway.

You can write a new implementation, dm-raid1.2, that will have two log devices 
(just like md-raid has two superblocks) and can perform without userspace 
intervention if any of the legs or logs fail.

But it is really more like a rewrite --- I think it would be easier to write 
it from scratch than to patch it over the existing dm-raid1 code.

Mikulas


* Re: DM-RAID1 data corruption
  2009-06-25 14:47                   ` Mikulas Patocka
@ 2009-06-25 16:16                     ` Takahiro Yasui
  0 siblings, 0 replies; 16+ messages in thread
From: Takahiro Yasui @ 2009-06-25 16:16 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development

On 06/25/09 10:47, Mikulas Patocka wrote:
>> The superblock approach may solve this data corruption issue without
>> additional delay. When dm-raid1 detects a write error, it can disable
>> the mirror leg quickly and ask userspace to handle the aftermath.
>>
>> I would like to continue the discussion of how to fix this issue.
>>
>> Thanks,
>> Taka
> 
> The current code already blocks all i/os if the log fails. So if your 
> application is sensitive to it, you have to test for it anyway.

It is also a known issue. We are trying to make a log device redundant
to prevent disk replication when a log device has a problem. It will
solve the delay caused by a log device failure as well.

> You can write a new implementation, dm-raid1.2, that will have two log devices 
> (just like md-raid has two superblocks) and can perform without userspace 
> intervention if any of the legs or logs fail.
> 
> But it is really more like a rewrite --- I think it would be easier to write 
> it from scratch than to patch it over the existing dm-raid1 code.

Hmm. I will look at the implementation to see if we could add superblocks
into the current code.

Thanks,
Taka
