* [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust [not found] <17025a94-1999-4619-b23d-7460946c2f85@zmail15.collab.prod.int.phx2.redhat.com> @ 2012-07-18 11:01 ` Jaromir Capik 2012-07-18 11:13 ` Mathias Burén ` (3 more replies) 0 siblings, 4 replies; 36+ messages in thread From: Jaromir Capik @ 2012-07-18 11:01 UTC (permalink / raw) To: linux-raid Hello. I'd like to ask you to implement the following ... The current RAID1 solution is not robust enough to protect the data against random data corruptions. Such corruptions usually happen when an unreadable sector is found by the drive's electronics and when the drive's trying to reallocate the sector to the spare area. There's no guarantee that the reallocated data will always match the original stored data since the drive sometimes can't read the data correctly even with several retries. That unfortunately completely masks the issue, because the sector can be read by the OS without problems even if it doesn't contain correct data. Would it be possible to implement chunk checksums to avoid such data corruptions? If a corrupted chunk is encountered, it would be taken from the second drive and immediately synced back. This would have a small performance and capacity impact (1 sector per chunk to minimize performance impact caused by unaligned granularity = 0.78% of the capacity with 64k chunks). Please, let me know if you find my request reasonable or not. Thanks in advance. Regards, Jaromir. -- Jaromir Capik Red Hat Czech, s.r.o. Software Engineer / BaseOS Email: jcapik@redhat.com Web: www.cz.redhat.com Red Hat Czech s.r.o., Purkynova 99/71, 612 45, Brno, Czech Republic IC: 27690016 ^ permalink raw reply [flat|nested] 36+ messages in thread
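As an aside on the arithmetic in the request above: the 0.78% figure follows from dedicating one 512-byte sector per chunk to a checksum. A minimal sketch of that calculation (plain Python; the interleaved layout itself is only the proposal above, not an existing md feature):

  SECTOR = 512  # bytes

  def checksum_overhead(chunk_bytes):
      """Fraction of raw capacity spent on one checksum sector per chunk."""
      chunk_sectors = chunk_bytes // SECTOR
      return 1.0 / (chunk_sectors + 1)

  for kib in (16, 64, 256, 1024):
      print(f"{kib:5d} KiB chunks -> {100 * checksum_overhead(kib * 1024):.2f}% overhead")
  # 64 KiB chunks come out at ~0.78%, the figure quoted in the request above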
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 11:01 ` [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust Jaromir Capik @ 2012-07-18 11:13 ` Mathias Burén 2012-07-18 12:42 ` Jaromir Capik 2012-07-18 11:15 ` NeilBrown ` (2 subsequent siblings) 3 siblings, 1 reply; 36+ messages in thread From: Mathias Burén @ 2012-07-18 11:13 UTC (permalink / raw) To: Jaromir Capik; +Cc: linux-raid On 18 July 2012 12:01, Jaromir Capik <jcapik@redhat.com> wrote: > Hello. > > I'd like to ask you to implement the following ... > > The current RAID1 solution is not robust enough to protect the data > against random data corruptions. Such corruptions usually happen > when an unreadable sector is found by the drive's electronics > and when the drive's trying to reallocate the sector to the spare area. > There's no guarantee that the reallocated data will always match > the original stored data since the drive sometimes can't read the data > correctly even with several retries. That unfortunately completely masks > the issue, because the sector can be read by the OS without problems > even if it doesn't contain correct data. Would it be possible > to implement chunk checksums to avoid such data corruptions? > If a corrupted chunk is encountered, it would be taken from the second > drive and immediately synced back. This would have a small performance > and capacity impact (1 sector per chunk to minimize performance impact > caused by unaligned granularity = 0.78% of the capacity with 64k chunks). > > Please, let me know if you find my request reasonable or not. > > Thanks in advance. > > Regards, > Jaromir. > > -- > Jaromir Capik > Red Hat Czech, s.r.o. > Software Engineer / BaseOS > > Email: jcapik@redhat.com > Web: www.cz.redhat.com > Red Hat Czech s.r.o., Purkynova 99/71, 612 45, Brno, Czech Republic > IC: 27690016 > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html That would be a disk format change... Why not use btrfs or zfs? Mathias ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 11:13 ` Mathias Burén @ 2012-07-18 12:42 ` Jaromir Capik 0 siblings, 0 replies; 36+ messages in thread From: Jaromir Capik @ 2012-07-18 12:42 UTC (permalink / raw) To: Mathias Burén; +Cc: linux-raid > On 18 July 2012 12:01, Jaromir Capik <jcapik@redhat.com> wrote: > > Hello. > > > > I'd like to ask you to implement the following ... > > > > The current RAID1 solution is not robust enough to protect the data > > against random data corruptions. Such corruptions usually happen > > when an unreadable sector is found by the drive's electronics > > and when the drive's trying to reallocate the sector to the spare > > area. > > There's no guarantee that the reallocated data will always match > > the original stored data since the drive sometimes can't read the > > data > > correctly even with several retries. That unfortunately completely > > masks > > the issue, because the sector can be read by the OS without > > problems > > even if it doesn't contain correct data. Would it be possible > > to implement chunk checksums to avoid such data corruptions? > > If a corrupted chunk is encountered, it would be taken from the > > second > > drive and immediately synced back. This would have a small > > performance > > and capacity impact (1 sector per chunk to minimize performance > > impact > > caused by unaligned granularity = 0.78% of the capacity with 64k > > chunks). > > > > Please, let me know if you find my request reasonable or not. > > > > Thanks in advance. > > > > Regards, > > Jaromir. > > > > -- > > Jaromir Capik > > Red Hat Czech, s.r.o. > > Software Engineer / BaseOS > > > > Email: jcapik@redhat.com > > Web: www.cz.redhat.com > > Red Hat Czech s.r.o., Purkynova 99/71, 612 45, Brno, Czech Republic > > IC: 27690016 > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe > > linux-raid" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > That would be a disk format change... > Hello Mathias. Yes. That would be a disk format change ... but optional only! Chunks would be interleaved with checksums, therefore the small capacity loss. > Why not use btrfs or zfs? I know btrfs implements that, but AFAIK it still lacks the transparent encryption. Am I wrong? In that case I would have to create one large regular file holding the LUKS data and modify the initramdisk to handle that. I could give ZFS a try ... > > Mathias > Thanks, Jaromir. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 11:01 ` [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust Jaromir Capik 2012-07-18 11:13 ` Mathias Burén @ 2012-07-18 11:15 ` NeilBrown 2012-07-18 13:04 ` Jaromir Capik 2012-07-18 11:49 ` keld 2012-07-18 16:28 ` Asdo 3 siblings, 1 reply; 36+ messages in thread From: NeilBrown @ 2012-07-18 11:15 UTC (permalink / raw) To: Jaromir Capik; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1917 bytes --] On Wed, 18 Jul 2012 07:01:48 -0400 (EDT) Jaromir Capik <jcapik@redhat.com> wrote: > Hello. > > I'd like to ask you to implement the following ... > > The current RAID1 solution is not robust enough to protect the data > against random data corruptions. Such corruptions usually happen > when an unreadable sector is found by the drive's electronics > and when the drive's trying to reallocate the sector to the spare area. > There's no guarantee that the reallocated data will always match > the original stored data since the drive sometimes can't read the data > correctly even with several retries. That unfortunately completely masks > the issue, because the sector can be read by the OS without problems > even if it doesn't contain correct data. If a drive ever lets you read incorrect data rather than giving you an error indication, then the drive is broken by design. Don't use drives that do that. > Would it be possible > to implement chunk checksums to avoid such data corruptions? No. NeilBrown > If a corrupted chunk is encountered, it would be taken from the second > drive and immediately synced back. This would have a small performance > and capacity impact (1 sector per chunk to minimize performance impact > caused by unaligned granularity = 0.78% of the capacity with 64k chunks). > > Please, let me know if you find my request reasonable or not. > > Thanks in advance. > > Regards, > Jaromir. > > -- > Jaromir Capik > Red Hat Czech, s.r.o. > Software Engineer / BaseOS > > Email: jcapik@redhat.com > Web: www.cz.redhat.com > Red Hat Czech s.r.o., Purkynova 99/71, 612 45, Brno, Czech Republic > IC: 27690016 > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 11:15 ` NeilBrown @ 2012-07-18 13:04 ` Jaromir Capik 2012-07-19 3:48 ` Stan Hoeppner 0 siblings, 1 reply; 36+ messages in thread From: Jaromir Capik @ 2012-07-18 13:04 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid Hello Neil. > > If a drive ever lets you read incorrect data rather than giving you > an error > indication, then the drive is broken by design. Don't use drives > that do > that. Unfortunately many drives do that. This happens transparently during the drive's idle surface checks, when there's no read request from the OS. I don't know how it is done in server harddrives, but I experienced data corruptions related to bad sector reallocations in case of several different desktop drives. I got no read errors at all even when the SMART attributes were showing reallocations. Somebody could ask, why people want to implement RAID on top of cheap desktop harddrives. It's surprisingly because of their price. Chunk checksums would give people a cheap and safe/robust solution. > > > Would it be possible > > to implement chunk checksums to avoid such data corruptions? > > No. I respect your decision. Thank you for your time. Jaromir. > > NeilBrown > > > > If a corrupted chunk is encountered, it would be taken from the > > second > > drive and immediately synced back. This would have a small > > performance > > and capacity impact (1 sector per chunk to minimize performance > > impact > > caused by unaligned granularity = 0.78% of the capacity with 64k > > chunks). > > > > Please, let me know if you find my request reasonable or not. > > > > Thanks in advance. > > > > Regards, > > Jaromir. > > > > -- > > Jaromir Capik > > Red Hat Czech, s.r.o. > > Software Engineer / BaseOS > > > > Email: jcapik@redhat.com > > Web: www.cz.redhat.com > > Red Hat Czech s.r.o., Purkynova 99/71, 612 45, Brno, Czech Republic > > IC: 27690016 > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe > > linux-raid" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 13:04 ` Jaromir Capik @ 2012-07-19 3:48 ` Stan Hoeppner 2012-07-20 12:53 ` Jaromir Capik 0 siblings, 1 reply; 36+ messages in thread From: Stan Hoeppner @ 2012-07-19 3:48 UTC (permalink / raw) To: Jaromir Capik; +Cc: NeilBrown, linux-raid On 7/18/2012 8:04 AM, Jaromir Capik wrote: > Unfortunately many drives do that. This happens transparently > during the drive's idle surface checks, Please list the SATA drives you have verified that perform firmware self initiated surface scans when idle, and transparently (to the OS) relocate bad sectors during this process. Then list the drives that have relocated sectors during such a process for which they could not read all the data, causing the silent data corruption you describe. > I experienced data corruptions related to bad sector reallocations > in case of several different desktop drives. Please name the drives, make/model/manufacturer, the drive count of the array, and the array type used when these silent corruptions occurred. For one user to experience silent corruption once is extremely rare. To experience it multiple times within a human lifetime is statistically impossible, unless you manage very large disk farms with high cap drives. If your multiple silent corruptions relate strictly to RAID1 pairs, it would seem the problem is not with the drives, but lay somewhere else. Unless you're using some el cheapo 3rd rate Asian sourced white label drives nobody ever heard of. One such company flooded the market with such drives in the mid 90s. I've not heard of anything similar since, but that doesn't mean such drives aren't in the wild. -- Stan ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-19 3:48 ` Stan Hoeppner @ 2012-07-20 12:53 ` Jaromir Capik 2012-07-20 18:24 ` Roberto Spadim 2012-07-21 3:58 ` Stan Hoeppner 0 siblings, 2 replies; 36+ messages in thread From: Jaromir Capik @ 2012-07-20 12:53 UTC (permalink / raw) To: stan; +Cc: NeilBrown, linux-raid > > Unfortunately many drives do that. This happens transparently > > during the drive's idle surface checks, > > Please list the SATA drives you have verified that perform firmware > self > initiated surface scans when idle, and transparently (to the OS) > relocate bad sectors during this process. > > Then list the drives that have relocated sectors during such a > process > for which they could not read all the data, causing the silent data > corruption you describe. I can't say I "have verified" that, since that doesn't happen every day and in such cases I'm trying to focus on saving my data. I accept it's my fault that I didn't have the energy to experiment with the failing drives more before returning them for warranty replacement. I just know that I had corrupted data on the clones while there were no I/O errors in any logs during the cloning. I experienced that mainly on systems without RAID (= with a single drive). One of my drives became unbootable due to MBR data corruption. There had been no intentional writes to that sector for a long time. I was able to read it with dd, I was able to clean it with zeroes with dd and I was able to create a new partition table with fdisk. All of these operations worked without problems and the number of reallocated sectors didn't increase when I was writing to that sector. I used to periodically check the SMART attributes by calling smartctl instead of retrieving emails from smartd, and I remember there were no reallocated sectors shortly before it happened. But they were present after the incident. That doesn't prove such behavior, but it seems to me that it's exactly what happened. I experienced data corruptions with the following drives: Seagate Barracuda 7200.7 series (120GB, 200GB, 250GB). Seagate U6 series (40GB). All of them were IDE drives. A Western Digital (320GB) ... a SATA one, I don't remember the exact type. And now I'm playing with a recently failed WDC WD2500AAJS-60M0A1 that was a member of a RAID1. In the last case I put the failing drive into a different computer and assembled two independent arrays in degraded mode, since it had got out of sync / kicked the healthy drive out of the RAID1 for an unknown reason. I then mounted partitions from the failing drive via sshfs and did a directory diff to find modifications made in the meantime and copy all the recently modified files from the failing (but more recent) drive to the healthy one. I found one patch file that had total binary mess inside on the failing drive, but that mess was still perfectly readable. And even if it was not caused by the drive itself, it's a data corruption that would hopefully be prevented by chunk checksums. > For one user to experience silent corruption once is extremely rare. > To > experience it multiple times within a human lifetime is statistically > impossible, unless you manage very large disk farms with high cap > drives. > > If your multiple silent corruptions relate strictly to RAID1 pairs, > it > would seem the problem is not with the drives, but lay somewhere > else. I admit that the problem could lie elsewhere ... but that doesn't change anything about the fact that the data became corrupted without me noticing it.
I don't feel good about what happened, because I trusted this solution a bit too much. Sorry if I seem too anxious. Regards, Jaromir. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-20 12:53 ` Jaromir Capik @ 2012-07-20 18:24 ` Roberto Spadim 2012-07-20 18:30 ` Roberto Spadim 2012-07-20 20:07 ` Jaromir Capik 2012-07-21 3:58 ` Stan Hoeppner 1 sibling, 2 replies; 36+ messages in thread From: Roberto Spadim @ 2012-07-20 18:24 UTC (permalink / raw) To: Jaromir Capik; +Cc: stan, NeilBrown, linux-raid IMO I think Jaromir is probably right about silent disk 'losts', it's not normal to lost data but it's possible (electronic problems, radioactive problems or another problem not related maybe lost of disk magnetic properties) since we are at block device layer (md) i don't know if we could/should implement a recovery algorithm or just a badblock report algorithm (checksum) i know it's not a 'normal' situation and it's not a property of raid1 implementations, but could be nice to implement it raid1 extended?! we have many mirrors (more than 2) that's not a normal implementation but it works really nice and help a lot in parallel work load maybe for a 'fast' solution you could use raid5 or raid6? while we discuss if this could/should/will not be implemented?! i think raid5/6 have checksums and others tools to get this type of problem while you can use your normal filesystem (ext3? ext4? reiser? xfs?) or direct the block device (a oracle database for example or mysql innodb) 2012/7/20 Jaromir Capik <jcapik@redhat.com> > > > > Unfortunately many drives do that. This happens transparently > > > during the drive's idle surface checks, > > > > Please list the SATA drives you have verified that perform firmware > > self > > initiated surface scans when idle, and transparently (to the OS) > > relocate bad sectors during this process. > > > > Then list the drives that have relocated sectors during such a > > process > > for which they could not read all the data, causing the silent data > > corruption you describe. > > I can't say I "have verified" that, since that doesn't happen everyday > and in such cases I'm trying to focus on saving my data. I accept > It's my fault that I had no mental power to play with the failing > drives more prior to returning them for warranty replacement. > I just know that I had corrupted data on the clones whilst there were > no I/O errors in any logs during the cloning. I experienced that > mainly on systems without RAID (=with single drive). One of my drives > became unbootable due to a MBR data corruption. There were no intentional > writes to that sector for a long time. I was able to read it by dd, > I was able to clean it with zeroes by dd and I was able to create > a new partition table with fdisk. All of these operations worked > without problems and the number of reallocated sectors didn't increase > when I was writing to that sector. I used to periodically check > the SMART attributes by calling smartctl instead of retrieving emails > from smartd and I remember there were no reallocated sectors shortly > before it happened. But they were present after the incident. > That doesn't verify such behavior, but I seems to me that it's exactly > what happened. > > I experienced data corruptions with the following drives: > Seagate Barracuda 7200.7 series (120GB, 200GB, 250GB). > Seagate U6 series (40GB). All of them were IDE drives. > Western Digital (320GB) ... SATA one, don't remember exact type. > And now I'm playing with recently failed WDC WD2500AAJS-60M0A1, > that was as member of RAID1. 
> > In the last case I put the failing drive to a different computer > and assembled two independent arrays in degraded mode since it got > out of sync / kicked the healthy drive out of the RAID1 for unknown > reason. I then mounted partitions from the failing drive via sshfs > and did a directory diff to find modification made in the meantime > and copy all the recently modified files from the failing (but more > recent) drive to the healthy one. I found one patch file, that had > a total binary mess inside on the failing drive, but that mess was > still perfectly readable. And even if it was not caused by the drive > itself, it's a data corruption that would be hopefully prevented > with chunk checksums. > > > For one user to experience silent corruption once is extremely rare. > > To > > experience it multiple times within a human lifetime is statistically > > impossible, unless you manage very large disk farms with high cap > > drives. > > > > If your multiple silent corruptions relate strictly to RAID1 pairs, > > it > > would seem the problem is not with the drives, but lay somewhere > > else. > > I admit, that the problem could lie elsewhere ... but that doesn't > change anything on the fact, that the data became corrupted without > me noticing that. I don't feel well when I see what happened because > I trusted this solution a bit too much. Sorry if I look too anxious. > > Regards, > Jaromir. > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Roberto Spadim Spadim Technology / SPAEmpresarial ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-20 18:24 ` Roberto Spadim @ 2012-07-20 18:30 ` Roberto Spadim 0 siblings, 0 replies; 36+ messages in thread From: Roberto Spadim @ 2012-07-20 18:30 UTC (permalink / raw) To: Jaromir Capik; +Cc: stan, NeilBrown, linux-raid Just some examples... searching on google (silent data loss), several people report silent loss, and the NEC study is useful to support Jaromir's report... http://www.necam.com/Docs/?id=54157ff5-5de8-4966-a99d-341cf2cb27d3 page 3) Silent data corruption Introduction There are certain types of storage errors that go completely unreported and undetected in other storage systems which result in corrupt data being provided to applications with no warning, logging, error messages or notification of any kind. Though the problem is frequently identified as a silent read failure, the root cause can be that the write failed, thus we refer to this class of errors as “silent data corruption.” These errors are difficult to detect and diagnose, yet what’s worse is they are actually fairly common in systems without an extended data integrity feature. -- Roberto Spadim Spadim Technology / SPAEmpresarial -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-20 18:24 ` Roberto Spadim 2012-07-20 18:30 ` Roberto Spadim @ 2012-07-20 20:07 ` Jaromir Capik 2012-07-20 20:21 ` Roberto Spadim 1 sibling, 1 reply; 36+ messages in thread From: Jaromir Capik @ 2012-07-20 20:07 UTC (permalink / raw) To: Roberto Spadim; +Cc: stan, NeilBrown, linux-raid > it's not normal to lost data but it's possible (electronic problems, > radioactive problems or another problem not related maybe lost of disk > magnetic properties) It might be also caused by a bug in the SATA controller driver. And nobody can be sure that there will be no new issues in case of future chipsets and their very first driver versions. > since we are at block device layer (md) i don't know if we > could/should implement a recovery algorithm or just a badblock report > algorithm (checksum) Direct recovery would be better since it doesn't cost much and lowers the possibility of data loss due to the second drive's failure. > maybe for a 'fast' solution you could use raid5 or raid6? while we > discuss if this could/should/will not be implemented?! > i think raid5/6 have checksums and others tools to get this type of > problem while you can use your normal filesystem (ext3? ext4? reiser? > xfs?) or direct the block device (a oracle database for example or > mysql innodb) RAID5/6 would need more drives than I actually have, right? There's not enough space for 3 drives in those small and cheap mini-ITX based home routers/servers I started building 3 years ago. Moreover that would mean a need for better cooling and a higher power consumption and that's something I'm exactly trying to avoid in this particular case. I slowly started to accept the idea, that I'll have to migrate my systems from mdraid to btrfs if there's no solution soon :( I don't like it much, but there's apparently nothing else I can do about that. > -- > Roberto Spadim > Spadim Technology / SPAEmpresarial > Thanks a lot for your answers and have a nice day. Regards, Jaromir. -- Jaromir Capik Red Hat Czech, s.r.o. Software Engineer / BaseOS Email: jcapik@redhat.com Web: www.cz.redhat.com Red Hat Czech s.r.o., Purkynova 99/71, 612 45, Brno, Czech Republic IC: 27690016 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-20 20:07 ` Jaromir Capik @ 2012-07-20 20:21 ` Roberto Spadim 2012-07-20 20:44 ` Jaromir Capik 0 siblings, 1 reply; 36+ messages in thread From: Roberto Spadim @ 2012-07-20 20:21 UTC (permalink / raw) To: Jaromir Capik; +Cc: stan, NeilBrown, linux-raid yeah for a 'fast' solution moving from one file system to another that works with theses checks can help you, while we check if this is usefull or not IMHO, if we implement this, we should implement outside any today raid levels, this should be done between device and filesystem, in others words: we should implement this to work like: DISKS - (OUR NEW SILENT ERROR SECURITY SYSTEM LEVEL, 1 PER DEVICE) - TODAY RAID LEVELS - FILESYSTEMS or DISKS - RAIDS LEVELS - (OUR NEW SILENT ERROR SECURITY SYSTEM LEVEL, 1 PER DEVICE) - FILESYSTEM or DISK - (OUR NEW SILENT ERROR SECURITY SYSTEM LEVEL, 1 PER DEVICE) - FILESYSTEM using this, we "could give more security" to usb pendrives for example, and any block device (network block device, DRBD, or anyother block device in linux) 2012/7/20 Jaromir Capik <jcapik@redhat.com>: >> it's not normal to lost data but it's possible (electronic problems, >> radioactive problems or another problem not related maybe lost of disk >> magnetic properties) > > It might be also caused by a bug in the SATA controller driver. > And nobody can be sure that there will be no new issues in case > of future chipsets and their very first driver versions. > > >> since we are at block device layer (md) i don't know if we >> could/should implement a recovery algorithm or just a badblock report >> algorithm (checksum) > > Direct recovery would be better since it doesn't cost much and lowers > the possibility of data loss due to the second drive's failure. > > >> maybe for a 'fast' solution you could use raid5 or raid6? while we >> discuss if this could/should/will not be implemented?! >> i think raid5/6 have checksums and others tools to get this type of >> problem while you can use your normal filesystem (ext3? ext4? reiser? >> xfs?) or direct the block device (a oracle database for example or >> mysql innodb) > > RAID5/6 would need more drives than I actually have, right? There's > not enough space for 3 drives in those small and cheap mini-ITX based home > routers/servers I started building 3 years ago. Moreover that would mean > a need for better cooling and a higher power consumption and that's > something I'm exactly trying to avoid in this particular case. > > I slowly started to accept the idea, that I'll have to migrate my > systems from mdraid to btrfs if there's no solution soon :( I don't > like it much, but there's apparently nothing else I can do about that. > >> -- >> Roberto Spadim >> Spadim Technology / SPAEmpresarial >> > > Thanks a lot for your answers and have a nice day. > > Regards, > Jaromir. > > -- > Jaromir Capik > Red Hat Czech, s.r.o. > Software Engineer / BaseOS > > Email: jcapik@redhat.com > Web: www.cz.redhat.com > Red Hat Czech s.r.o., Purkynova 99/71, 612 45, Brno, Czech Republic > IC: 27690016 > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Roberto Spadim Spadim Technology / SPAEmpresarial ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-20 20:21 ` Roberto Spadim @ 2012-07-20 20:44 ` Jaromir Capik 2012-07-20 20:59 ` Roberto Spadim 0 siblings, 1 reply; 36+ messages in thread From: Jaromir Capik @ 2012-07-20 20:44 UTC (permalink / raw) To: Roberto Spadim; +Cc: stan, NeilBrown, linux-raid > yeah for a 'fast' solution moving from one file system to another > that > works with theses checks can help you, while we check if this is > usefull or not > > IMHO, if we implement this, we should implement outside any today > raid > levels, this should be done between device and filesystem, in others > words: > > we should implement this to work like: > DISKS - (OUR NEW SILENT ERROR SECURITY SYSTEM LEVEL, 1 PER DEVICE) - > TODAY RAID LEVELS - FILESYSTEMS > > or > > DISKS - RAIDS LEVELS - (OUR NEW SILENT ERROR SECURITY SYSTEM LEVEL, 1 > PER DEVICE) - FILESYSTEM > > or > > DISK - (OUR NEW SILENT ERROR SECURITY SYSTEM LEVEL, 1 PER DEVICE) - > FILESYSTEM > > > using this, we "could give more security" to usb pendrives for > example, and any block device (network block device, DRBD, or > anyother > block device in linux) Well ... it looks more modular, easier and could have more usecases. You're probably right at this point. Dracut maintainers would kill us both, but that's a different story. I'm only missing that possibility of immediate resyncing of the data when a corruption is detected. That's probably the only thing, that would be nice to have directly in the RAID layer (and could/should be also optional). J. ^ permalink raw reply [flat|nested] 36+ messages in thread
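To make the idea concrete, the immediate resync Jaromir describes could look roughly like the following, reduced to ordinary Python file I/O purely for illustration: two files opened read-write stand in for the mirror halves, and a dict stands in for the interleaved checksum sectors. None of this is existing md or dm code, just a sketch of the read-verify-repair path being discussed.

  import hashlib

  CHUNK = 64 * 1024   # chunk size used as the example throughout this thread

  def read_with_repair(legs, checksums, chunk_no):
      """Read one chunk, verify it, and sync the good copy over any bad leg."""
      offset = chunk_no * CHUNK
      expected = checksums[chunk_no]   # stand-in for the on-disk checksum sector
      bad = []
      for leg in legs:                 # files opened with mode "r+b"
          leg.seek(offset)
          data = leg.read(CHUNK)
          if hashlib.sha256(data).digest() == expected:
              for other in bad:        # immediate resync of the corrupted copy
                  other.seek(offset)
                  other.write(data)
                  other.flush()
              return data
          bad.append(leg)              # checksum mismatch: try the next mirror
      raise IOError(f"chunk {chunk_no}: no mirror matches its checksum")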
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-20 20:44 ` Jaromir Capik @ 2012-07-20 20:59 ` Roberto Spadim 0 siblings, 0 replies; 36+ messages in thread From: Roberto Spadim @ 2012-07-20 20:59 UTC (permalink / raw) To: Jaromir Capik; +Cc: stan, NeilBrown, linux-raid sorry about the many posts, guys (that's not spam) well... we are discussing ideas... IMO, that's one more layer (ok, some developers like it, some don't); this implements the kind of safety layer some well-known hard drives already have, like ECC and checksums (maybe we will talk about intelligent SSD reallocation algorithms too...) since we are close to emulating a hard disk controller (with more tools), we could report block errors too, like a hard disk does the error correction could be done by the raid levels that implement block correction (resync or maybe bad block reallocation)... just ideas... it's hard to implement and must be well tested... like any new code 2012/7/20 Jaromir Capik <jcapik@redhat.com>: >> yeah for a 'fast' solution moving from one file system to another >> that >> works with theses checks can help you, while we check if this is >> usefull or not >> >> IMHO, if we implement this, we should implement outside any today >> raid >> levels, this should be done between device and filesystem, in others >> words: >> >> we should implement this to work like: >> DISKS - (OUR NEW SILENT ERROR SECURITY SYSTEM LEVEL, 1 PER DEVICE) - >> TODAY RAID LEVELS - FILESYSTEMS >> >> or >> >> DISKS - RAIDS LEVELS - (OUR NEW SILENT ERROR SECURITY SYSTEM LEVEL, 1 >> PER DEVICE) - FILESYSTEM >> >> or >> >> DISK - (OUR NEW SILENT ERROR SECURITY SYSTEM LEVEL, 1 PER DEVICE) - >> FILESYSTEM >> >> >> using this, we "could give more security" to usb pendrives for >> example, and any block device (network block device, DRBD, or >> anyother >> block device in linux) > > Well ... it looks more modular, easier and could have more usecases. > You're probably right at this point. Dracut maintainers would kill > us both, but that's a different story. > I'm only missing that possibility of immediate resyncing of the data > when a corruption is detected. That's probably the only thing, that > would be nice to have directly in the RAID layer (and could/should > be also optional). > > J. > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Roberto Spadim Spadim Technology / SPAEmpresarial -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-20 12:53 ` Jaromir Capik 2012-07-20 18:24 ` Roberto Spadim @ 2012-07-21 3:58 ` Stan Hoeppner 1 sibling, 0 replies; 36+ messages in thread From: Stan Hoeppner @ 2012-07-21 3:58 UTC (permalink / raw) To: Linux RAID On 7/20/2012 7:53 AM, Jaromir Capik wrote: > I admit, that the problem could lie elsewhere ... but that doesn't > change anything on the fact, that the data became corrupted without > me noticing that. The key here I think is "without me noticing that". Drives normally cry out in the night, spitting errors to logs, when they encounter problems. You may not receive an immediate error in your application, especially when the drive is a RAID member and the data can be shipped regardless of the drive error. If you never check your logs, or simply don't see these disk errors, how will you know there's a problem? Likewise, if the checksumming you request is implemented in md/RAID1, and your application never sees a problem when a drive heads South, and you never check your logs and thus don't see the checksum errors... How is this new checksumming any better than the current situation? The drive is still failing and you're still unaware of it. -- Stan ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 11:01 ` [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust Jaromir Capik 2012-07-18 11:13 ` Mathias Burén 2012-07-18 11:15 ` NeilBrown @ 2012-07-18 11:49 ` keld 2012-07-18 13:08 ` Jaromir Capik 2012-07-18 16:28 ` Asdo 3 siblings, 1 reply; 36+ messages in thread From: keld @ 2012-07-18 11:49 UTC (permalink / raw) To: Jaromir Capik; +Cc: linux-raid On Wed, Jul 18, 2012 at 07:01:48AM -0400, Jaromir Capik wrote: > Hello. > > I'd like to ask you to implement the following ... > > The current RAID1 solution is not robust enough to protect the data > against random data corruptions. Such corruptions usually happen > when an unreadable sector is found by the drive's electronics > and when the drive's trying to reallocate the sector to the spare area. > There's no guarantee that the reallocated data will always match > the original stored data since the drive sometimes can't read the data > correctly even with several retries. That unfortunately completely masks > the issue, because the sector can be read by the OS without problems > even if it doesn't contain correct data. Would it be possible > to implement chunk checksums to avoid such data corruptions? > If a corrupted chunk is encountered, it would be taken from the second > drive and immediately synced back. This would have a small performance > and capacity impact (1 sector per chunk to minimize performance impact > caused by unaligned granularity = 0.78% of the capacity with 64k chunks). > > Please, let me know if you find my request reasonable or not. I believe alternative to that is implemented via the Linux RAID MD badblock feature. best regards keld ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 11:49 ` keld @ 2012-07-18 13:08 ` Jaromir Capik 2012-07-18 16:08 ` Roberto Spadim 2012-07-18 21:02 ` keld 0 siblings, 2 replies; 36+ messages in thread From: Jaromir Capik @ 2012-07-18 13:08 UTC (permalink / raw) To: keld; +Cc: linux-raid > > Hello. > > > > I'd like to ask you to implement the following ... > > > > The current RAID1 solution is not robust enough to protect the data > > against random data corruptions. Such corruptions usually happen > > when an unreadable sector is found by the drive's electronics > > and when the drive's trying to reallocate the sector to the spare > > area. > > There's no guarantee that the reallocated data will always match > > the original stored data since the drive sometimes can't read the > > data > > correctly even with several retries. That unfortunately completely > > masks > > the issue, because the sector can be read by the OS without > > problems > > even if it doesn't contain correct data. Would it be possible > > to implement chunk checksums to avoid such data corruptions? > > If a corrupted chunk is encountered, it would be taken from the > > second > > drive and immediately synced back. This would have a small > > performance > > and capacity impact (1 sector per chunk to minimize performance > > impact > > caused by unaligned granularity = 0.78% of the capacity with 64k > > chunks). > > > > Please, let me know if you find my request reasonable or not. > > I believe alternative to that is implemented via the Linux RAID MD > badblock feature. Hello keld ... I couldn't find any info about that feature. Could you please give me more info about that? Thanks in advance. Regards, Jaromir. > > best regards > keld > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 13:08 ` Jaromir Capik @ 2012-07-18 16:08 ` Roberto Spadim 2012-07-20 10:35 ` Jaromir Capik 2012-07-18 21:02 ` keld 1 sibling, 1 reply; 36+ messages in thread From: Roberto Spadim @ 2012-07-18 16:08 UTC (permalink / raw) To: Jaromir Capik; +Cc: keld, linux-raid yeah, i think this data corruption could/should be implemented as badblocks... do you have a disk that read blocks with wrong data like you told? if yes, could you check if it have bad blocks? (via some software, since i don´t know if linux kernel will report it as badblock on dmesg or something else) some disk manufacturers have MSDOS compatible program to check disk badblocks and others features... check if it´s a real bad block, or a disk problem (controller/data comunication) 2012/7/18 Jaromir Capik <jcapik@redhat.com> > > > > Hello. > > > > > > I'd like to ask you to implement the following ... > > > > > > The current RAID1 solution is not robust enough to protect the data > > > against random data corruptions. Such corruptions usually happen > > > when an unreadable sector is found by the drive's electronics > > > and when the drive's trying to reallocate the sector to the spare > > > area. > > > There's no guarantee that the reallocated data will always match > > > the original stored data since the drive sometimes can't read the > > > data > > > correctly even with several retries. That unfortunately completely > > > masks > > > the issue, because the sector can be read by the OS without > > > problems > > > even if it doesn't contain correct data. Would it be possible > > > to implement chunk checksums to avoid such data corruptions? > > > If a corrupted chunk is encountered, it would be taken from the > > > second > > > drive and immediately synced back. This would have a small > > > performance > > > and capacity impact (1 sector per chunk to minimize performance > > > impact > > > caused by unaligned granularity = 0.78% of the capacity with 64k > > > chunks). > > > > > > Please, let me know if you find my request reasonable or not. > > > > I believe alternative to that is implemented via the Linux RAID MD > > badblock feature. > > Hello keld ... > > I couldn't find any info about that feature. Could you please give > me more info about that? > > Thanks in advance. > > Regards, > Jaromir. > > > > > best regards > > keld > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Roberto Spadim Spadim Technology / SPAEmpresarial -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 16:08 ` Roberto Spadim @ 2012-07-20 10:35 ` Jaromir Capik 0 siblings, 0 replies; 36+ messages in thread From: Jaromir Capik @ 2012-07-20 10:35 UTC (permalink / raw) To: Roberto Spadim; +Cc: keld, linux-raid > yeah, i think this data corruption could/should be implemented as > badblocks... > do you have a disk that read blocks with wrong data like you told? All of them were replaced during the warranty period ... but it seems I have a new candidate. I'll use it for my tests. I'll write specific data there and then keep reading it back with sufficient idle intervals in between until I get either a read error or corrupted data without read errors. > if yes, could you check if it have bad blocks? (via some software, > since i don´t know if linux kernel will report it as badblock on > dmesg or something else) I always check S.M.A.R.T. attributes, and all of the drives reported reallocated and pending sectors, while in some cases there were no uncorrectable sectors reported. I remember that one of the drives stopped booting because of MBR corruption, but the sector was readable with dd without problems. I could also clean it and create a new partition table with fdisk (but the SMART attributes didn't change with the new write operation). That really looks like a reallocation was done prior to my checks, even though reallocations should happen only during writes, and I'm sure there was absolutely no need to write to the MBR. I suspect that some drive firmwares do the reallocation transparently while idle. Especially Seagate drives with capacities around 200GB can be heard doing their own surface checks when they're idle. Maybe that's the intention of the manufacturers. I could imagine they don't want people to claim drive replacements and thus they're trying to cover the issues up. I also believe that the SMART attributes might be intentionally misreported by the firmware. The drive's electronics might be transparently doing a lot of internal stuff, dependent on the particular drive's internal design, that can't be easily mapped to any of the SMART attributes and thus isn't reported at all. You know, nobody can make the manufacturers follow the rules ... moreover, there might be a design/firmware bug or something else preventing the drive from working correctly in some cases. I can imagine many different scenarios since I was a hardware designer for almost 10 years, and writing firmware for a conceptually wrong hardware design might be the worst nightmare you could ever imagine. And low-price device designs often cut corners and are full of workarounds. Anyway ... I believe that relying on hardware that is unreliable by nature might be considered a conceptual issue of the current MD-RAID layer. > -- > Roberto Spadim > Spadim Technology / SPAEmpresarial > Regards, Jaromir. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
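The write-then-reread test sketched above is easy to script. A rough version (Python 3.9+; the device path is a placeholder and the script destroys whatever it is pointed at; a real test would also need O_DIRECT or a page-cache drop so the re-reads actually hit the platters rather than RAM):

  import random, time

  DEV = "/dev/sdX"          # placeholder for the suspect drive
  BLOCK = 64 * 1024
  COUNT = 1024              # number of test blocks to lay down
  INTERVAL = 6 * 3600       # idle time between verification passes, in seconds

  def pattern(i):           # deterministic, reproducible content for block i
      random.seed(i)
      return random.randbytes(BLOCK)

  with open(DEV, "wb") as f:            # write the known data once
      for i in range(COUNT):
          f.write(pattern(i))

  while True:
      time.sleep(INTERVAL)              # let the drive idle and do its own housekeeping
      with open(DEV, "rb") as f:
          for i in range(COUNT):
              if f.read(BLOCK) != pattern(i):
                  print(f"block {i}: silent mismatch, no read error reported")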
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 13:08 ` Jaromir Capik 2012-07-18 16:08 ` Roberto Spadim @ 2012-07-18 21:02 ` keld 1 sibling, 0 replies; 36+ messages in thread From: keld @ 2012-07-18 21:02 UTC (permalink / raw) To: Jaromir Capik; +Cc: linux-raid On Wed, Jul 18, 2012 at 09:08:51AM -0400, Jaromir Capik wrote: > > > Hello. > > > > > > I'd like to ask you to implement the following ... > > > > > > The current RAID1 solution is not robust enough to protect the data > > > against random data corruptions. Such corruptions usually happen > > > when an unreadable sector is found by the drive's electronics > > > and when the drive's trying to reallocate the sector to the spare > > > area. > > > There's no guarantee that the reallocated data will always match > > > the original stored data since the drive sometimes can't read the > > > data > > > correctly even with several retries. That unfortunately completely > > > masks > > > the issue, because the sector can be read by the OS without > > > problems > > > even if it doesn't contain correct data. Would it be possible > > > to implement chunk checksums to avoid such data corruptions? > > > If a corrupted chunk is encountered, it would be taken from the > > > second > > > drive and immediately synced back. This would have a small > > > performance > > > and capacity impact (1 sector per chunk to minimize performance > > > impact > > > caused by unaligned granularity = 0.78% of the capacity with 64k > > > chunks). > > > > > > Please, let me know if you find my request reasonable or not. > > > > I believe alternative to that is implemented via the Linux RAID MD > > badblock feature. > > Hello keld ... > > I couldn't find any info about that feature. Could you please give > me more info about that? I do believe it is already implemented. I do not have the documentation. But have a look in newer mdadm documentation or in the kernel sources, or in he archives for this email list. If you do find useful info, then please tell it here, and we can put the info on our wiki. The wiki is now open again for modifications. If you find out how to use it we could also add info on that to the wiki. best regards keld ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 11:01 ` [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust Jaromir Capik ` (2 preceding siblings ...) 2012-07-18 11:49 ` keld @ 2012-07-18 16:28 ` Asdo 2012-07-20 11:07 ` Jaromir Capik 3 siblings, 1 reply; 36+ messages in thread From: Asdo @ 2012-07-18 16:28 UTC (permalink / raw) To: Jaromir Capik; +Cc: linux-raid On 07/18/12 13:01, Jaromir Capik wrote: > Hello. > > I'd like to ask you to implement the following ... > > The current RAID1 solution is not robust enough to protect the data > against random data corruptions. Such corruptions usually happen > when an unreadable sector is found by the drive's electronics > and when the drive's trying to reallocate the sector to the spare area. > There's no guarantee that the reallocated data will always match > the original stored data since the drive sometimes can't read the data > correctly even with several retries. That unfortunately completely masks > the issue, because the sector can be read by the OS without problems > even if it doesn't contain correct data. Would it be possible > to implement chunk checksums to avoid such data corruptions? > If a corrupted chunk is encountered, it would be taken from the second > drive and immediately synced back. This would have a small performance > and capacity impact (1 sector per chunk to minimize performance impact > caused by unaligned granularity = 0.78% of the capacity with 64k chunks). > > Please, let me know if you find my request reasonable or not. > > Thanks in advance. > > Regards, > Jaromir. > This is a very invasive change that you ask, conceptually, man-hours-wise, performance-wise, ondisk-format wise and space-wise. Also it really should stay at another layer, preferably below the RAID (btrfs and zfs do this above it, though). This should probably be a DM/LVM project. Drives do this already; they have checksums (google for Reed-Solomon). If the checksums are not long enough you should use different drives. But in my life I have never seen a "silent data corruption" like the one you describe. Also, statistically speaking, if one disk checksum returns a false positive the drive is very likely dying, because it takes very many bit flips to bypass the Reed-Solomon check, so other sectors on the same drive have almost certainly given read errors and you should have replaced the drive long ago. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-18 16:28 ` Asdo @ 2012-07-20 11:07 ` Jaromir Capik 2012-07-20 11:14 ` Oliver Schinagl 2012-07-20 11:28 ` Jaromir Capik 0 siblings, 2 replies; 36+ messages in thread From: Jaromir Capik @ 2012-07-20 11:07 UTC (permalink / raw) To: Asdo; +Cc: linux-raid > This is a very invasive change that you ask, conceptually, > man-hours-wise, performance-wise, ondisk-format wise, space-wise Yes ... I'm aware of the possibly high number of man-hours. If we talk about space ... 0.78% is not so invasive, is it? On-disk format ... interleaving chunks with checksum sectors doesn't seem to me like complicated math ... chunk_starting_sector = chunk_number * (chunk_size_in_sectors + 1) ... of course this is relative to the chunk area offset. > also it really should stay at another layer, preferably below the > RAID but how would you implement that if the lower level is known to be unreliable? > (btrfs and zfs do this above though). Btrfs and zfs have their own RAID layers, so there's no need for an underlying MD-RAID. But I haven't studied how exactly it's done there. > This should probably be a > DM/LVM > project. LVM? How would you implement that in LVM? You would create two big PVs with two big logical volumes protected by checksums? The mdraid layer would be built on top of these, right? That could possibly work too if LVM returns read errors for blocks with incorrect checksums. I'm not fully against that idea. > > Drives do this already, they have checksums (google for > reed-solomon). > If the checksums are not long enough you should use different drives. > But in my life I never saw a "silent data corruption" like the one > you say. I believe I've mentioned my experience with such nasty HDD behaviour in my previous email. I also don't like it, but it apparently happens and we can't rely on the hardware functioning properly, especially when it's unreliable by nature. Regards, Jaromir. ^ permalink raw reply [flat|nested] 36+ messages in thread
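Spelled out, the interleaving arithmetic from the message above (purely illustrative of the proposed layout, not an existing md format; offsets are relative to the start of the chunk area, with 64 KiB chunks and 512-byte sectors assumed):

  SECTORS_PER_CHUNK = 128   # 64 KiB chunks / 512-byte sectors

  def chunk_start_sector(chunk_no):
      # each data chunk is followed by its single checksum sector
      return chunk_no * (SECTORS_PER_CHUNK + 1)

  def checksum_sector(chunk_no):
      return chunk_start_sector(chunk_no) + SECTORS_PER_CHUNK

  def logical_to_member(logical_sector):
      # map a sector of the exported array onto the member device
      chunk_no, within = divmod(logical_sector, SECTORS_PER_CHUNK)
      return chunk_start_sector(chunk_no) + within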
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-20 11:07 ` Jaromir Capik @ 2012-07-20 11:14 ` Oliver Schinagl 2012-07-20 11:28 ` Jaromir Capik 1 sibling, 0 replies; 36+ messages in thread From: Oliver Schinagl @ 2012-07-20 11:14 UTC (permalink / raw) To: Jaromir Capik; +Cc: Asdo, linux-raid On 20-07-12 13:07, Jaromir Capik wrote: >> This is a very invasive change that you ask, conceptually, >> man-hours-wise, performance-wise, ondisk-format wise, space-wise > Yes ... I'm aware of possibly high number of man-hours. > If we talk about space ... 0.78% is not so invasive, is it? > On-disk format ... interleaving chunks with checksum sectors doesn't > seem to me a complicated math ... > > chunk_starting_sector = chunk_number * (chunk_size_in_sectors + 1) > > ... of course this is relative to the chunk area offset. > >> also it really should stay at another layer, preferably below the >> RAID > but how would you like to implement that if the lower level is known > to be unreliable enough? > >> (btrfs and zfs do this above though). > Btrfs and zfs has it's own RAID layer, so there's no need for > underlying MD-RAID. But I haven't studied how exactly it's done > there. > >> This should probably be a >> DM/LVM >> project. > LVM ? How do you want to implement that in LVM? You would create > two big PVs with two big logical partitions protected by checksums? > The mdraid layer would be built on top of these, right? > That could possibly work too if LVM returns read errors for blocks > with incorrect checksums. I'm not fully against that idea. > >> Drives do this already, they have checksums (google for >> reed-solomon). >> If the checksums are not long enough you should use different drives. >> But in my life I never saw a "silent data corruption" like the one >> you say. > I believe I've mentioned my experience with such nasty HDD behaviour > in my previous email. I also don't like that, but it apparently > happens and we can't rely on the proper hardware functioning > especially when it's unreliable by nature. Actually, I've had quite some dataloss due to a hardrive/controller/cabling not working properly (no clue what caused it) but raid5 never complained. To this date, I do not know what happened and why my data was corrupt. > Regards, > Jaromir. > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-20 11:07 ` Jaromir Capik 2012-07-20 11:14 ` Oliver Schinagl @ 2012-07-20 11:28 ` Jaromir Capik 1 sibling, 0 replies; 36+ messages in thread From: Jaromir Capik @ 2012-07-20 11:28 UTC (permalink / raw) To: Asdo; +Cc: linux-raid > Btrfs and zfs have their own RAID layers, so there's no need for > an underlying MD-RAID. But I haven't studied how exactly it's done > there. I just read some ZFS docs; look at the following paragraph ... --- Mirrored Vdev’s (RAID1) This is akin to RAID1. If you mirror a pair of Vdev’s (each Vdev is usually a single hard drive) it is just like RAID1, except you get the added bonus of automatic checksumming. This prevents silent data corruption that is usually undetectable by most hardware RAID cards. --- As you can see, that's not just my imagination ... It seems I'm not the only one who has encountered silent data corruption. And it doesn't matter what the root cause of such corruptions is. They simply appear from time to time, and checksums seem to prevent them from being silently ignored. > > This should probably be a > > DM/LVM > > project. > > LVM? How would you implement that in LVM? You would create > two big PVs with two big logical volumes protected by checksums? > The mdraid layer would be built on top of these, right? > That could possibly work too if LVM returns read errors for blocks > with incorrect checksums. I'm not fully against that idea. It just occurred to me that this wouldn't allow us to resync the correct data immediately back to the drive where the corruption appeared. So ... I still believe that the RAID layer is the best place for this feature. Regards, Jaromir. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust [not found] <1082734092.338339.1342995087426.JavaMail.root@redhat.com> @ 2012-07-23 4:29 ` Stan Hoeppner 2012-07-23 9:34 ` Jaromir Capik 0 siblings, 1 reply; 36+ messages in thread From: Stan Hoeppner @ 2012-07-23 4:29 UTC (permalink / raw) To: Jaromir Capik, Linux RAID Please keep discussion on list. This is probably an MUA issue. Happens to me on occasion when I hit "reply to list" instead of "reply to all". vger doesn't provide a List-Post: header so "reply to list" doesn't work and you end up replying to the sender. On 7/22/2012 5:11 PM, Jaromir Capik wrote: >>> I admit, that the problem could lie elsewhere ... but that doesn't >>> change anything on the fact, that the data became corrupted without >>> me noticing that. >> >> The key here I think is "without me noticing that". Drives normally >> cry >> out in the night, spitting errors to logs, when they encounter >> problems. >> You may not receive an immediate error in your application, >> especially >> when the drive is a RAID member and the data can be shipped >> regardless >> of the drive error. If you never check your logs, or simply don't >> see >> these disk errors, how will you know there's a problem? > > Hello Stan. > > I used to periodically check logs as well as S.M.A.R.T. attributes. > And I believe I've already mentioned two of the cases and how > I finally discovered the issues. Moreover I switched from manual > checking to receiving emails from monitoring daemons. And even > if you receive such email, it usually takes some time to replace > the failing drive. That time window might be fatal for your data > if junk is read from one of the drives and when it's followed > by a write. Such write would destroy the second correct copy ... > >> >> Likewise, if the checksumming you request is implemented in md/RAID1, >> and your application never sees a problem when a drive heads South, >> and >> you never check your logs and thus don't see the checksum errors... > > You wouldn't have to ... because the corrupted chunks would be > immediately resynced with good data and you'll REALLY get some errors > in the logs if the harddrive or controller or it's driver doesn't > produce them for whatever reason. > >> >> How is this new checksumming any better than the current situation? >> The >> drive is still failing and you're still unaware of it. > > Do you believe, that other reasons of silent data corruptions simply > do not exist? Try to imagine a case, when the correct data aren't > written at all to one of the drives due to a bug in the drive's firmware > or due to a bug in the controller design or due to a bug in the > controller driver or due to other reasons. Such bug could be tiggered > by anything ... it could be a delay in the read operation when the > sector is not well readable or any race condition, etc. Especially > new devices and their very first versions are expected to be buggy. > Checksuming would prevent them all and would make the whole > I/O really bulletproof. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-23 4:29 ` Stan Hoeppner @ 2012-07-23 9:34 ` Jaromir Capik 2012-07-23 10:53 ` Stan Hoeppner 2012-07-23 17:03 ` Piergiorgio Sartor 0 siblings, 2 replies; 36+ messages in thread From: Jaromir Capik @ 2012-07-23 9:34 UTC (permalink / raw) To: stan; +Cc: Linux RAID Hello Stan. I received your reply without having the Linux RAID list in Cc and thus I was unsure if you wanna discuss that privately or not. I always choose reply to all unless I really want to remove some of the recipients :] Cheers, Jaromir. > > Please keep discussion on list. This is probably an MUA issue. > Happens > to me on occasion when I hit "reply to list" instead of "reply to > all". > vger doesn't provide a List-Post: header so "reply to list" doesn't > work and you end up replying to the sender. > > On 7/22/2012 5:11 PM, Jaromir Capik wrote: > >>> I admit, that the problem could lie elsewhere ... but that > >>> doesn't > >>> change anything on the fact, that the data became corrupted > >>> without > >>> me noticing that. > >> > >> The key here I think is "without me noticing that". Drives > >> normally > >> cry > >> out in the night, spitting errors to logs, when they encounter > >> problems. > >> You may not receive an immediate error in your application, > >> especially > >> when the drive is a RAID member and the data can be shipped > >> regardless > >> of the drive error. If you never check your logs, or simply don't > >> see > >> these disk errors, how will you know there's a problem? > > > > Hello Stan. > > > > I used to periodically check logs as well as S.M.A.R.T. attributes. > > And I believe I've already mentioned two of the cases and how > > I finally discovered the issues. Moreover I switched from manual > > checking to receiving emails from monitoring daemons. And even > > if you receive such email, it usually takes some time to replace > > the failing drive. That time window might be fatal for your data > > if junk is read from one of the drives and when it's followed > > by a write. Such write would destroy the second correct copy ... > > > >> > >> Likewise, if the checksumming you request is implemented in > >> md/RAID1, > >> and your application never sees a problem when a drive heads > >> South, > >> and > >> you never check your logs and thus don't see the checksum > >> errors... > > > > You wouldn't have to ... because the corrupted chunks would be > > immediately resynced with good data and you'll REALLY get some > > errors > > in the logs if the harddrive or controller or it's driver doesn't > > produce them for whatever reason. > > > >> > >> How is this new checksumming any better than the current > >> situation? > >> The > >> drive is still failing and you're still unaware of it. > > > > Do you believe, that other reasons of silent data corruptions > > simply > > do not exist? Try to imagine a case, when the correct data aren't > > written at all to one of the drives due to a bug in the drive's > > firmware > > or due to a bug in the controller design or due to a bug in the > > controller driver or due to other reasons. Such bug could be > > tiggered > > by anything ... it could be a delay in the read operation when the > > sector is not well readable or any race condition, etc. Especially > > new devices and their very first versions are expected to be buggy. > > Checksuming would prevent them all and would make the whole > > I/O really bulletproof. 
> > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-23 9:34 ` Jaromir Capik @ 2012-07-23 10:53 ` Stan Hoeppner 2012-07-23 17:03 ` Piergiorgio Sartor 1 sibling, 0 replies; 36+ messages in thread From: Stan Hoeppner @ 2012-07-23 10:53 UTC (permalink / raw) To: Jaromir Capik; +Cc: Linux RAID On 7/23/2012 4:34 AM, Jaromir Capik wrote: > Hello Stan. > > I received your reply without having the Linux RAID list in Cc > and thus I was unsure if you wanna discuss that privately or not. > I always choose reply to all unless I really want to remove > some of the recipients :] When you saw the same message also arrive via the list, that wasn't a clue. ;) -- Stan ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-23 9:34 ` Jaromir Capik 2012-07-23 10:53 ` Stan Hoeppner @ 2012-07-23 17:03 ` Piergiorgio Sartor 2012-07-23 18:24 ` Roberto Spadim 1 sibling, 1 reply; 36+ messages in thread From: Piergiorgio Sartor @ 2012-07-23 17:03 UTC (permalink / raw) To: Jaromir Capik; +Cc: stan, Linux RAID Hi all, actually, what you would like to do is already possible, albeit it will kill the performance of a rotating, mechanical HDD. With an SSD it might work better. If you take an HDD and partition it, let's say with 100 partitions (GPT will be required), then you can build a RAID-6 using these 100 partitions, giving a redundancy of 2%. Taking two, or more, of such configured RAID-6 arrays, it will be possible to build a RAID-1 (or something else) with them. If a check of this RAID-1 returns mismatches, it will be possible to check the individual devices and find out which one is not OK. With RAID-6 (per device), and a bit of luck, it will be possible to fix it directly. Of course, a lot of variables are tunable here: for example the number of partitions, the chunk size, or even the fact that with X partitions it could be possible to build more than one RAID-6, increasing the effective redundancy. All with the performance price I mentioned at the beginning. bye, -- piergiorgio ^ permalink raw reply [flat|nested] 36+ messages in thread
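To put rough numbers on this layout: the sketch below is nothing more than the arithmetic behind the 2% figure, for a hypothetical disk split into N equal GPT partitions assembled into a single RAID-6. It is not a measured or recommended configuration, and it ignores the seek-thrashing cost Piergiorgio warns about.

    # Back-of-the-envelope numbers for one disk split into N equal partitions
    # assembled as a single RAID-6 (illustration only, hypothetical sizes).
    def per_disk_raid6(disk_bytes, n_parts):
        part = disk_bytes // n_parts
        return {
            "partitions": n_parts,
            "parity_overhead_pct": 100.0 * 2 / n_parts,   # RAID-6 = 2 parity members
            "usable_bytes": (n_parts - 2) * part,
            "rebuildable_bad_partitions": 2,              # per disk, before data loss
        }

    for n in (10, 50, 100):
        print(per_disk_raid6(2 * 10**12, n))              # e.g. a 2 TB drive

With 100 partitions the parity overhead is 2%, and up to two damaged partition-sized regions per disk can be rebuilt locally before the outer RAID-1 between whole disks is even consulted.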
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-23 17:03 ` Piergiorgio Sartor @ 2012-07-23 18:24 ` Roberto Spadim 2012-07-23 21:31 ` Drew 2012-07-24 15:09 ` Jaromir Capik 0 siblings, 2 replies; 36+ messages in thread From: Roberto Spadim @ 2012-07-23 18:24 UTC (permalink / raw) To: Piergiorgio Sartor; +Cc: Jaromir Capik, stan, Linux RAID Yeah, I think this too, but IMO Jaromir described a specific scenario; let's get back to it first and check a generic scenario afterwards. He is using small computers (I don't know if they're ARM or x86) with space for only 2 disks (I told him to use RAID5 or RAID6 because of the checksums, but he doesn't have space for >=3 disks in the computer case; maybe if we could run RAID5 with 2 disks it could help ... or 1 disk ... just a silly idea, but it could help ...). I don't know the real scenario, but I don't think he would use 100 partitions, maybe 4 or 5 partitions, and performance would be a secondary concern; security is the priority here. In the implementation of this new layer (maybe like LINEAR, MULTIPATH or another non-RAID level) we could focus on security first and performance afterwards. Just some ideas ... 2012/7/23 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> > > Hi all, > > actually, what you would like to do is already > possible, albeit it will kill the performance > of a rotating, mechanical, HDD. > With SSD might work better. > > If you take an HDD and partition it, let's say > with 100 partitions (GPT will be required), > then you can build a RAID-6 using this 100 > partitions, having a redundancy of 2%. > Taking two, or more, of such configured RAID-6, > it will be possible to build (with them) a > RAID-1 (or else). > > If a check of this RAID-1 returns mismatches, > it will be possible to check the single devices > and find out which is not OK. > With RAID-6 (per device), and a bit of luck, it > will be possible to fix it directly. > > Of course a lot of variables are tunable here. > For example the number of partitions, the chunk > size, or even the fact that with X partitions > it could be possible to build more than one RAID-6, > increasing the effective redundancy. > > All with the performance price I mentioned at the > beginning. > > bye, > > -- > > piergiorgio > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Roberto Spadim Spadim Technology / SPAEmpresarial -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-23 18:24 ` Roberto Spadim @ 2012-07-23 21:31 ` Drew 2012-07-23 21:42 ` Roberto Spadim ` (2 more replies) 2012-07-24 15:09 ` Jaromir Capik 1 sibling, 3 replies; 36+ messages in thread From: Drew @ 2012-07-23 21:31 UTC (permalink / raw) To: Linux RAID Been mulling this problem over and I keep getting hung up on one problem with ECC on a two-disk RAID1 setup. In the event of silent corruption of one disk, which one is the good copy? It works fine if the ECC code is identical across both mirrors. Just checksum both chunks and discard the incorrect one. It also works fine if the ECC codes are corrupted but the data chunks are identical. Discard the bad checksum. What if the corruption goes across several sectors and both data & ECC chunks are corrupted? Now you're back to square one. -- Drew "Nothing in life is to be feared. It is only to be understood." --Marie Curie ^ permalink raw reply [flat|nested] 36+ messages in thread
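Drew's three cases (and the unresolvable fourth) can be written out as a small decision function. It assumes the hypothetical per-chunk checksum layout discussed in this thread, with each mirror handing back its data chunk together with the checksum it has stored; this is only a sketch of the arbitration policy, not anything md implements.

    # Arbitration between two mirror copies, given (data, stored_checksum)
    # from each.  Purely illustrative; "A"/"B" name the mirrors to rewrite.
    import zlib

    def pick_copy(a, b):
        ok_a = zlib.crc32(a[0]) == a[1]
        ok_b = zlib.crc32(b[0]) == b[1]
        if ok_a and ok_b:
            return a[0], []          # both self-consistent (they could still differ)
        if ok_a:
            return a[0], ["B"]       # B failed its check: resync data+checksum from A
        if ok_b:
            return b[0], ["A"]
        if a[0] == b[0]:
            return a[0], ["A", "B"]  # data agrees, only the stored checksums are bad
        return None                  # Drew's square one: report a read error upward

The last branch is the honest answer to the question above: with only two copies and no majority, the layer can only refuse the read, which is still an improvement over silently returning one of the two candidates.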
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-23 21:31 ` Drew @ 2012-07-23 21:42 ` Roberto Spadim 2012-07-24 4:42 ` Stan Hoeppner 2012-07-27 6:06 ` Adam Goryachev 2 siblings, 0 replies; 36+ messages in thread From: Roberto Spadim @ 2012-07-23 21:42 UTC (permalink / raw) To: Drew; +Cc: Linux RAID That's the point ... in a few words ... 2012/7/23 Drew <drew.kay@gmail.com>: > Been mulling this problem over and I keep getting hung up on one > problem with ECC on a two disk RAID1 setup. > > In the event of silent corruption of one disk, which one is the good copy? > > It works fine if the ECC code is identical across both mirrors. Just > checksum both chunks and discard the incorrect one. Nice, we can recover the data =) > > It also works fine if the ECC codes are corrupted but the data chunks > are identical. Discard the bad checksum. Nice, we can recover the data here too =) > > What if the corruption goes across several sectors and both data & ECC > chuncks are corrupted? Now you're back to square one. Report a bad block to the upper layer (file system, md raid, LVM or any other process). The same should happen with a hard disk with known corrupted data, but in this case we know it's wrong and we can report it! =) That's the nice part! Very different from "silent data corruption", where no alert, warning or error is reported; that's the bad part today ... OK ... you will point to the NEC report? They have this 'software security' in firmware (I think we could make something similar). > -- > Drew > > "Nothing in life is to be feared. It is only to be understood." > --Marie Curie > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Roberto Spadim Spadim Technology / SPAEmpresarial -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-23 21:31 ` Drew 2012-07-23 21:42 ` Roberto Spadim @ 2012-07-24 4:42 ` Stan Hoeppner 2012-07-24 12:51 ` Roberto Spadim 2012-07-27 6:06 ` Adam Goryachev 2 siblings, 1 reply; 36+ messages in thread From: Stan Hoeppner @ 2012-07-24 4:42 UTC (permalink / raw) To: Drew; +Cc: Linux RAID On 7/23/2012 4:31 PM, Drew wrote: > What if the corruption goes across several sectors and both data & ECC > chuncks are corrupted? What if the 'silent' corruption spans 10 million sectors (~5GB)? -- Stan ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-24 4:42 ` Stan Hoeppner @ 2012-07-24 12:51 ` Roberto Spadim 0 siblings, 0 replies; 36+ messages in thread From: Roberto Spadim @ 2012-07-24 12:51 UTC (permalink / raw) To: stan; +Cc: Drew, Linux RAID 10 million bad blocks (~5 GB of lost information): note that we are still talking about one device (it can be a disk, a partition, a raid1, a raid0, nbd, drbd, or anything else). 2012/7/24 Stan Hoeppner <stan@hardwarefreak.com>: > On 7/23/2012 4:31 PM, Drew wrote: > >> What if the corruption goes across several sectors and both data & ECC >> chuncks are corrupted? > > What if the 'silent' corruption spans 10 million sectors (~5GB)? > > -- > Stan > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Roberto Spadim Spadim Technology / SPAEmpresarial ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-23 21:31 ` Drew 2012-07-23 21:42 ` Roberto Spadim 2012-07-24 4:42 ` Stan Hoeppner @ 2012-07-27 6:06 ` Adam Goryachev 2012-07-27 13:42 ` Roberto Spadim 2 siblings, 1 reply; 36+ messages in thread From: Adam Goryachev @ 2012-07-27 6:06 UTC (permalink / raw) To: Linux RAID On 24/07/12 07:31, Drew wrote: > Been mulling this problem over and I keep getting hung up on one > problem with ECC on a two disk RAID1 setup. > > In the event of silent corruption of one disk, which one is the good > copy? > > It works fine if the ECC code is identical across both mirrors. Just > checksum both chunks and discard the incorrect one. > > It also works fine if the ECC codes are corrupted but the data > chunks are identical. Discard the bad checksum. > > What if the corruption goes across several sectors and both data & > ECC chuncks are corrupted? Now you're back to square one. I know I'm a bit late to this discussion, and I know very little about the code level/etc... however, I thought the whole point of the checksum is to determine that the data + checksum do not match, therefore the data is wrong and should be discarded. You would re-write the data and checksum from another source (ie, the other drive in RAID1, or other drives in RAID5/6 etc...). ie, it should be treated the same as a bad block / non-readable sector (or lots of unreadable sectors....) Regards, Adam -- Adam Goryachev Website Managers www.websitemanagers.com.au ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-27 6:06 ` Adam Goryachev @ 2012-07-27 13:42 ` Roberto Spadim 0 siblings, 0 replies; 36+ messages in thread From: Roberto Spadim @ 2012-07-27 13:42 UTC (permalink / raw) To: Adam Goryachev; +Cc: Linux RAID IMO the first idea was to put this only in md_raid1; the second idea was a new md device (maybe md_security or md_redundancy or md_conformity or some other beautiful name ...). In this case the device would do a checksum and report a 'badblock' (maybe the right word would be 'badchecksum'). That's the option I agree with, since we could do it on top of any device; it doesn't matter if it's a raid1 or raid4 or raidXYZ. Just to define the words: badchecksum -> we can read the data, but we know that it doesn't match the checksum (or the checksum doesn't match the data); badblock -> we can't read at all, because the 'physical block' is reported as bad. For mirror layers we could do more than just know that we have a badchecksum. In the case of all mirrors reporting a badchecksum, we could read the data anyway (ignoring the badchecksum information), vote for the value with the most repeats, and resync the data from this new 'primary information'. For example: /dev/md0 -> disks: /dev/sda /dev/sdb /dev/sdc; original data: block="ABCDEF", checksum=5; from /dev/sda: block="ABCDEH", checksum=5 (badchecksum); from /dev/sdb: block="ABCDEG", checksum=5 (badchecksum); from /dev/sdc: block="ABCDEG", checksum=5 (badchecksum). In this case we could elect "ABCDEG" (2 repeats) as the 'new data', recalculate the checksum and sync the data to all devices (note that we could also get 1 repeat on each device and then couldn't elect a new primary information source ...). Well, this idea could be both good and bad ... for the application level it's bad, since we would have committed a silent data corruption ourselves, but for a recovery tool it could be good, since we corrected the checksum ... maybe this could be a tool of the new device level (CHECK and REPAIR, like mdadm does today with echo "check" > /sys/block/md0/md/sync_action or echo "repair" > /sys/block/md0/md/sync_action). I don't like the idea of putting the 'recovery' inside md_raid1; I prefer a badblock per device (it doesn't matter whether it's a badblock or a badchecksum), without doing any 'silent recovery' of information at the raid level. For checksum correction or data correction, maybe leave this problem to an external tool; just as hard disks have badblocks tools, we could have a badblock tool too. Going back to our new device: a data corruption (silent or not) is a data corruption, and in either case (checksum corruption or data corruption) we have a bad device and we should report a badblock for that read operation. The best we could do when we have a badchecksum is to reread many times and recalculate the checksum; if the good matches exceed some X% (maybe 80%), we could send a write to the device (to ensure that the disk writes the good value back) and do one more read. If that matches on a single read, that's nice: we have done a 'silent' repair with data that is good with ~80% probability. This could be an option of the new device ("silent recover"). I think that's all the interesting things we could do =) Maybe in the future we could even do a relocation, like SSDs do: 
mark the badchecksum block as a badblock (inside a badblock list) and sync the data from the current badblock to a new, never-used block (we could allocate 1% of the device as never-used blocks). This could be good for data security, but the administrator should read the logs to make sure the system doesn't keep running with badblocks .... Those are the ideas for the 'new' security device level that I can imagine ... thanks guys :) 2012/7/27 Adam Goryachev <mailinglists@websitemanagers.com.au>: > On 24/07/12 07:31, Drew wrote: >> Been mulling this problem over and I keep getting hung up on one >> problem with ECC on a two disk RAID1 setup. >> >> In the event of silent corruption of one disk, which one is the good >> copy? >> >> It works fine if the ECC code is identical across both mirrors. Just >> checksum both chunks and discard the incorrect one. >> >> It also works fine if the ECC codes are corrupted but the data >> chunks are identical. Discard the bad checksum. >> >> What if the corruption goes across several sectors and both data & >> ECC chuncks are corrupted? Now you're back to square one. > > I know I'm a bit late to this discussion, and I know very little about > the code level/etc... however, I thought the whole point of the checksum > is to determine that the data + checksum do not match, therefore the > data is wrong and should be discarded. You would re-write the data and > checksum from another source (ie, the other drive in RAID1, or other > drives in RAID5/6 etc...). > > ie, it should be treated the same as a bad block / non-readable sector > (or lots of unreadable sectors....) > > Regards, > Adam > > > -- > Adam Goryachev > Website Managers > www.websitemanagers.com.au > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Roberto Spadim Spadim Technology / SPAEmpresarial -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
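Roberto's election step, under the same hypothetical checksum layer, boils down to a majority vote with ties reported upward rather than guessed. The sketch below is illustrative only, using his /dev/sda, /dev/sdb, /dev/sdc example; it is not proposed kernel code.

    # Majority vote among mirror copies that all failed their checksum.
    # Illustrative policy sketch; returns None when no safe winner exists.
    from collections import Counter

    def elect_block(copies, quorum=2):
        tally = Counter(copies)
        (block, votes), *rest = tally.most_common()
        runner_up = rest[0][1] if rest else 0
        if votes >= quorum and votes > runner_up:
            return block        # rewrite this block (and a fresh checksum) everywhere
        return None             # 1 repeat each, or a tie: report a badblock upward

    print(elect_block([b"ABCDEH", b"ABCDEG", b"ABCDEG"]))  # b'ABCDEG' wins with 2 votes
    print(elect_block([b"ABCDEH", b"ABCDEG", b"ABCDEF"]))  # None: no majority to trust

Whether such a repair should ever be automatic is exactly the disagreement in this sub-thread: it trades one detectable error for a small chance of quietly blessing the wrong data.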
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust 2012-07-23 18:24 ` Roberto Spadim 2012-07-23 21:31 ` Drew @ 2012-07-24 15:09 ` Jaromir Capik 1 sibling, 0 replies; 36+ messages in thread From: Jaromir Capik @ 2012-07-24 15:09 UTC (permalink / raw) To: Roberto Spadim; +Cc: stan, Linux RAID, Piergiorgio Sartor > yeah, i think this too, but IMO Jamiro exposed a specific scenario, > let´s get back to it and after check a generic scenario, > he is using small computers (i don´t know if it´s ARM or X86) with > space to only 2 disks (i told him to use raid5 or raid6 because the > checksums but he don´t have space for >=3 disks in computer case, > maybe if we could run raid5 with 2 disks could help... or 1 disk... > just a idiot idea, but this could help...) I believe, that Piergiorgio meant something else. It was about creation of a high number of small partitions on two physical drives and then build a RAID6 array on top of them. But that's really a bit overkill :] > i don´t know the real scenario, i think he will not use it in > 100partitions, maybe 4 or 5 partitions, and performance to be a > second > option, security is priority here > in the implementation of this new layer (maybe like LINEAR, MULTIPATH > or another not raid level) we could focus on security and after > performace > > just some ideas... > > 2012/7/23 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> > > > > Hi all, > > > > actually, what you would like to do is already > > possible, albeit it will kill the performance > > of a rotating, mechanical, HDD. > > With SSD might work better. > > > > If you take an HDD and partition it, let's say > > with 100 partitions (GPT will be required), > > then you can build a RAID-6 using this 100 > > partitions, having a redundancy of 2%. > > Taking two, or more, of such configured RAID-6, > > it will be possible to build (with them) a > > RAID-1 (or else). > > > > If a check of this RAID-1 returns mismatches, > > it will be possible to check the single devices > > and find out which is not OK. > > With RAID-6 (per device), and a bit of luck, it > > will be possible to fix it directly. > > > > Of course a lot of variables are tunable here. > > For example the number of partitions, the chunk > > size, or even the fact that with X partitions > > it could be possible to build more than one RAID-6, > > increasing the effective redundancy. > > > > All with the performance price I mentioned at the > > beginning. > > > > bye, > > > > -- > > > > piergiorgio > > -- > > To unsubscribe from this list: send the line "unsubscribe > > linux-raid" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > -- > Roberto Spadim > Spadim Technology / SPAEmpresarial > -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
[parent not found: <1897705147.341625.1342995720661.JavaMail.root@redhat.com>]
* Re: [RFE] Please, add optional RAID1 feature (= chunk checksums) to make it more robust [not found] <1897705147.341625.1342995720661.JavaMail.root@redhat.com> @ 2012-07-23 4:30 ` Stan Hoeppner 0 siblings, 0 replies; 36+ messages in thread From: Stan Hoeppner @ 2012-07-23 4:30 UTC (permalink / raw) To: Jaromir Capik, Linux RAID Same issue likely. On 7/22/2012 5:22 PM, Jaromir Capik wrote: >>> Likewise, if the checksumming you request is implemented in >>> md/RAID1, > > Btw. I like what Roberto proposed ... this could be a completely > independent layer having its own device file. MD RAID1 would > then be built on top of such safe device files. The only thing > to be implemented directly in the RAID1 would be the immediate > resyncing in case of a read error reported by such a > safety layer. And this immediate resyncing could/should be > optional ... > > Jaromir. > ^ permalink raw reply [flat|nested] 36+ messages in thread