* Filesystem corruption on RAID1
@ 2017-07-13 15:35 Gionatan Danti
  2017-07-13 16:48 ` Roman Mamedov
  0 siblings, 1 reply; 46+ messages in thread
From: Gionatan Danti @ 2017-07-13 15:35 UTC (permalink / raw)
  To: linux-raid; +Cc: g.danti

Hi list,
today I had an unexpected filesystem corruption on a RAID1 machine used
for backup purposes. I would like to reconstruct what possibly happened
and why, so I am asking for your help.

System specs:
- OS CentOS 7.2 x86_64 with kernel 3.10.0-514.6.1.el7.x86_64
- 2x SEAGATE ST4000VN000-1H4168 (4 TB 5900rpm disks)
- 4 GB DDR3 RAM
- Intel(R) Pentium(R) CPU G3260 @ 3.30GHz

Today, I found the machine crashed with an XFS warning about corrupted
metadata. The warning stated that in-core (i.e. in-memory) data corruption
was detected so, thinking of a DRAM-related problem (no ECC memory on
this small box...), I simply rebooted the machine. To no avail - the same
problem immediately happened, preventing the machine from booting (the
root filesystem did not mount).

After the filesystem was repaired (with significant signs of corruption,
partly due to the clearing of the XFS journal), I looked at dmesg and
found something interesting: a RAID resync action was *automatically*
performed, as when re-attaching a (detached) disk.

I started investigating /var/log/messages and found plenty of these
errors, spanning many days:

...
Jul 10 03:24:01 nas kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 10 14:50:54 nas kernel: ata1.00: failed command: FLUSH CACHE EXT
Jul 12 03:14:41 nas kernel: ata1.00: failed command: WRITE FPDMA QUEUED
...

To me, it seems that a disk (the first one, sda) had problems executing
some SATA commands, falling out of sync with the second one (sdb).
However, it was not kicked out of the array, as both /var/log/messages
*and* my custom monitoring script (which keeps an eye on /proc/mdstat)
reported nothing. Moreover, inspecting both the SMART values and the
SMART log shows *no* errors at all.
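
For reference, a minimal /proc/mdstat watcher along these lines (an
illustrative sketch, not the actual script in use) is what was running:

  #!/bin/sh
  # Minimal /proc/mdstat watcher (illustrative sketch only).
  # It only alerts on a degraded array (an "_" in the [UU] status field),
  # which is exactly why it stayed silent here: the disk was never kicked out.
  if grep -q '_' /proc/mdstat; then
      echo "RAID degraded on $(hostname):"
      cat /proc/mdstat
  fi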

Question 1: is it possible to have such a situation, where a failed
command *silently* puts the array in an out-of-sync state?

At a certain point, the machine crashed. I noticed and rebooted it.

Question 2: is it possible that the out-of-sync disk went offline just
before the crash and, on reboot, mdadm re-added it to the array?

Question 3: if so, is it possible that the corruption was due to the
first disk being the one read by the MD array and, by extension, by the
filesystem?

Any thoughts will be greatly appreciated.
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-07-13 15:35 Filesystem corruption on RAID1 Gionatan Danti
@ 2017-07-13 16:48 ` Roman Mamedov
  2017-07-13 21:28   ` Gionatan Danti
  0 siblings, 1 reply; 46+ messages in thread
From: Roman Mamedov @ 2017-07-13 16:48 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-raid

On Thu, 13 Jul 2017 17:35:12 +0200
Gionatan Danti <g.danti@assyoma.it> wrote:

> Jul 10 03:24:01 nas kernel: ata1.00: failed command: READ FPDMA QUEUED

Failed reads are not as bad, as they are just retried.

> Jul 12 03:14:41 nas kernel: ata1.00: failed command: WRITE FPDMA QUEUED

But these WILL cause incorrect data written to disk, in my experience. After
that, one of your disks will contain some corruption, whether in files, or (as
you discovered) in the filesystem itself. mdadm may or may not read from that
disk, as it chooses the mirror for reads pretty much randomly, using the least
loaded one. And even though the other disk still contains good data, there is
no mechanism for the user-space to say "hey, this doesn't look right, what's
on the other mirror?"

Check your cables and/or disks themselves.

If you know that only one disk had these write errors all the time, you could
try disconnecting it from the mirror and checking whether you can get a more
consistent view of the filesystem on the remaining one.
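
Something along these lines, assuming the array is /dev/md0 and the
suspect member is /dev/sda1 (names are only an example), lets you inspect
the surviving copy while the array runs degraded:

  mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1   # drop the suspect disk
  mdadm --detail /dev/md0                              # confirm the array now runs degraded
  xfs_repair -n /dev/md0                               # no-modify check of the surviving copy (unmounted)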

P.S: about my case (which I witnessed on a RAID6):

  * copy a file to the array; one disk will hit tons of WRITE FPDMA QUEUED
    errors (due to insufficient power and/or a bad data cable).
  * the file that was just copied turns out to be corrupted when read back.
  * the problem disk WILL NOT get kicked from the array during this.

-- 
With respect,
Roman


* Re: Filesystem corruption on RAID1
  2017-07-13 16:48 ` Roman Mamedov
@ 2017-07-13 21:28   ` Gionatan Danti
  2017-07-13 21:34     ` Reindl Harald
  0 siblings, 1 reply; 46+ messages in thread
From: Gionatan Danti @ 2017-07-13 21:28 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid, g.danti

Il 13-07-2017 18:48 Roman Mamedov ha scritto:
> 
> Failed reads are not as bad, as they are just retried.
> 

I agree, I reported them only to give a broad picture of the system 
state :)

>> Jul 12 03:14:41 nas kernel: ata1.00: failed command: WRITE FPDMA 
>> QUEUED
> 
> But these WILL cause incorrect data written to disk, in my experience. 
> After
> that, one of your disks will contain some corruption, whether in files, 
> or (as
> you discovered) in the filesystem itself.

This is the "scary" part: if the write was not acknowledged as committed 
to disk, why the block layer did not report it to the MD driver? Or if 
the block layer reported that, why MD did not kick the disk out of the 
array?

> mdadm may or may not read from that
> disk, as it chooses the mirror for reads pretty much randomly, using 
> the least
> loaded one. And even though the other disk still contains good data, 
> there is
> no mechanism for the user-space to say "hey, this doesn't look right, 
> what's
> on the other mirror?"

I understand and agree with that. I'm fully aware that MD cannot (by
design) detect/correct corrupted data. However, I wonder why a disk with
obvious errors was not kicked out of the array.

> 
> Check your cables and/or disks themselves.
> 

I tried reseating and swapping the cables ;)
Let's see if the problem disappears or if it "follows" the
cable/drive/interface...

> If you know that only one disk had these write errors all the time, you 
> could
> try disconnecting it from mirror, and checking if you can get a more
> consistent view of the filesystem on the remaining one.
> 
> P.S: about my case (which I witnessed on a RAID6):
> 
>   * copy a file to the array, one disk will hit tons of WRITE FPDMA 
> QUEUED
>     errors (due to insufficient power and/or bad data cable).
>   * the file that was just copied, turns out to be corrupted when 
> reading back.
>   * the problem disk WILL NOT get kicked from the array during this.

Wow, a die-hard data corruption. It seems VERY similar to what happened 
to me, and the key problem seems the same: a failing drive was not 
detached from the array in a timely fashion.

Thanks very much for reporting, Roman.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-07-13 21:28   ` Gionatan Danti
@ 2017-07-13 21:34     ` Reindl Harald
  2017-07-13 22:34       ` Gionatan Danti
  0 siblings, 1 reply; 46+ messages in thread
From: Reindl Harald @ 2017-07-13 21:34 UTC (permalink / raw)
  To: Gionatan Danti, Roman Mamedov; +Cc: linux-raid



Am 13.07.2017 um 23:28 schrieb Gionatan Danti:
> I understand and agree with that. I'm fully aware that MD can not (by 
> design) detect/correct corrupted data. However, I wonder if, and why, a 
> disk with obvious errors was not kicked out of the array.
maybe because the disk is, well, not in good shape and doesn't know that
by itself - i had storage devices (flash media) which refused to write but
said nothing; frankly, you were able to format that crap and overwrite it
with zeros and all looked fine - until you pulled the broken device and
inserted it again - same data as yesterday - an sd-card once physically
destroyed a smartphone by draining the whole battery within 30 minutes
while sitting in the cinema

broken hardware doesn't know that it's broken most of the time

that's why you always need backups, or you might as well just delete the
data because it is not important

things like the above could only be detected by verifying every write with
an uncached read/verify, which would lead to an unacceptable performance
penalty (and no, filesystems with checksums won't magically recover your
data, they just tell you earlier that it is gone)


* Re: Filesystem corruption on RAID1
  2017-07-13 21:34     ` Reindl Harald
@ 2017-07-13 22:34       ` Gionatan Danti
  2017-07-14  0:32         ` Reindl Harald
  2017-07-14  1:48         ` Chris Murphy
  0 siblings, 2 replies; 46+ messages in thread
From: Gionatan Danti @ 2017-07-13 22:34 UTC (permalink / raw)
  To: Reindl Harald; +Cc: Roman Mamedov, linux-raid, g.danti

Il 13-07-2017 23:34 Reindl Harald ha scritto:
> maybe because the disk is, well, not in a good shape and don't know
> that by itself
> 

But the kernel *does* know that, as the dmesg entries clearly show.
Basically, some SATA commands timed out and/or were aborted. As the
kernel reported these errors in dmesg, why not use this information
to stop a failing disk?

> 
> (and no filesystems with checksums won't magically recover
> your data, they just tell you realier they are gone)
> 

Checksummed filesystems that integrate their own block-level management
(read: ZFS or BTRFS) can recover the missing/corrupted data from the
healthy disks, discarding corrupted data based on the checksum mismatch.

Anyway, this has nothing to do with Linux software RAID. I was only
"thinking out loud" :)
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-07-13 22:34       ` Gionatan Danti
@ 2017-07-14  0:32         ` Reindl Harald
  2017-07-14  0:52           ` Anthony Youngman
  2017-07-14 10:46           ` Gionatan Danti
  2017-07-14  1:48         ` Chris Murphy
  1 sibling, 2 replies; 46+ messages in thread
From: Reindl Harald @ 2017-07-14  0:32 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Roman Mamedov, linux-raid



Am 14.07.2017 um 00:34 schrieb Gionatan Danti:
> Il 13-07-2017 23:34 Reindl Harald ha scritto:
>> maybe because the disk is, well, not in a good shape and don't know
>> that by itself
> 
> But the kernel *does* know that, as the dmesg entries clearly show. 
> Basically, some SATA commands timed-out and/or were aborted. As the 
> kernel reported these erros in dmesg, why do not use these information 
> to stop a failing disk?

because you won't be that happy when the kernel spits out a disk each
time a random SATA command times out - the 4 RAID10 disks on my
workstation are from 2011 and have shown such timeouts several times in
the past, while being just fine

here you go:
http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/
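
The mitigation that article describes boils down to something like this
(example device name):

  smartctl -l scterc /dev/sda           # check whether SCT ERC is supported/enabled
  smartctl -l scterc,70,70 /dev/sda     # cap error recovery at 7 s so the drive reports
                                        # failures instead of retrying past the kernel timeout
  echo 180 > /sys/block/sda/device/timeout   # for drives without SCT ERC, raise the
                                             # kernel-side SCSI command timeout instead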


* Re: Filesystem corruption on RAID1
  2017-07-14  0:32         ` Reindl Harald
@ 2017-07-14  0:52           ` Anthony Youngman
  2017-07-14  1:10             ` Reindl Harald
  2017-07-14 10:46           ` Gionatan Danti
  1 sibling, 1 reply; 46+ messages in thread
From: Anthony Youngman @ 2017-07-14  0:52 UTC (permalink / raw)
  To: Reindl Harald, Gionatan Danti; +Cc: Roman Mamedov, linux-raid



On 14/07/17 01:32, Reindl Harald wrote:
> 
> 
> Am 14.07.2017 um 00:34 schrieb Gionatan Danti:
>> Il 13-07-2017 23:34 Reindl Harald ha scritto:
>>> maybe because the disk is, well, not in a good shape and don't know
>>> that by itself
>>
>> But the kernel *does* know that, as the dmesg entries clearly show. 
>> Basically, some SATA commands timed-out and/or were aborted. As the 
>> kernel reported these erros in dmesg, why do not use these information 
>> to stop a failing disk?
> 
> because you won't be that happy when the kernel spits out a disk each 
> time a random SATA command times out - the 4 RAID10 disks on my 
> workstation are from 2011 and showed them too several times in the past 
> while they are just fine
> 
Except, in the context of this thread, the alternative is CORRUPTED 
DATA. I certainly know which one I would prefer, and that is a crashed 
array!

If a *write* fails, then a failed array may well be the least of the 
user's problems - and silent failure merely makes matters worse!

I know, the problem is that linux isn't actually that good at 
propagating errors back to user space, and I believe that's a fault of 
POSIX. So fixing the problem might be a massive job - indeed I think it is.

But that's no excuse for mocking someone just because they want to be 
told that the system has just gone and lost their work for them ...

Oh - and isn't that what raid is *supposed* to do? Kick a disk on a 
write failure?

Cheers,
Wol


* Re: Filesystem corruption on RAID1
  2017-07-14  0:52           ` Anthony Youngman
@ 2017-07-14  1:10             ` Reindl Harald
  0 siblings, 0 replies; 46+ messages in thread
From: Reindl Harald @ 2017-07-14  1:10 UTC (permalink / raw)
  To: Anthony Youngman, Gionatan Danti; +Cc: Roman Mamedov, linux-raid



Am 14.07.2017 um 02:52 schrieb Anthony Youngman:
> On 14/07/17 01:32, Reindl Harald wrote:
>>
>>
>> Am 14.07.2017 um 00:34 schrieb Gionatan Danti:
>>> Il 13-07-2017 23:34 Reindl Harald ha scritto:
>>>> maybe because the disk is, well, not in a good shape and don't know
>>>> that by itself
>>>
>>> But the kernel *does* know that, as the dmesg entries clearly show. 
>>> Basically, some SATA commands timed-out and/or were aborted. As the 
>>> kernel reported these erros in dmesg, why do not use these 
>>> information to stop a failing disk?
>>
>> because you won't be that happy when the kernel spits out a disk each 
>> time a random SATA command times out - the 4 RAID10 disks on my 
>> workstation are from 2011 and showed them too several times in the 
>> past while they are just fine
>>
> Except, in the context of this thread, the alternative is CORRUPTED 
> DATA. I certainly know which one I would prefer, and that is a crashed 
> array!
> 
> If a *write* fails, then a failed array may well be the least of the 
> user's problems - and silent failure merely makes matters worse!

i doubt you would repeat that if, under whatever load condition, a
random SATA timeout occurred on both disks of a mirror and you lost some
TB of data, while in *that case* nothing bad - no silent corruption or
anything else - would have happened except a short lag

did you really read
http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/
or did you just ignore it on purpose?

> I know, the problem is that linux isn't actually that good at 
> propagating errors back to user space, and I believe that's a fault of 
> POSIX. So fixing the problem might be a massive job - indeed I think it is.
> 
> But that's no excuse for mocking someone just because they want to be 
> told that the system has just gone and lost their work for them ...

nobody is mocking anyone, i just explained why things are not as simple
as they appear, and with a 2-disk mirror any error case is always
complicated by the lack of a quorum

> Oh - and isn't that what raid is *supposed* to do? Kick a disk on a 
> write failure?

if only things were that easy in the real world...

when in doubt, with a mirrored RAID without data checksums *you have no
way* to guarantee which is the correct data if something flips; "except,
in the context of this thread" is nice but won't help in general, and
trying to handle each and every border case with some workaround would
lead nowhere

yes, agreed, silent corruption is bad, hardware lying about written data
is bad, but if things were that easy all of this would not happen and
nobody would have spent time developing checksummed filesystems




* Re: Filesystem corruption on RAID1
  2017-07-13 22:34       ` Gionatan Danti
  2017-07-14  0:32         ` Reindl Harald
@ 2017-07-14  1:48         ` Chris Murphy
  2017-07-14  7:22           ` Roman Mamedov
  1 sibling, 1 reply; 46+ messages in thread
From: Chris Murphy @ 2017-07-14  1:48 UTC (permalink / raw)
  To: Linux-RAID

On Thu, Jul 13, 2017 at 4:34 PM, Gionatan Danti <g.danti@assyoma.it> wrote:
> Il 13-07-2017 23:34 Reindl Harald ha scritto:
>>
>> maybe because the disk is, well, not in a good shape and don't know
>> that by itself
>>
>
> But the kernel *does* know that, as the dmesg entries clearly show.
> Basically, some SATA commands timed-out and/or were aborted. As the kernel
> reported these erros in dmesg, why do not use these information to stop a
> failing disk?
>
>>
>> (and no filesystems with checksums won't magically recover
>> your data, they just tell you realier they are gone)
>>
>
> Checksummed filesystem that integrates their block-level management (read:
> ZFS or BTRFS) can recover the missing/corrupted data by the healthy disks,
> discarging corrupted data based on the checksum mismatch.
>
> Anyway, this has nothing to do with linux software RAID. I was only
> "thinking loud" :)
> Thanks.
>
>


Dealing with device betrayal at a hardware level is a difficult
problem. I'm under the impression the md driver is very intolerant of
write failures and would eject a drive even after a single failed write?
A failed write would seem to be disqualifying for a RAID member.

Btrfs still tolerates many errors, read and write, so it can still be
a problem there too. But yes it does have an independent way to
unambiguously determine whether file system metadata, or extent data,
is corrupt. It also often keeps two copies of metadata (the file
system itself). Another option (read-only) is dm-verity, but that is
not RAID, it uses forward error correction and cryptographic hash
verification.



-- 
Chris Murphy


* Re: Filesystem corruption on RAID1
  2017-07-14  1:48         ` Chris Murphy
@ 2017-07-14  7:22           ` Roman Mamedov
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Mamedov @ 2017-07-14  7:22 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux-RAID

On Thu, 13 Jul 2017 19:48:29 -0600
Chris Murphy <lists@colorremedies.com> wrote:

> Btrfs still tolerates many errors, read and write

Actually it was Btrfs which saved me back then. Btrfs was keeping two copies
of metadata blocks and restored corrupted copies from good ones, and it also
signaled me that user files were affected as well (via data checksums).

FS checksums do work, and if you have redundancy for the corrupted part (such
as metadata DUP by default, data DUP (unusual), or data RAID1), they allow the
FS to survive corruption, including hardware-caused corruption.
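
For anyone wanting to exercise this on their own Btrfs volume, a scrub
does both the verification and the repair (the mount point below is just
an example):

  btrfs scrub start -B /mnt/array     # verify checksums, rebuild bad copies from the good mirror
  btrfs device stats /mnt/array       # per-device corruption/read/write error counters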

-- 
With respect,
Roman


* Re: Filesystem corruption on RAID1
  2017-07-14  0:32         ` Reindl Harald
  2017-07-14  0:52           ` Anthony Youngman
@ 2017-07-14 10:46           ` Gionatan Danti
  2017-07-14 10:58             ` Reindl Harald
  2017-08-17  8:23             ` Gionatan Danti
  1 sibling, 2 replies; 46+ messages in thread
From: Gionatan Danti @ 2017-07-14 10:46 UTC (permalink / raw)
  To: Reindl Harald; +Cc: Roman Mamedov, linux-raid, g.danti

Il 14-07-2017 02:32 Reindl Harald ha scritto:
> because you won't be that happy when the kernel spits out a disk each
> time a random SATA command times out - the 4 RAID10 disks on my
> workstation are from 2011 and showed them too several times in the
> past while they are just fine
> 
> here you go:
> http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/

Hi, so a premature/preventive drive detachment is not a silver bullet,
and I buy that. However, I would at least expect this behavior to be
configurable. Maybe it is and I am missing something?

Anyway, what really surprises me is *not* that the drive was not
detached, but rather that corruption was allowed to make its way into
real data. I naively expect that when a WRITE_QUEUED or CACHE_FLUSH
command aborts/fails (which *will* cause data corruption if not properly
handled) the I/O layer has the following possibilities:

a) retry the write/flush. You don't want to retry indefinitely, so the
kernel needs some type of counter/threshold; when the threshold is
reached, continue with b). This would mask out sporadic errors, while
propagating recurring ones;

b) notify the upper layer that a write error happened. For synchronized
and direct writes it can do so by simply returning the correct exit code
to the calling function. In this case, the block layer should return an
error to the MD driver, which must act accordingly: for example, by
dropping the disk from the array.

c) do nothing. This seems to me by far the worst choice.

If b) is correctly implemented, it should prevent corruption from
accumulating on the drives.
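
As a trivial user-space check of case b) (the path below is only an
example), one can at least verify that a failed write/flush is visible to
the calling process:

  # write 64 MiB and fsync() at the end; a failed write or flush should
  # make dd exit non-zero if the error is propagated up the stack
  dd if=/dev/urandom of=/mnt/raid1/testfile bs=1M count=64 conv=fsync
  echo "dd exit status: $?"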

Please also note the *type* of corrupted data: not only user data, but
filesystem journal and metadata as well. The latter should be protected
by the use of write barriers / FUAs, so the filesystem should be able to
stop itself *before* corruption.

So I have some very important questions:
- how does MD behave when flushing data to disk?
- does it propagate write barriers?
- when a write barrier fails, is the error propagated to the upper 
layers?

Thank you all.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-07-14 10:46           ` Gionatan Danti
@ 2017-07-14 10:58             ` Reindl Harald
  2017-08-17  8:23             ` Gionatan Danti
  1 sibling, 0 replies; 46+ messages in thread
From: Reindl Harald @ 2017-07-14 10:58 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Roman Mamedov, linux-raid



Am 14.07.2017 um 12:46 schrieb Gionatan Danti:
> Il 14-07-2017 02:32 Reindl Harald ha scritto:
>> because you won't be that happy when the kernel spits out a disk each
>> time a random SATA command times out - the 4 RAID10 disks on my
>> workstation are from 2011 and showed them too several times in the
>> past while they are just fine
>>
>> here you go:
>> http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/ 
>>
> 
> Hi, so a premature/preventive drive detachment is not a silver bullet, 
> and I buy it. However, I would at least expect this behavior to be 
> configurable. Maybe it is, and I am missing something?

dunno, maybe it is, but it wouldn't be wise, because in the case of a
RAID5 a rebuild after a disk failure would become even more dangerous on
a large array than it already is

> Anyway, what really surprise me is *not* the drive to not be detached, 
> rather permitting that corruption make its way into real data. I naively 
> expect that when a WRITE_QUEUED or CACHE_FLUSH command aborts/fails 
> (which *will* cause data corruption if not properly handled) the I/O 
> layer has the following possibilities:

given that i have seen at least SD-cards confirming successful writes for
hours with not a single error in the syslog, maybe it was one of the rare
cases where the hardware lied; if that is the case, you have nearly no
chance at the software layer except verifying each write with an uncached
read of the block, which would have an unacceptable impact on performance


* Re: Filesystem corruption on RAID1
  2017-07-14 10:46           ` Gionatan Danti
  2017-07-14 10:58             ` Reindl Harald
@ 2017-08-17  8:23             ` Gionatan Danti
  2017-08-17 12:41               ` Roger Heflin
  1 sibling, 1 reply; 46+ messages in thread
From: Gionatan Danti @ 2017-08-17  8:23 UTC (permalink / raw)
  To: Reindl Harald; +Cc: Roman Mamedov, linux-raid, Gionatan Danti

On 14/07/2017 12:46, Gionatan Danti wrote:
> Hi, so a premature/preventive drive detachment is not a silver bullet,
> and I buy it. However, I would at least expect this behavior to be 
> configurable. Maybe it is, and I am missing something?
> 
> Anyway, what really surprise me is *not* the drive to not be detached, 
> rather permitting that corruption make its way into real data. I naively 
> expect that when a WRITE_QUEUED or CACHE_FLUSH command aborts/fails 
> (which *will* cause data corruption if not properly handled) the I/O 
> layer has the following possibilities:
> 
> a) retry the write/flush. You don't want to retry indefinitely, so the 
> kernel need some type of counter/threshold; when the counter is reached, 
> continue with b). This would mask out sporadic errors, while propagating 
> recurring ones;
> 
> b) notify the upper layer that a write error happened. For synchronized 
> and direct writes it can notify that by simply returning the correct 
> exit code to the calling function. In this case, the block layer should 
> return an error to the MD driver, which must act accordlying: for 
> example, dropping the disk from the array.
> 
> c) do nothing. This seems to me by far the worst choice.
> 
> If b) is correcly implemented, it should prevent corruption to 
> accumulate on the drives.
> 
> Please also note the *type* of corrupted data: not only user data, but 
> filesystem journal and metadata also. The latter should be protected by 
> the using of write barriers / FUAs, so they should be able to stop 
> themselves *before* corruption.
> 
> So I have some very important questions:
> - how does MD behave when flushing data to disk?
> - does it propagate write barriers?
> - when a write barrier fails, is the error propagated to the upper layers?
> 
> Thanks you all.
>

Hi all,
having some free time, I conducted some new tests and I am now able to
replicate the problem 100% of the time. To recap: a filesystem on a RAID1
array was corrupted due to SATA WRITEs failing but *no* I/O error being
reported to the higher layers (ie: mdraid/mdadm). I already submitted my
findings to the linux-scsi mailing list, but I want to share them here
because they can be useful to others.

On the affected machine, /var/log/messages showed some "failed command:
WRITE FPDMA QUEUED" entries, but *no* action (ie: kicking the disk out of
the array) was taken by MDRAID. I tracked down the problem to an unstable
power supply (switching power rail/connector solved the problem).

In the last few days I had some spare time and I am now able to regularly
replicate the problem. Basically, when a short power loss happens, the
SCSI midlayer logs some failed operations, but does *not* pass these
errors to the higher layers. In other words, no I/O error is returned to
the calling application. This is why MDRAID did not kick the unstable
disk out of the array on the machine with the corrupted filesystem.

To replicate the problem, I wrote a large random file to a small MD RAID1
array, pulling the power of one disk for about 2 seconds. The file write
operation stalled for some seconds, then recovered. Running an array
check resulted in a high number of mismatch_cnt sectors. Dmesg logged the
following lines:

Aug 16 16:04:02 blackhole kernel: ata6.00: exception Emask 0x50 SAct 
0x7fffffff SErr 0x90a00 action 0xe frozen
Aug 16 16:04:02 blackhole kernel: ata6.00: irq_stat 0x00400000, PHY RDY 
changed
Aug 16 16:04:02 blackhole kernel: ata6: SError: { Persist HostInt 
PHYRdyChg 10B8B }
Aug 16 16:04:02 blackhole kernel: ata6.00: failed command: WRITE FPDMA 
QUEUED
Aug 16 16:04:02 blackhole kernel: ata6.00: cmd 
61/00:00:10:82:09/04:00:00:00:00/40 tag 0 ncq 524288 out#012         res 
40/00:d8:10:72:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Aug 16 16:04:02 blackhole kernel: ata6.00: status: { DRDY }
...
Aug 16 16:04:02 blackhole kernel: ata6.00: failed command: WRITE FPDMA 
QUEUED
Aug 16 16:04:02 blackhole kernel: ata6.00: cmd 
61/00:f0:10:7e:09/04:00:00:00:00/40 tag 30 ncq 524288 out#012 
res 40/00:d8:10:72:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Aug 16 16:04:02 blackhole kernel: ata6.00: status: { DRDY }
Aug 16 16:04:02 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:03 blackhole kernel: ata6: SATA link down (SStatus 0 
SControl 310)
Aug 16 16:04:04 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:14 blackhole kernel: ata6: softreset failed (device not ready)
Aug 16 16:04:14 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:24 blackhole kernel: ata6: softreset failed (device not ready)
Aug 16 16:04:24 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:35 blackhole kernel: ata6: link is slow to respond, please 
be patient (ready=0)
Aug 16 16:04:42 blackhole kernel: ata6: SATA link down (SStatus 0 
SControl 310)
Aug 16 16:04:46 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:46 blackhole kernel: ata3: exception Emask 0x10 SAct 0x0 
SErr 0x40d0202 action 0xe frozen
Aug 16 16:04:46 blackhole kernel: ata3: irq_stat 0x00400000, PHY RDY changed
Aug 16 16:04:46 blackhole kernel: ata3: SError: { RecovComm Persist 
PHYRdyChg CommWake 10B8B DevExch }
Aug 16 16:04:46 blackhole kernel: ata3: hard resetting link
Aug 16 16:04:51 blackhole kernel: ata3: softreset failed (device not ready)
Aug 16 16:04:51 blackhole kernel: ata3: applying PMP SRST workaround and 
retrying
Aug 16 16:04:51 blackhole kernel: ata3: SATA link up 3.0 Gbps (SStatus 
123 SControl 300)
Aug 16 16:04:51 blackhole kernel: ata3.00: configured for UDMA/133
Aug 16 16:04:51 blackhole kernel: ata3: EH complete
Aug 16 16:04:52 blackhole kernel: ata6: softreset failed (device not ready)
Aug 16 16:04:52 blackhole kernel: ata6: applying PMP SRST workaround and 
retrying
Aug 16 16:04:52 blackhole kernel: ata6: SATA link up 1.5 Gbps (SStatus 
113 SControl 310)
Aug 16 16:04:52 blackhole kernel: ata6.00: configured for UDMA/133
Aug 16 16:04:52 blackhole kernel: ata6: EH complete

As you can see, while the failed SATA operations were logged in dmesg
(and /var/log/messages), no I/O errors were returned to the upper layer
(MDRAID) or the calling application. I have to say that I *fully expect*
some inconsistencies: after all, removing the power wipes the disk's
volatile DRAM cache, which means data loss. However, I really expected
some I/O errors to be thrown to the higher layers, causing visible
reactions (ie: a disk pushed out of the array). With no I/O errors
returned, the higher layer applications are effectively blind.
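
For reference, such an array check can be run through the md sysfs
interface (the array name below is only an example):

  echo check > /sys/block/md0/md/sync_action   # read both mirrors and compare
  cat /sys/block/md0/md/sync_action            # wait until this reports "idle" again
  cat /sys/block/md0/md/mismatch_cnt           # sectors found to differ between mirrors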

Moreover, both disks show the *same RAID event count*, so the MDRAID
layer cannot automatically offline, or avoid reading from, the corrupted
disk. Here is the relevant output:

[root@blackhole storage]# mdadm -E /dev/sd[bc]1
/dev/sdb1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 8d06acd3:95920a78:069a7fc3:a526ca8a
            Name : blackhole.assyoma.it:200  (local to host 
blackhole.assyoma.it)
   Creation Time : Wed Aug 16 15:11:14 2017
      Raid Level : raid1
    Raid Devices : 2

  Avail Dev Size : 2095104 (1023.00 MiB 1072.69 MB)
      Array Size : 1047552 (1023.00 MiB 1072.69 MB)
     Data Offset : 2048 sectors
    Super Offset : 8 sectors
    Unused Space : before=1960 sectors, after=0 sectors
           State : clean
     Device UUID : 97bfbe06:89016508:2cb250c9:937a5c2e

     Update Time : Thu Aug 17 10:09:28 2017
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : 52670329 - correct
          Events : 759


    Device Role : Active device 0
    Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 8d06acd3:95920a78:069a7fc3:a526ca8a
            Name : blackhole.assyoma.it:200  (local to host 
blackhole.assyoma.it)
   Creation Time : Wed Aug 16 15:11:14 2017
      Raid Level : raid1
    Raid Devices : 2

  Avail Dev Size : 2095104 (1023.00 MiB 1072.69 MB)
      Array Size : 1047552 (1023.00 MiB 1072.69 MB)
     Data Offset : 2048 sectors
    Super Offset : 8 sectors
    Unused Space : before=1960 sectors, after=0 sectors
           State : clean
     Device UUID : bf660182:701430fd:55f5fde9:6ded709e

     Update Time : Thu Aug 17 10:09:28 2017
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : 5336733f - correct
          Events : 759


    Device Role : Active device 1
    Array State : AA ('A' == active, '.' == missing, 'R' == replacing)

More concerning is the fact that these undetected errors can make their
way in even when the higher-level application consistently calls sync()
and/or fsync(). In other words, it seems that even acknowledged writes
can fail in this manner (and this is consistent with the first machine
corrupting its filesystem through journal trashing - the XFS journal
surely issues cache flushes where appropriate). The mechanism seems to be
the following:

- a higher layer application issues sync();
- a write barrier is generated;
- a first FLUSH CACHE command is sent to the disk;
- data are written to the disk's DRAM cache;
- power is lost! The volatile cache loses its content;
- power is re-established and the disk becomes responsive again;
- a second FLUSH CACHE command is sent to the disk;
- the disk acks each SATA command, but the real data are lost.

As a side note, when the power loss or SATA cable disconnection is
relatively long (over 10 seconds, as per the EH timeout), the SATA disk
becomes disconnected (and the MD layer acts accordingly):

Aug 16 16:12:20 blackhole kernel: ata6.00: exception Emask 0x50 SAct 
0x7fffffff SErr 0x490a00 action 0xe frozen
Aug 16 16:12:20 blackhole kernel: ata6.00: irq_stat 0x08000000, 
interface fatal error
Aug 16 16:12:20 blackhole kernel: ata6: SError: { Persist HostInt 
PHYRdyChg 10B8B Handshk }
Aug 16 16:12:20 blackhole kernel: ata6.00: failed command: WRITE FPDMA 
QUEUED
Aug 16 16:12:20 blackhole kernel: ata6.00: cmd 
61/00:00:38:88:09/04:00:00:00:00/40 tag 0 ncq 524288 out#012         res 
40/00:d8:38:f4:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Aug 16 16:12:20 blackhole kernel: ata6.00: status: { DRDY }
...
Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] FAILED Result: 
hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] Sense Key : Illegal 
Request [current] [descriptor]
Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] Add. Sense: 
Unaligned write command
Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] CDB: Write(10) 2a 00 
00 09 88 38 00 04 00 00
Aug 16 16:12:32 blackhole kernel: blk_update_request: 23 callbacks 
suppressed
Aug 16 16:12:32 blackhole kernel: blk_update_request: I/O error, dev 
sdf, sector 624696

Now, I have a few questions:
- is the above explanation plausible, or am I (horribly) missing something?
- why does the SCSI midlayer not respond to a power loss event by
immediately offlining the disk?
- is the SCSI midlayer behavior configurable (I know I can lower the EH
timeout, but is this the right solution)?
- how should one deal with this problem (other than being 100% sure power
is never lost by any disk)?

Thank you all,
regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-08-17  8:23             ` Gionatan Danti
@ 2017-08-17 12:41               ` Roger Heflin
  2017-08-17 14:31                 ` Gionatan Danti
  0 siblings, 1 reply; 46+ messages in thread
From: Roger Heflin @ 2017-08-17 12:41 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Reindl Harald, Roman Mamedov, Linux RAID

On Thu, Aug 17, 2017 at 3:23 AM, Gionatan Danti <g.danti@assyoma.it> wrote:
> On 14/07/2017 12:46, Gionatan Danti wrote:
>> Hi, so a premature/preventive drive detachment is not a silver bullet,

> but is this the right solution)?
> - how to deal with this problem (other than being 100% sure power is never
> lost by any disks)?
>
> Thank you all,
> regards.
>

Here is a guess based on what you determined was the cause.

The mid-layer does not know the writes were lost. The writes were in the
drive's write cache (already submitted to the drive and confirmed back to
the mid-layer as done, even though they were not yet on the platter), and
when the drive lost power and "rebooted" those writes disappeared; the
write(s) the mid-layer had in progress, which never got a completion from
the drive, were retried and succeeded after the drive reset was completed.

In high-reliability RAID the solution is to turn off that write cache,
*but* if you do direct I/O writes (most databases) with the drive's write
cache off and no battery-backed cache between the two, then the drive
becomes horribly slow, since it must actually write the data to the
platter before telling the next level up that the data is safe.
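
On SATA drives this is typically done with hdparm (example device name;
the setting is not persistent across power cycles on every drive, so it
is usually reapplied at boot via a udev rule or an init script):

  hdparm -W /dev/sda      # query the current volatile write cache setting
  hdparm -W 0 /dev/sda    # disable the drive's volatile write cache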


* Re: Filesystem corruption on RAID1
  2017-08-17 12:41               ` Roger Heflin
@ 2017-08-17 14:31                 ` Gionatan Danti
  2017-08-17 17:33                   ` Wols Lists
  0 siblings, 1 reply; 46+ messages in thread
From: Gionatan Danti @ 2017-08-17 14:31 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Reindl Harald, Roman Mamedov, Linux RAID

Il 17-08-2017 14:41 Roger Heflin ha scritto:
> 
> Here is a guess based on what you determined was the cause.
> 
> The mid-layer does not know the writes were lost.   The writes were in
> the drives write cache (already submitted to the drive and confirmed
> back to the mid-layer as done, even though they were not yet on the
> platter), and when the driver lost power and "rebooted" those writes
> disappeared, the write(s) the mid-layer had in progress and that never
> got a done from the drive failed were retried and succeeded after the
> driver reset was completed.
> 
> In high reliability raid the solution is to turn off that write cache,
> *but* if you do direct io writes (most databases) with the drives
> write cache off and no battery backed up cache between the 2 then the
> drive becomes horribly slow since it must actually write the data to
> the platter before telling the next level up that the data was safe.

Sure, disabling caching should at least greatly reduce the problem (torn
writes remain a problem, but they are inevitable).

However, the entire idea of barriers/cache flushes/FUAs was to *safely
enable* unprotected write caches, even in the face of power loss. Indeed,
for a full-system power loss they are adequate. However, device-level
micro power losses seem to pose a bigger threat to data reliability.

I suspect that the recurrent "my RAID1 array develops a huge amount of
mismatch_cnt sectors" question, which is often dismissed with "don't
worry about RAID1 mismatches", really has a strong tie to this specific
problem.

I suggest that anyone reading this list also read the current thread on
the linux-scsi list - it is very interesting.
Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-08-17 14:31                 ` Gionatan Danti
@ 2017-08-17 17:33                   ` Wols Lists
  2017-08-17 20:50                     ` Gionatan Danti
  0 siblings, 1 reply; 46+ messages in thread
From: Wols Lists @ 2017-08-17 17:33 UTC (permalink / raw)
  To: Gionatan Danti, Roger Heflin; +Cc: Reindl Harald, Roman Mamedov, Linux RAID

On 17/08/17 15:31, Gionatan Danti wrote:
> However, the entire idea of barriers/cache flushes/FUAs was to *safely
> enable* unprotected write caches, even in the face of powerloss. Indeed,
> for full-system powerloss their are adequate. However, device-level
> micro-powerlosses seem to pose an bigger threat to data reliability.

Which is fine until the drive, bluntly put, lies to you. Cheaper drives
are prone to this, in order to look good in benchmarks. Especially as
it's hard to detect until you get screwed over by exactly this sort of
thing.

Cheers,
Wol


* Re: Filesystem corruption on RAID1
  2017-08-17 17:33                   ` Wols Lists
@ 2017-08-17 20:50                     ` Gionatan Danti
  2017-08-17 21:01                       ` Roger Heflin
  2017-08-17 22:51                       ` Wols Lists
  0 siblings, 2 replies; 46+ messages in thread
From: Gionatan Danti @ 2017-08-17 20:50 UTC (permalink / raw)
  To: Wols Lists; +Cc: Roger Heflin, Reindl Harald, Roman Mamedov, Linux RAID

Il 17-08-2017 19:33 Wols Lists ha scritto:
> Which is fine until the drive, bluntly put, lies to you. Cheaper drives
> are prone to this, in order to look good in benchmarks. Especially as
> it's hard to detect until you get screwed over by exactly this sort of
> thing.

It's more complex, actually. The hardware did not "lie" to me, as it
correctly flushes its caches when instructed to do so.
The problem is that a micro power loss wiped the cache *before* the drive
had a chance to flush it, and the operating system did not detect this
condition.

From what I read on the linux-scsi and linux-ide lists, the host OS
cannot tell the difference between a SATA link glitch and a SATA
power-off/power-on. This sounds to me like a SATA specification problem,
rather than a disk/OS one. However, a fix should be possible by examining
some specific SMART values, which identify the power-loss/power-on
condition.
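
A rough way to check for such an event from the host would be something
like this (example device; exact attribute names vary by vendor):

  smartctl -A /dev/sda | grep -Ei 'power.?cycle|power.?off|unsafe'
  # e.g. Power_Cycle_Count (id 12) or Power-Off_Retract_Count (id 192)
  # incrementing while the host never rebooted points at a device-side
  # power loss.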

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-08-17 20:50                     ` Gionatan Danti
@ 2017-08-17 21:01                       ` Roger Heflin
  2017-08-17 21:21                         ` Gionatan Danti
  2017-08-17 22:51                       ` Wols Lists
  1 sibling, 1 reply; 46+ messages in thread
From: Roger Heflin @ 2017-08-17 21:01 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Wols Lists, Reindl Harald, Roman Mamedov, Linux RAID

But even if you figured out which it was, you would have no way to know
which writes were still sitting in the cache; it could be pretty much any
write from the last few seconds (or longer, depending on how exactly the
drive firmware works), and it would add additional complexity to keep a
list of recent writes to validate that they actually happened in the case
of an unexpected drive reset. This is probably more of an "avoid this
failure condition" situation, since this failure condition is not a
normal failure mode but rather a very rare one.

On Thu, Aug 17, 2017 at 3:50 PM, Gionatan Danti <g.danti@assyoma.it> wrote:
> Il 17-08-2017 19:33 Wols Lists ha scritto:
>>
>> Which is fine until the drive, bluntly put, lies to you. Cheaper drives
>> are prone to this, in order to look good in benchmarks. Especially as
>> it's hard to detect until you get screwed over by exactly this sort of
>> thing.
>
>
> It's more complex, actually. The hardware did not "lie" to me, as it
> correcly flushes caches when instructed to do.
> The problem is that a micro-powerloss wiped the cache *before* the drive had
> a chance to flush it, and the operating system did not detect this
> condition.
>
> From what I read on the linux-scsi and linux-ide lists, the host OS can not
> tell between a SATA link glitch and a SATA poweroff/poweron. This sound to
> me as a SATA specification problem, rather than a disk/OS one. However, a
> fix should be possible by examining some specific SMART values, which
> identify the powerloss/poweron condition.
>
> Regards.
>
> --
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-08-17 21:01                       ` Roger Heflin
@ 2017-08-17 21:21                         ` Gionatan Danti
  2017-08-17 21:23                           ` Gionatan Danti
  0 siblings, 1 reply; 46+ messages in thread
From: Gionatan Danti @ 2017-08-17 21:21 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Wols Lists, Reindl Harald, Roman Mamedov, Linux RAID

Il 17-08-2017 23:01 Roger Heflin ha scritto:
> But even if you figured out which it was, you would have no way to
> know what writes were still sitting in the cache, it could be pretty
> much any writes from the last few seconds (or longer depending on how
> exactly the drive firmware works), and it would add additional
> complexity to keep a list of recent writes to validate actually
> happened in the case of an unexpected drive reset.  This is probably
> more of a avoid this failure condition since this failure condition is
> not a normal failure mode and more of a very rare failure mode.

Yes, but having identified the power-cycled disk, the system can not 
take the most sensible action.
For example, it can re-sync it with its mirror disk, basically treating
it as a --add-spare action.
Or it can simply consider the disk as failing, kicking it out of the
array and sending an alert email.

What the system should not do is nothing: as differences accumulate,
reading from the array becomes non-deterministic. In other words, two
reads can produce two different results, based on which disk was queried.
This *will* cause all sorts of problems.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-08-17 21:21                         ` Gionatan Danti
@ 2017-08-17 21:23                           ` Gionatan Danti
  0 siblings, 0 replies; 46+ messages in thread
From: Gionatan Danti @ 2017-08-17 21:23 UTC (permalink / raw)
  To: Roger Heflin
  Cc: Wols Lists, Reindl Harald, Roman Mamedov, Linux RAID, linux-raid-owner

Il 17-08-2017 23:21 Gionatan Danti ha scritto:
> Il 17-08-2017 23:01 Roger Heflin ha scritto:
>> But even if you figured out which it was, you would have no way to
>> know what writes were still sitting in the cache, it could be pretty
>> much any writes from the last few seconds (or longer depending on how
>> exactly the drive firmware works), and it would add additional
>> complexity to keep a list of recent writes to validate actually
>> happened in the case of an unexpected drive reset.  This is probably
>> more of a avoid this failure condition since this failure condition is
>> not a normal failure mode and more of a very rare failure mode.
> 
> Yes, but having identified the power-cycled disk, the system can not
> take the most sensible action.

Sorry, this should read:
"Yes, but having identified the power-cycled disk, the system can *now* 
take the most sensible action"

> For example, it can re-sync it with its mirror disk, basically
> treating it as a --add-spare action.
> Or it can simply considering the disk as failing, kicking off it from
> the array and sending an alert email.
> 
> What the system should not do is doing nothing: as differences
> accumulates, reading from the array become non-deterministic. In other
> words, two reads can produce two different results, based on what disk
> was queried. This *will* cause all sort of problems.
> 
> Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-08-17 20:50                     ` Gionatan Danti
  2017-08-17 21:01                       ` Roger Heflin
@ 2017-08-17 22:51                       ` Wols Lists
  2017-08-18 12:26                         ` Gionatan Danti
  1 sibling, 1 reply; 46+ messages in thread
From: Wols Lists @ 2017-08-17 22:51 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Roger Heflin, Reindl Harald, Roman Mamedov, Linux RAID

On 17/08/17 21:50, Gionatan Danti wrote:
> 
> It's more complex, actually. The hardware did not "lie" to me, as it
> correcly flushes caches when instructed to do.
> The problem is that a micro-powerloss wiped the cache *before* the drive
> had a chance to flush it, and the operating system did not detect this
> condition.

Except that that is not what should be happening. I don't know my hard
drive details, but I believe drives have an instruction "async write
this data and let me know when you have done so".

This should NOT return "yes I've flushed it TO cache". Which is how you
get your problem - the level above thinks it's been safely flushed to
disk (because the disk has said "yes I've got it"), but it then gets
lost because of your power fluctuation. It should only acknowledge it
*after* it's been flushed *from* cache.

And this is apparently exactly what cheap drives do ...

If the level above says "tell me when it's safely on disk", and the
drive truly does as it's told, your problem won't happen because the
block layer will time out waiting for the acknowledgement and retry the
write.

Cheers,
Wol


* Re: Filesystem corruption on RAID1
  2017-08-17 22:51                       ` Wols Lists
@ 2017-08-18 12:26                         ` Gionatan Danti
  2017-08-18 12:54                           ` Roger Heflin
  0 siblings, 1 reply; 46+ messages in thread
From: Gionatan Danti @ 2017-08-18 12:26 UTC (permalink / raw)
  To: Wols Lists; +Cc: Roger Heflin, Reindl Harald, Roman Mamedov, Linux RAID

Il 18-08-2017 00:51 Wols Lists ha scritto:
> Except that that is not what should be happening. I don't know my hard
> drive details, but I believe drives have an instruction "async write
> this data and let me know when you have done so".
> 
> This should NOT return "yes I've flushed it TO cache". Which is how you
> get your problem - the level above thinks it's been safely flushed to
> disk (because the disk has said "yes I've got it"), but it then gets
> lost because of your power fluctuation. It should only acknowledge it
> *after* it's been flushed *from* cache.
> 
> And this is apparently exactly what cheap drives do ...
> 
> If the level above says "tell me when it's safely on disk", and the
> drive truly does as its told, your problem won't happen because the 
> disk
> block layer will time out waiting for the acknowledgement and retry the
> write.

SATA drives generally guarantee persistence on the physical medium by
issuing *two* different FLUSH CACHE commands, which do *not* form an
atomic operation. In other words, it's not a problem of "cheap drives" or
"lying hardware"; rather, it seems to be a specific SATA limitation.

This means the problem cannot be solved by simply "buying better disks".
The traditional flushing/barrier infrastructure simply has *no* method to
ensure an atomic commit at the hardware level, and if something goes
wrong between the two flushes, a (small) possibility exists of having
corrupted writes without any I/O error reported to the upper layer, even
in the case of sync() writes. It's basically like a failing DRAM cache,
but with *no* reported failures...

Newer drives should implement FUA, but I don't know if libata already
uses it by default. Anyway, the disk's firmware is free to split a single
FUA write into multiple internal operations, so I am not sure it solves
all problems.
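
To check what a given drive advertises, something like this works
(example device name):

  hdparm -W /dev/sda                   # current volatile write cache setting
  hdparm -I /dev/sda | grep -i fua     # whether the drive reports FUA write support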

I really found the linux-scsi discussion interesting. Give it a look...

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-08-18 12:26                         ` Gionatan Danti
@ 2017-08-18 12:54                           ` Roger Heflin
  2017-08-18 19:42                             ` Gionatan Danti
  0 siblings, 1 reply; 46+ messages in thread
From: Roger Heflin @ 2017-08-18 12:54 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Wols Lists, Reindl Harald, Roman Mamedov, Linux RAID

I have noticed that all of the hardware RAID controllers explicitly turn
off the disks' write cache, so this would eliminate the issue, but the
cost is much slower write times.

It makes the hardware RAID controllers (and disk arrays) become uselessly
slow when their battery backup dies and disables the RAID card's and/or
array's write cache.

Remember: safe, fast and cheap, you only get to pick 2. We generally pick
fast and cheap; the disk arrays/RAID controllers pick safe and fast, but
not cheap, as a hardware RAID controller with write-cache backup of some
sort is quite expensive.

On Fri, Aug 18, 2017 at 7:26 AM, Gionatan Danti <g.danti@assyoma.it> wrote:
> Il 18-08-2017 00:51 Wols Lists ha scritto:
>>
>> Except that that is not what should be happening. I don't know my hard
>> drive details, but I believe drives have an instruction "async write
>> this data and let me know when you have done so".
>>
>> This should NOT return "yes I've flushed it TO cache". Which is how you
>> get your problem - the level above thinks it's been safely flushed to
>> disk (because the disk has said "yes I've got it"), but it then gets
>> lost because of your power fluctuation. It should only acknowledge it
>> *after* it's been flushed *from* cache.
>>
>> And this is apparently exactly what cheap drives do ...
>>
>> If the level above says "tell me when it's safely on disk", and the
>> drive truly does as its told, your problem won't happen because the disk
>> block layer will time out waiting for the acknowledgement and retry the
>> write.
>
>
> SATA drives generally guarantee persistent storage on physical medium by
> issuing *two* different FLUSH_CACHE commands, which do *not* form an atomic
> operation. In other words, it's not a problem of "cheap drives" or "lying
> hardware", rather, it seems a specific SATA limitation.
>
> This means the problem can not be solved by simply "buying better disks".
> Traditional flushing/barrier infrastructure simply has *no* method to ensure
> an atomic commit at the hardware level, and if something goes wrong between
> the two flushes, a (small) possibility exists to have corrupted writes
> without I/O errors reported to the upper layer, even in case of sync()
> writes. It's basically as a failing DRAM cache, but with *no* real
> failures...
>
> Newer drivers should implement FUAs, but I don't know if libata alredy uses
> them by default. Anyway, the disk's firmware is free to split a single FUA
> in more internal operations, so I am not sure they solves all problems.
>
> I really found the linux-scsi discussion interesting. Give it a look...
>
>
> Regards.
>
> --
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8


* Re: Filesystem corruption on RAID1
  2017-08-18 12:54                           ` Roger Heflin
@ 2017-08-18 19:42                             ` Gionatan Danti
  2017-08-20  7:14                               ` Mikael Abrahamsson
  0 siblings, 1 reply; 46+ messages in thread
From: Gionatan Danti @ 2017-08-18 19:42 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Wols Lists, Reindl Harald, Roman Mamedov, Linux RAID

Il 18-08-2017 14:54 Roger Heflin ha scritto:
> I have noticed all of the hardware raid controllers explicitly turn
> off the disk's write cache so this would eliminate this issue, but the
> cost is much slower write times.

True...

> It makes the hardware raid controllers (and disk arrays) become
> uselessly slow when their battery backup dies and disables the raid
> card and/or arrays write cache.

...true...

> Remember, safe, fast and cheap, you only get to pick 2.   We generally
> pick fast and cheap, the disk arrays/raid controllers pick safe and
> fast, but not so cheap as a hardware raid controller with write cache
> backup of some sort are quite expensive.

...and true. I am not arguing any of these points.

What really surprised me was to realize that, facing micro-powerlosses, 
*even sync() writes* can be vulnerable to undetected data loss, at least 
when not using FUAs (using instead the common barrier infrastructure).

So while many (old) mismatch_cnt reports on RAID1/10 arrays were 
dismissed as "don't bother, it's a harmless RAID1 thing", I really think 
that some were genuine corruptions due to micro power losses and similar 
causes.

If nothing else, such reports really emphasize the need to have a 
"trusted" mismatch_cnt for mirrored arrays, even in the face of some 
performance loss (due to not using zero-copy anymore).
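
(For reference, the mismatch count in question is the one exposed via 
sysfs; /dev/md0 is just an example device name:

  echo check > /sys/block/md0/md/sync_action   # count inconsistencies, no rewrite
  cat /sys/block/md0/md/sync_action            # reports "check" while running
  cat /sys/block/md0/md/mismatch_cnt           # non-zero => the legs differ

The point above is about how much that final number can be trusted.)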

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-18 19:42                             ` Gionatan Danti
@ 2017-08-20  7:14                               ` Mikael Abrahamsson
  2017-08-20  7:24                                 ` Gionatan Danti
  2017-08-20 23:22                                 ` Chris Murphy
  0 siblings, 2 replies; 46+ messages in thread
From: Mikael Abrahamsson @ 2017-08-20  7:14 UTC (permalink / raw)
  To: Gionatan Danti
  Cc: Roger Heflin, Wols Lists, Reindl Harald, Roman Mamedov, Linux RAID

On Fri, 18 Aug 2017, Gionatan Danti wrote:

> So while many (old) mismatch_cnt reports on RAID1/10 arrays where 
> dismissed as "don't bother, it's a harmless RAID1 thing", I really think 
> than some were genuine corruptions due to micro powerlosses and similar 
> causes.

After a non-clean poweroff there is now a possible mismatch between the 
RAID1 drives, and fsck runs. It reads from the drives and fixes problems. 
However, because the RAID1 drives contain different information, some of 
the errors are not fixed. Next time anything comes along, it might read 
from a different drive than the one fsck read from, and now we have 
corruption.

Wouldn't it make sense to have an option where fsck can do its reads and 
the md layer would run "repair" on all stripes that fsck touches? Whatever 
information is handed off to fsck would then always have had its parity 
checked (and repaired) if there was a mismatch.

The problem here with issuing a "repair" action is that it might actually 
copy data from the drive that fsck didn't read from, so now, even though 
fsck thought it had made everything clean in the fs, it's no longer clean 
because md "repair" copied non-clean information onto the drive that fsck 
looked at and deemed to be ok?

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20  7:14                               ` Mikael Abrahamsson
@ 2017-08-20  7:24                                 ` Gionatan Danti
  2017-08-20 10:43                                   ` Mikael Abrahamsson
  2017-08-20 23:22                                 ` Chris Murphy
  1 sibling, 1 reply; 46+ messages in thread
From: Gionatan Danti @ 2017-08-20  7:24 UTC (permalink / raw)
  To: Mikael Abrahamsson
  Cc: Roger Heflin, Wols Lists, Reindl Harald, Roman Mamedov, Linux RAID

Il 20-08-2017 09:14 Mikael Abrahamsson ha scritto:
> After a non-clean poweroff and possible mismatch now between the RAID1
> drives, and now fsck runs. It reads from the drives and fixes problem.
> However because the RAID1 drives contain different information, some
> of the errors are not fixed. Next time anything comes along, it might
> read from a different drive than what fsck read from, and now we have
> corruption.

It can be even worse: if fsck reads from the disk with corrupted data 
and tries to repair based on that corrupted information, it can blow up 
the filesystem completely.

In my case, heavy XFS corruption was prevented by the journal metadata 
checksum, which detected a corrupted journal and stopped the mount. 
However, some minor corruption found its way into the dentry/inode 
structures.

As this is a backup machine, it was not a big deal: I simply recreated 
the filesystem from scratch. However, the failure mode (synced writes 
that were corrupted) was quite scary.

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20  7:24                                 ` Gionatan Danti
@ 2017-08-20 10:43                                   ` Mikael Abrahamsson
  2017-08-20 13:07                                     ` Wols Lists
  2017-08-31 22:55                                     ` Robert L Mathews
  0 siblings, 2 replies; 46+ messages in thread
From: Mikael Abrahamsson @ 2017-08-20 10:43 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Linux RAID

On Sun, 20 Aug 2017, Gionatan Danti wrote:

> It can be even worse: if fsck reads from the disks with corrupted data 
> and tries to repair based on these corrupted information, it can blow up 
> the filesystem completely.

Indeed, but as far as I know there is nothing md can do about this. What 
md could do is at least present a consistent view of the data to 
fsck (which for raid1 would mean reading all stripes and issuing "repair" 
if they don't match). Yes, this might indeed cause corruption, but at 
least it would be consistent and visible.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 10:43                                   ` Mikael Abrahamsson
@ 2017-08-20 13:07                                     ` Wols Lists
  2017-08-20 15:38                                       ` Adam Goryachev
  2017-08-20 19:01                                       ` Gionatan Danti
  2017-08-31 22:55                                     ` Robert L Mathews
  1 sibling, 2 replies; 46+ messages in thread
From: Wols Lists @ 2017-08-20 13:07 UTC (permalink / raw)
  To: Mikael Abrahamsson, Gionatan Danti; +Cc: Linux RAID

On 20/08/17 11:43, Mikael Abrahamsson wrote:
> On Sun, 20 Aug 2017, Gionatan Danti wrote:
> 
>> It can be even worse: if fsck reads from the disks with corrupted data
>> and tries to repair based on these corrupted information, it can blow
>> up the filesystem completely.
> 
> Indeed, but as far as I know there is nothing md can do about this. What
> md could do about it is at least present a consistent view of data to
> fsck (which for raid1 would be read all stripes and issue "repair" if
> they don't match). Yes, this might indeed cause corruption but at least
> it would be consistent and visible.
> 
Which is exactly what my "force integrity check on read" proposal would
have achieved, but that generated so much heat and argument IN FAVOUR of
returning possibly corrupt data that I'll probably get flamed to high
heaven if I bring it back up again. Yes, the performance hit is probably
awful, yes it can only fix things if it's got raid-6 or a 3-disk-or-more
raid-1 array, but the idea was that if you knew or suspected something
was wrong, this would force a read error somewhere in the stack if the
raid wasn't consistent.

Switching it on then running your fsck might trash chunks of the
filesystem, but at least (a) it would be known to be consistent
afterwards, and (b) you'd know what had been trashed!

Cheers,
Wol

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 13:07                                     ` Wols Lists
@ 2017-08-20 15:38                                       ` Adam Goryachev
  2017-08-20 15:48                                         ` Mikael Abrahamsson
  2017-08-20 19:03                                         ` Gionatan Danti
  2017-08-20 19:01                                       ` Gionatan Danti
  1 sibling, 2 replies; 46+ messages in thread
From: Adam Goryachev @ 2017-08-20 15:38 UTC (permalink / raw)
  To: Wols Lists, Mikael Abrahamsson, Gionatan Danti; +Cc: Linux RAID



On 20/8/17 23:07, Wols Lists wrote:
> On 20/08/17 11:43, Mikael Abrahamsson wrote:
>> On Sun, 20 Aug 2017, Gionatan Danti wrote:
>>
>>> It can be even worse: if fsck reads from the disks with corrupted data
>>> and tries to repair based on these corrupted information, it can blow
>>> up the filesystem completely.
>> Indeed, but as far as I know there is nothing md can do about this. What
>> md could do about it is at least present a consistent view of data to
>> fsck (which for raid1 would be read all stripes and issue "repair" if
>> they don't match). Yes, this might indeed cause corruption but at least
>> it would be consistent and visible.
>>
> Which is exactly what my "force integrity check on read" proposal would
> have achieved, but that generated so much heat and argument IN FAVOUR of
> returning possibly corrupt data that I'll probably get flamed to high
> heaven if I bring it back up again. Yes, the performance hit is probably
> awful, yes it can only fix things if it's got raid-6 or a 3-disk-or-more
> raid-1 array, but the idea was that if you knew or suspected something
> was wrong, this would force a read error somewhere in the stack if the
> raid wasn't consistent.
>
> Switching it on then running your fsck might trash chunks of the
> filesystem, but at least (a) it would be known to be consistent
> afterwards, and (b) you'd know what had been trashed!
In the case where you know there are "probably" some inconsistencies, 
you have a few choices (see the command sketch below):
1) If you know which disk is faulty, fail it, clear its superblock and 
re-add it. It will be re-written from the known good drive.
2) If you don't know which drive is faulty, or both drives accrued 
random write errors, then all you can do is make sure that both drives 
have the same data (even where it is wrong). So just do a check/repair, 
which will ensure both drives are consistent, then you can safely do the 
fsck. (Assuming you fixed the problem causing the random write errors first.)
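
A rough sketch of both options, with made-up device names (/dev/md0 built 
from sda1 and sdb1), not a recipe to follow blindly:

  # 1) known-bad disk: kick it, wipe its superblock, re-add => full rebuild
  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  mdadm --zero-superblock /dev/sdb1
  mdadm /dev/md0 --add /dev/sdb1

  # 2) unknown culprit: force the legs to agree first, then fsck
  echo repair > /sys/block/md0/md/sync_action
  # wait for it to finish (watch /proc/mdstat), then run fsck on /dev/md0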

Your proposed option to read from all (or at least 2) data sources to 
ensure data consistency is an online version of the above process in 
(2), not a bad tool to have available, but not required in this scenario 
(IMHO). It is more useful when you think all drives are OK, and you want 
to be *sure* that they are OK on a continuous basis, not just after you 
think there might be a problem.

While I suspect patches would be accepted, without someone capable of 
actually writing the code being interested, it probably won't 
happen (until one of those people needs it).

Regards,
Adam

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 15:38                                       ` Adam Goryachev
@ 2017-08-20 15:48                                         ` Mikael Abrahamsson
  2017-08-20 16:10                                           ` Wols Lists
  2017-08-20 19:11                                           ` Gionatan Danti
  2017-08-20 19:03                                         ` Gionatan Danti
  1 sibling, 2 replies; 46+ messages in thread
From: Mikael Abrahamsson @ 2017-08-20 15:48 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Linux RAID

On Mon, 21 Aug 2017, Adam Goryachev wrote:

> data (even where it is wrong). So just do a check/repair which will 
> ensure both drives are consistent, then you can safely do the fsck. 
> (Assuming you fixed the problem causing random write errors first).

This involves manual intervention.

While I don't know how to implement this, let's at least see if we can 
architect something for throwing ideas around.

What about having an option for any raid level that would do "repair on 
read"? You could set it to "0" or "1". For RAID1 it would mean reading 
all copies and, if there is an inconsistency, picking one and writing it 
to all of them. It could also be some kind of IOCTL option, I guess. For 
RAID5/6, read all data drives and check parity. If the parity is wrong, 
rewrite it.

This could mean that if filesystem developers wanted to do repair (and 
this could be a userspace option or mount option), it would use the 
aforementioned option for all fsck-like operations to make sure that 
metadata was consistent while doing fsck (this would be different for 
different tools, depending on whether it's an "fs needs to be mounted" 
type of fs or an "offline fsck" type of filesystem). Then it could go 
back to normal operation for everything else, which would hopefully not 
cause catastrophic failures to the filesystem, but instead just 
individual file corruption in case of mismatches.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 15:48                                         ` Mikael Abrahamsson
@ 2017-08-20 16:10                                           ` Wols Lists
  2017-08-20 23:11                                             ` Adam Goryachev
  2017-08-20 19:11                                           ` Gionatan Danti
  1 sibling, 1 reply; 46+ messages in thread
From: Wols Lists @ 2017-08-20 16:10 UTC (permalink / raw)
  To: Mikael Abrahamsson, Adam Goryachev; +Cc: Linux RAID

On 20/08/17 16:48, Mikael Abrahamsson wrote:
> On Mon, 21 Aug 2017, Adam Goryachev wrote:
> 
>> data (even where it is wrong). So just do a check/repair which will
>> ensure both drives are consistent, then you can safely do the fsck.
>> (Assuming you fixed the problem causing random write errors first).
> 
> This involves manual intervention.
> 
> While I don't know how to implement this, let's at least see if we can
> architect something for throwing ideas around.
> 
> What about having an option for any raid level that would do "repair on
> read". So you can do "0" or "1" on this. RAID1 would mean it reads all
> stripes and if there is inconsistency, pick one and write it to all of
> them. It could also be some kind of IOCTL option I guess. For RAID5/6,
> read all data drives, and check parity. If parity is wrong, write parity.
> 
> This could mean that if filesystem developers wanted to do repair (and
> this could be a userspace option or mount option), it would use the
> aforementioned option for all fsck-like operations to make sure that
> metadata was consistent while doing fsck (this would be different for
> different tools, if it's an "fs needs to be mounted"-type of fs, or if
> it's an "offline fsck" type filesystem. Then it could go back to normal
> operation for everything else that would hopefully not cause
> catastrophic failures to the filesystem, but instead just individual
> file corruption in case of mismatches.
> 
Look for the thread "RFC Raid error detection and auto-recovery", 10th May.

Basically, that proposed a three-way flag - "default" is the current
"read the data section", "check" would read the entire stripe and
compare a mirror or calculate parity on a raid and return a read error
if it couldn't work out the correct data, and "fix" would write the
correct data back if it could work it out.

So basically, on a two-disk raid-1, or raid 4 or 5, both "check" and
"fix" would return read errors if there's a problem and you're SOL
without a backup.

With a three-disk or more raid-1, or raid-6, it would return the correct
data (and fix the stripe) if it could, otherwise again you're SOL.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 13:07                                     ` Wols Lists
  2017-08-20 15:38                                       ` Adam Goryachev
@ 2017-08-20 19:01                                       ` Gionatan Danti
  1 sibling, 0 replies; 46+ messages in thread
From: Gionatan Danti @ 2017-08-20 19:01 UTC (permalink / raw)
  To: Wols Lists; +Cc: Mikael Abrahamsson, Linux RAID

Il 20-08-2017 15:07 Wols Lists ha scritto:
> Which is exactly what my "force integrity check on read" proposal would
> have achieved, but that generated so much heat and argument IN FAVOUR 
> of
> returning possibly corrupt data that I'll probably get flamed to high
> heaven if I bring it back up again.

I think the aversion to such an approach is due to:
a) a *big* performance degradation (you get the IOPS of a single disk);
b) the existence of all-encompassing checksummed filesystems such as ZFS 
and BTRFS[1];
c) the difficulty of actually writing such code;
d) an underestimation of how often these data-corruption problems 
happen in real life.

I cannot really blame MDRAID for what it provides, as it is incredibly 
flexible and very fast. Sure, a user-selectable option to 
auto-discover/correct corrupted data would be great, but it seems that 
this is not a road MDRAID will ever take.

However, a possible solution would be to use dm-integrity on top of the 
single component devices of an MDRAID array. Have a look at Stratis[2], 
it will be interesting...
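
As a rough sketch of the dm-integrity idea (hypothetical devices, using 
the integritysetup tool that ships with recent cryptsetup; not a tested 
recipe):

  integritysetup format /dev/sda1
  integritysetup open   /dev/sda1 int-a
  integritysetup format /dev/sdb1
  integritysetup open   /dev/sdb1 int-b
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/mapper/int-a /dev/mapper/int-b

A checksum failure on one leg then surfaces as a read error, which MDRAID 
can satisfy (and rewrite) from the other mirror.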


[1] In the current state, I do not really trust BTRFS. I put much more 
hopes on ZoL...
[2] https://stratis-storage.github.io/StratisSoftwareDesign.pdf

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 15:38                                       ` Adam Goryachev
  2017-08-20 15:48                                         ` Mikael Abrahamsson
@ 2017-08-20 19:03                                         ` Gionatan Danti
  1 sibling, 0 replies; 46+ messages in thread
From: Gionatan Danti @ 2017-08-20 19:03 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Wols Lists, Mikael Abrahamsson, Linux RAID

Il 20-08-2017 17:38 Adam Goryachev ha scritto:
> In the case where you know there are "probably" some inconsistencies,
> you have a few choices:
> 1) If you know which disk is faulty, then fail it, then clean the
> superblock and add it. It will be re-written from the known good drive

No need to clear the superblock. You can re-add it using the 
"--add-spare" option which will force a full re-sync from the mirror 
disk.
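
With hypothetical names, that would be something like:

  mdadm /dev/md0 --add-spare /dev/sdb1   # after the stale leg was failed/removed

which forces a full resync from the surviving mirror instead of a 
bitmap-based re-add.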

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 15:48                                         ` Mikael Abrahamsson
  2017-08-20 16:10                                           ` Wols Lists
@ 2017-08-20 19:11                                           ` Gionatan Danti
  1 sibling, 0 replies; 46+ messages in thread
From: Gionatan Danti @ 2017-08-20 19:11 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Adam Goryachev, Linux RAID, linux-raid-owner

Il 20-08-2017 17:48 Mikael Abrahamsson ha scritto:
> 
> This involves manual intervention.
> 
> While I don't know how to implement this, let's at least see if we can
> architect something for throwing ideas around.
> 
> What about having an option for any raid level that would do "repair
> on read". So you can do "0" or "1" on this. RAID1 would mean it reads
> all stripes and if there is inconsistency, pick one and write it to
> all of them. It could also be some kind of IOCTL option I guess. For
> RAID5/6, read all data drives, and check parity. If parity is wrong,
> write parity.

Wait, isn't that what MDRAID already does by issuing "echo repair > 
sync_action"?

The big plus would be not to blindly copy the first mirror/stripe, but 
rather to identify the correct one and use it to correct any corrupted 
data.

Obviously you need sufficient data to do that, by means of 3-way 
RAID1, double parity (RAID6) or checksummed data blocks (ZFS, BTRFS and 
dm-integrity).

Please note that these methods alone do not provide complete protection 
against other failure modes such as phantom writes; however, any of them 
would significantly increase the current level of data protection.

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 16:10                                           ` Wols Lists
@ 2017-08-20 23:11                                             ` Adam Goryachev
  2017-08-21 14:03                                               ` Anthony Youngman
  0 siblings, 1 reply; 46+ messages in thread
From: Adam Goryachev @ 2017-08-20 23:11 UTC (permalink / raw)
  To: Wols Lists, Mikael Abrahamsson; +Cc: Linux RAID

On 21/08/17 02:10, Wols Lists wrote:
> On 20/08/17 16:48, Mikael Abrahamsson wrote:
>> On Mon, 21 Aug 2017, Adam Goryachev wrote:
>>
>>> data (even where it is wrong). So just do a check/repair which will
>>> ensure both drives are consistent, then you can safely do the fsck.
>>> (Assuming you fixed the problem causing random write errors first).
>> This involves manual intervention.
>>
>> While I don't know how to implement this, let's at least see if we can
>> architect something for throwing ideas around.
>>
>> What about having an option for any raid level that would do "repair on
>> read". So you can do "0" or "1" on this. RAID1 would mean it reads all
>> stripes and if there is inconsistency, pick one and write it to all of
>> them. It could also be some kind of IOCTL option I guess. For RAID5/6,
>> read all data drives, and check parity. If parity is wrong, write parity.
>>
>> This could mean that if filesystem developers wanted to do repair (and
>> this could be a userspace option or mount option), it would use the
>> aforementioned option for all fsck-like operations to make sure that
>> metadata was consistent while doing fsck (this would be different for
>> different tools, if it's an "fs needs to be mounted"-type of fs, or if
>> it's an "offline fsck" type filesystem. Then it could go back to normal
>> operation for everything else that would hopefully not cause
>> catastrophic failures to the filesystem, but instead just individual
>> file corruption in case of mismatches.
>>
> Look for the thread "RFC Raid error detection and auto-recovery, 10th May.
>
> Basically, that proposed a three-way flag - "default" is the current
> "read the data section", "check" would read the entire stripe and
> compare a mirror or calculate parity on a raid and return a read error
> if it couldn't work out the correct data, and "fix" would write the
> correct data back if it could work it out.
>
> So basically, on a two-disk raid-1, or raid 4 or 5, both "check" and
> "fix" would return read errors if there's a problem and you're SOL
> without a backup.
>
> With a three-disk or more raid-1, or raid-6, it would return the correct
> data (and fix the stripe) if it could, otherwise again you're SOL.

 From memory, the main sticking point was in implementing this with 
RAID6 and the argument that you might not be able to choose the "right" 
pieces of data because there wasn't a sufficient amount of data to know 
which was corrupted. Perhaps it would be an easier starting point to use 
RAID1 with three (or more) mirrors to implement this. You only need to 
read two drives to "check" that there is consensus (technically, 
int(n/2)+1, though you could start with just 2, which ensures there isn't 
one drive behaving badly). Once this is implemented, if you need larger 
arrays, then you would need to layer your RAID, using RAID61 with >=3 
mirror RAID1 components. Eventually, you might be able to migrate this 
to RAID6 or other levels, but at least once it is in the kernel, and 
proven to be working (and actually used by people), it will get a lot 
easier.

Regards,
Adam


-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20  7:14                               ` Mikael Abrahamsson
  2017-08-20  7:24                                 ` Gionatan Danti
@ 2017-08-20 23:22                                 ` Chris Murphy
  2017-08-21  5:57                                   ` Gionatan Danti
  2017-08-21  8:37                                   ` Mikael Abrahamsson
  1 sibling, 2 replies; 46+ messages in thread
From: Chris Murphy @ 2017-08-20 23:22 UTC (permalink / raw)
  To: Mikael Abrahamsson
  Cc: Gionatan Danti, Roger Heflin, Wols Lists, Reindl Harald,
	Roman Mamedov, Linux RAID

On Sun, Aug 20, 2017 at 1:14 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:

> After a non-clean poweroff and possible mismatch now between the RAID1
> drives, and now fsck runs. It reads from the drives and fixes problem.
> However because the RAID1 drives contain different information, some of the
> errors are not fixed. Next time anything comes along, it might read from a
> different drive than what fsck read from, and now we have corruption.

The fsck has no idea this is two drives; it thinks it's one and
overwrites whatever (virtual) blocks contain file system metadata
needing repair. Then md should take each fsck write, duplicate it
(for a 2-way mirror) and push those writes to each real physical device.

Since md doesn't read from both mirrors, it's possible there's a read
from a non-corrupt drive, which presents good information to fsck,
which then sees no reason to fix anything in that block; but the other
mirror does have corruption which thus goes undetected.

One way of dealing with it is to scrub (repair) so they both have the
same information to hand over to fsck. Fixups then get replicated to
disks by md.

Another way is to split the mirror (make one device faulty), and then
fix the remaining drive (now degraded). If that goes well, the 2nd
device can be re-added. Here's a caveat, though: how it resyncs will
depend on the write-intent bitmap being present. I have no idea if
write-intent bitmaps on two drives can get out of sync and what the
ensuing behavior is, but I'd like to think md will discover the fixed
drive's event count is higher than the re-added one's, and if necessary
does a full resync, rather than possibly re-introducing any
corruption.
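
(A quick way to see what each leg thinks happened, before deciding how to 
re-add it, is to look at the superblocks; made-up device names again:

  mdadm --examine /dev/sda1 /dev/sdb1 | grep -iE 'events|bitmap'

Whether md does a bitmap catch-up or a full resync hinges on what those 
values say.)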



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 23:22                                 ` Chris Murphy
@ 2017-08-21  5:57                                   ` Gionatan Danti
  2017-08-21  8:37                                   ` Mikael Abrahamsson
  1 sibling, 0 replies; 46+ messages in thread
From: Gionatan Danti @ 2017-08-21  5:57 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Mikael Abrahamsson, Roger Heflin, Wols Lists, Reindl Harald,
	Roman Mamedov, Linux RAID, chris

Il 21-08-2017 01:22 Chris Murphy ha scritto:
> Another way is to split the mirror (make one device faulty), and then
> fix the remaining drive (now degraded). If that goes well, the 2nd
> device can be re-added. Here's a caveat thought: how it resync's will
> depend on the write-intent bitmap being present. I have no idea if
> write-intent bitmaps on two drives can get out of sync and what the
> ensuing behavior is, but I'd like to think md will discover the fixed
> drive event count is higher than the re-added one, and if necessary
> does a full resync, rather than possibly re-introducing any
> corruption.

With the corruption I am replicating (brief SATA power interruptions), 
the event counts of both drives were identical, and so was the 
write-intent bitmap (not always, though).

To be 100% sure of completely copying from the mirror device, you have 
to re-add the corrupted drive as a spare, using "--add-spare". From the 
man page:

"--add-spare
Add a device as a spare. This is similar to --add except that it does 
not attempt --re-add first. The device will be added as a spare even if 
it looks like it could be a recent member of the array."

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 23:22                                 ` Chris Murphy
  2017-08-21  5:57                                   ` Gionatan Danti
@ 2017-08-21  8:37                                   ` Mikael Abrahamsson
  2017-08-21 12:28                                     ` Gionatan Danti
  2017-08-21 17:33                                     ` Chris Murphy
  1 sibling, 2 replies; 46+ messages in thread
From: Mikael Abrahamsson @ 2017-08-21  8:37 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux RAID

On Sun, 20 Aug 2017, Chris Murphy wrote:

> Since md doesn't read from both mirrors, it's possible there's a read 
> from a non-corrupt drive, which presents good information to fsck, which 
> then sees no reason to fix anything in that block; but the other mirror 
> does have corruption which thus goes undetected.

That was exactly what I wrote.

> One way of dealing with it is to scrub (repair) so they both have the 
> same information to hand over to fsck. Fixups then get replicated to 
> disks by md.

Yes, it is, but that would require a full repair before doing fsck. That 
seems excessive because that will take hours on larger drives.

> Another way is to split the mirror (make one device faulty), and then
> fix the remaining drive (now degraded). If that goes well, the 2nd
> device can be re-added. Here's a caveat thought: how it resync's will
> depend on the write-intent bitmap being present. I have no idea if
> write-intent bitmaps on two drives can get out of sync and what the
> ensuing behavior is, but I'd like to think md will discover the fixed
> drive event count is higher than the re-added one, and if necessary
> does a full resync, rather than possibly re-introducing any
> corruption.

This doesn't solve the problem, because it doesn't check whether the 
second mirror is out of sync with the first one; it'll only detect writes 
to the degraded array and sync those. It doesn't fix the "fsck read the 
block and it was fine, but on the second drive it's not fine" case.

In that case fsck would have to be modified to write back all blocks it 
read, to make them dirty so they get synced.

However, this again causes the problem that if there is an URE on the 
degraded array remaining drive, things will fail.

The only way to solve this is to add more code to implement a new mode 
which would be "repair-on-read".

I understand that we can't necessarily detect which drive has the right 
or wrong information, but this way we can at least make sure that when 
fsck is done, all the inodes and other metadata are consistent. 
Everything that fsck touched during the run will be consistent across 
all drives, with correct parity. It might not contain the "best" 
information that a more intelligent algorithm/metadata could have 
presented, but at least it's better than today, when after an fsck run 
you don't know whether parity is correct or not.

It would also be a good diagnostic tool for admins. If you suspect that 
you're getting inconsistencies but you're fine with the performance 
degradation then md could log inconsistencies somewhere so you know about 
them.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-21  8:37                                   ` Mikael Abrahamsson
@ 2017-08-21 12:28                                     ` Gionatan Danti
  2017-08-21 14:09                                       ` Anthony Youngman
  2017-08-21 17:33                                     ` Chris Murphy
  1 sibling, 1 reply; 46+ messages in thread
From: Gionatan Danti @ 2017-08-21 12:28 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Chris Murphy, Linux RAID, linux-raid-owner

Il 21-08-2017 10:37 Mikael Abrahamsson ha scritto:
> This doesn't solve the problem because it doesn't check if the second
> mirror is out of sync with the first one, because it'll only detect
> writes to the degraded array and sync those. It doesn't fix the "fsck
> read the block and it was fine, but on the second drive it's not
> fine".

As stated elsewhere, you can re-attach a detached device with 
"--add-spare": this will copy *all* data from the other mirror leg. 
However, it is vastly better to simply issue a "repair" action. Anyway, 
the basic problem remains: with larger drives, this will take many hours 
or even days.

> However, this again causes the problem that if there is an URE on the
> degraded array remaining drive, things will fail.

On relatively recent MDRAID code (kernel > 3.5.x), a URE on the 
remaining disk of a degraded array will *not* totally fail the array. 
Rather, a bad block is logged into the MDRAID superblock and a read 
error is returned to upper layers.
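
(The recorded entries can be listed per component device, e.g. on a 
hypothetical member:

  mdadm --examine-badblocks /dev/sda1

assuming an mdadm recent enough to know about the bad block log.)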

Anyway, this has little to do with the main problem: micro power losses 
can cause undetected, silent data corruption, even with synced writes.

> The only way to solve this is to add more code to implement a new mode
> which would be "repair-on-read".
> 
> I understand that we can't necessarily detect which drive has the
> right or wrong information, but at least we can this way make sure
> that when fsck is done, all the inodes and other metadata is now
> consistent. Everything that fsck touched during the fsck will be
> consistent across all drives, with correct parity. It might not
> contain the "best" information that could have been presented by a
> more intelligent algorithm/metadata, but at least it's better than
> today when after a fsck run you don't know if parity is correct or
> not.
> 
> It would also be a good diagnostic tool for admins. If you suspect
> that you're getting inconsistencies but you're fine with the
> performance degradation then md could log inconsistencies somewhere so
> you know about them.

I second that.
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 23:11                                             ` Adam Goryachev
@ 2017-08-21 14:03                                               ` Anthony Youngman
  0 siblings, 0 replies; 46+ messages in thread
From: Anthony Youngman @ 2017-08-21 14:03 UTC (permalink / raw)
  To: Adam Goryachev, Mikael Abrahamsson; +Cc: Linux RAID



On 21/08/17 00:11, Adam Goryachev wrote:
> On 21/08/17 02:10, Wols Lists wrote:
>> On 20/08/17 16:48, Mikael Abrahamsson wrote:
>>> On Mon, 21 Aug 2017, Adam Goryachev wrote:
>>>
>>>> data (even where it is wrong). So just do a check/repair which will
>>>> ensure both drives are consistent, then you can safely do the fsck.
>>>> (Assuming you fixed the problem causing random write errors first).
>>> This involves manual intervention.
>>>
>>> While I don't know how to implement this, let's at least see if we can
>>> architect something for throwing ideas around.
>>>
>>> What about having an option for any raid level that would do "repair on
>>> read". So you can do "0" or "1" on this. RAID1 would mean it reads all
>>> stripes and if there is inconsistency, pick one and write it to all of
>>> them. It could also be some kind of IOCTL option I guess. For RAID5/6,
>>> read all data drives, and check parity. If parity is wrong, write 
>>> parity.
>>>
>>> This could mean that if filesystem developers wanted to do repair (and
>>> this could be a userspace option or mount option), it would use the
>>> aforementioned option for all fsck-like operations to make sure that
>>> metadata was consistent while doing fsck (this would be different for
>>> different tools, if it's an "fs needs to be mounted"-type of fs, or if
>>> it's an "offline fsck" type filesystem. Then it could go back to normal
>>> operation for everything else that would hopefully not cause
>>> catastrophic failures to the filesystem, but instead just individual
>>> file corruption in case of mismatches.
>>>
>> Look for the thread "RFC Raid error detection and auto-recovery, 10th 
>> May.
>>
>> Basically, that proposed a three-way flag - "default" is the current
>> "read the data section", "check" would read the entire stripe and
>> compare a mirror or calculate parity on a raid and return a read error
>> if it couldn't work out the correct data, and "fix" would write the
>> correct data back if it could work it out.
>>
>> So basically, on a two-disk raid-1, or raid 4 or 5, both "check" and
>> "fix" would return read errors if there's a problem and you're SOL
>> without a backup.
>>
>> With a three-disk or more raid-1, or raid-6, it would return the correct
>> data (and fix the stripe) if it could, otherwise again you're SOL.
> 
>  From memory, the main sticking point was in implementing this with 
> RAID6 and the argument that you might not be able to choose the "right" 
> pieces of data because there wasn't a sufficient amount of data to know 
> which was corrupted.

That was the impression I got, but I really don't understand the 
problem. If *ANY* one stripe is corrupted, we have two unknowns, two 
parity blocks, and we can recalculate the missing stripe.

If two or more stripes are corrupt, the recovery will return garbage 
(which is detectable) and we return a read error. We DO NOT attempt to 
rewrite the stripe! In your words, if we can't choose the "right" piece 
of data, we bail and do nothing.

As I understood it, the worry was that we would run the recovery 
algorithm and then overwrite the data with garbage, but nobody ever gave 
me a plausible scenario where that could happen. The only plausible 
scenario is where multiple stripes are corrupted in such a way that the 
recovery algorithm is fooled into thinking only one stripe is affected. 
And if I read that paper correctly, the odds of that happening are very low.
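
(For the record, the single-corruption argument runs roughly as follows; 
this is the standard RAID-6 math (see H. Peter Anvin's "The mathematics 
of RAID-6", if memory serves), sketched from memory with "+" meaning xor 
and g a generator of GF(2^8):

  P = D_0 + D_1 + ... + D_{n-1}
  Q = g^0*D_0 + g^1*D_1 + ... + g^{n-1}*D_{n-1}

If exactly one data block D_z was silently replaced by X, recompute P' 
and Q' from the blocks actually read. Then

  dP = P + P' = D_z + X        dQ = Q + Q' = g^z * (D_z + X)

so z = log_g(dQ/dP) locates the bad block and D_z = X + dP repairs it. If 
only one of dP, dQ is non-zero, the corruption is in P or Q itself; if 
the computed z is not a valid block index, more than one block is bad and 
the only safe thing to do is bail out with a read error.)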

Short summary - if just one stripe is corrupted, then my proposal will 
fix and return CORRECT data. If however, more than one stripe is 
corrupted, then my proposal will with near-perfect accuracy bail and do 
nothing (apart from returning a read error). As I say, the only risk to 
the data is if the error looks like a single-stripe problem when it 
isn't, and that's unlikely.

I've had enough data-loss scenarios in my career to be rather paranoid 
about scribbling over stuff when I don't know what I'm doing ... (I do 
understand concerns about "using the wrong tool to fix the wrong 
problem", but you don't refuse to sell a punter a wheel-wrench because 
he might not be able to tell the difference between a flat tyre and a 
mis-firing engine).

Cheers,
Wol

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-21 12:28                                     ` Gionatan Danti
@ 2017-08-21 14:09                                       ` Anthony Youngman
  0 siblings, 0 replies; 46+ messages in thread
From: Anthony Youngman @ 2017-08-21 14:09 UTC (permalink / raw)
  To: Gionatan Danti, Mikael Abrahamsson
  Cc: Chris Murphy, Linux RAID, linux-raid-owner

On 21/08/17 13:28, Gionatan Danti wrote:
>> It would also be a good diagnostic tool for admins. If you suspect
>> that you're getting inconsistencies but you're fine with the
>> performance degradation then md could log inconsistencies somewhere so
>> you know about them.
> 
> I second that.
> Thanks.

Sounds like I should try and write the code for my RFC then :-)

Just be prepared for a lot of requests for help ... :-) Another item on 
my long list of "I'll do it when I get the chance" things :-(

Cheers,
Wol

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-21  8:37                                   ` Mikael Abrahamsson
  2017-08-21 12:28                                     ` Gionatan Danti
@ 2017-08-21 17:33                                     ` Chris Murphy
  2017-08-21 17:52                                       ` Reindl Harald
  1 sibling, 1 reply; 46+ messages in thread
From: Chris Murphy @ 2017-08-21 17:33 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Chris Murphy, Linux RAID

On Mon, Aug 21, 2017 at 2:37 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
> On Sun, 20 Aug 2017, Chris Murphy wrote:
>
>> Since md doesn't read from both mirrors, it's possible there's a read from
>> a non-corrupt drive, which presents good information to fsck, which then
>> sees no reason to fix anything in that block; but the other mirror does have
>> corruption which thus goes undetected.
>
>
> That was exactly what I wrote.
>
>> One way of dealing with it is to scrub (repair) so they both have the same
>> information to hand over to fsck. Fixups then get replicated to disks by md.
>
>
> Yes, it is, but that would require a full repair before doing fsck. That
> seems excessive because that will take hours on larger drives.

Hence we have ZFS and Btrfs and dm-integrity to unambiguously identify
corruption and prevent it from escaping to higher levels.

Given that you have multiple inconsistencies in the fs metadata, there's 
a good chance some data is also affected; data is a much bigger 
percentage of the disk.

Might as well bite the bullet and scrub the whole thing.


>
>> Another way is to split the mirror (make one device faulty), and then
>> fix the remaining drive (now degraded). If that goes well, the 2nd
>> device can be re-added. Here's a caveat thought: how it resync's will
>> depend on the write-intent bitmap being present. I have no idea if
>> write-intent bitmaps on two drives can get out of sync and what the
>> ensuing behavior is, but I'd like to think md will discover the fixed
>> drive event count is higher than the re-added one, and if necessary
>> does a full resync, rather than possibly re-introducing any
>> corruption.
>
>
> This doesn't solve the problem because it doesn't check if the second mirror
> is out of sync with the first one, because it'll only detect writes to the
> degraded array and sync those. It doesn't fix the "fsck read the block and
> it was fine, but on the second drive it's not fine".
>
> In that case fsck would have to be modified to write all blocks it read to
> make them dirty, so they're sync:ed.

OK so you have a corrupt underlying storage stack for possibly unknown
reasons, and you're just going to take a chance and overwrite the
entire file system. Seems like a bad hack to me, but I'd love to know
what the ext4 and XFS devs think about it.

The rule has always been: get the lower levels healthy first. Two 
mirrors that have the same event count but are not block-identical are 
a broken array.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-21 17:33                                     ` Chris Murphy
@ 2017-08-21 17:52                                       ` Reindl Harald
  0 siblings, 0 replies; 46+ messages in thread
From: Reindl Harald @ 2017-08-21 17:52 UTC (permalink / raw)
  To: Chris Murphy, Mikael Abrahamsson; +Cc: Linux RAID



Am 21.08.2017 um 19:33 schrieb Chris Murphy:
> On Mon, Aug 21, 2017 at 2:37 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
>> On Sun, 20 Aug 2017, Chris Murphy wrote:
>>
>>> Since md doesn't read from both mirrors, it's possible there's a read from
>>> a non-corrupt drive, which presents good information to fsck, which then
>>> sees no reason to fix anything in that block; but the other mirror does have
>>> corruption which thus goes undetected.
>>
>>
>> That was exactly what I wrote.
>>
>>> One way of dealing with it is to scrub (repair) so they both have the same
>>> information to hand over to fsck. Fixups then get replicated to disks by md.
>>
>>
>> Yes, it is, but that would require a full repair before doing fsck. That
>> seems excessive because that will take hours on larger drives.
> 
> Hence we have ZFS and Btrfs and dm-integrity to unambiguously identify
> corruption and prevent it from escaping to higher levels

where do we have ZFS?
where do we have *stable* BTRFS after Redhat gave up recently?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-20 10:43                                   ` Mikael Abrahamsson
  2017-08-20 13:07                                     ` Wols Lists
@ 2017-08-31 22:55                                     ` Robert L Mathews
  2017-09-01  5:39                                       ` Reindl Harald
  1 sibling, 1 reply; 46+ messages in thread
From: Robert L Mathews @ 2017-08-31 22:55 UTC (permalink / raw)
  To: Linux RAID

On 8/20/17 3:43 AM, Mikael Abrahamsson wrote:

> Indeed, but as far as I know there is nothing md can do about this. What
> md could do about it is at least present a consistent view of data to
> fsck (which for raid1 would be read all stripes and issue "repair" if
> they don't match). Yes, this might indeed cause corruption but at least
> it would be consistent and visible.

If you set all disks except one as "write-mostly", won't mdadm give you
a consistent view of it because it only reads from a single disk?
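
(For reference, a hypothetical two-disk example where sdb1 is the leg md 
should avoid reading from:

  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sda1 --write-mostly /dev/sdb1

or, on a live array, via sysfs:

  echo writemostly > /sys/block/md0/md/dev-sdb1/state

Reads then go to sda1 unless it fails.)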

(Sorry for the delayed reply to a two-week-old thread; I was on vacation.)

-- 
Robert L Mathews, Tiger Technologies, http://www.tigertech.net/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-08-31 22:55                                     ` Robert L Mathews
@ 2017-09-01  5:39                                       ` Reindl Harald
  2017-09-01 23:14                                         ` Robert L Mathews
  0 siblings, 1 reply; 46+ messages in thread
From: Reindl Harald @ 2017-09-01  5:39 UTC (permalink / raw)
  To: Linux RAID


Am 01.09.2017 um 00:55 schrieb Robert L Mathews:
> On 8/20/17 3:43 AM, Mikael Abrahamsson wrote:
> 
>> Indeed, but as far as I know there is nothing md can do about this. What
>> md could do about it is at least present a consistent view of data to
>> fsck (which for raid1 would be read all stripes and issue "repair" if
>> they don't match). Yes, this might indeed cause corruption but at least
>> it would be consistent and visible.
> 
> If you set all disks except one as "write-mostly", won't mdadm give you
> a consistent view of it because it only reads from a single disk?

you gain nothing when you completely lie to fsck and after that switch 
back to normal operation with the other disks part of the game again

and it only works on RAID1, and *really* RAID1 at that; otherwise my 
current RAID10 would be fast as hell instead of showing the terrible 
random lags, which are sometimes worse than before I replaced two out of 
4 disks with an SSD

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Filesystem corruption on RAID1
  2017-09-01  5:39                                       ` Reindl Harald
@ 2017-09-01 23:14                                         ` Robert L Mathews
  0 siblings, 0 replies; 46+ messages in thread
From: Robert L Mathews @ 2017-09-01 23:14 UTC (permalink / raw)
  To: Linux RAID

On 8/31/17 10:39 PM, Reindl Harald wrote:
> you gain nothing when you completly lie to fsck and after that switch
> back to normal operations with the other disks part of the game

Sorry, I didn't make myself clear. I meant that you could leave the
array like that (all disks except one write-mostly) permanently, in
normal use (and while fscking too). That way you always have a
consistent view of the array.

With an array composed of modern SSDs, such a setup still performs well
for many loads.

-- 
Robert L Mathews, Tiger Technologies, http://www.tigertech.net/

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread

Thread overview: 46+ messages
2017-07-13 15:35 Filesystem corruption on RAID1 Gionatan Danti
2017-07-13 16:48 ` Roman Mamedov
2017-07-13 21:28   ` Gionatan Danti
2017-07-13 21:34     ` Reindl Harald
2017-07-13 22:34       ` Gionatan Danti
2017-07-14  0:32         ` Reindl Harald
2017-07-14  0:52           ` Anthony Youngman
2017-07-14  1:10             ` Reindl Harald
2017-07-14 10:46           ` Gionatan Danti
2017-07-14 10:58             ` Reindl Harald
2017-08-17  8:23             ` Gionatan Danti
2017-08-17 12:41               ` Roger Heflin
2017-08-17 14:31                 ` Gionatan Danti
2017-08-17 17:33                   ` Wols Lists
2017-08-17 20:50                     ` Gionatan Danti
2017-08-17 21:01                       ` Roger Heflin
2017-08-17 21:21                         ` Gionatan Danti
2017-08-17 21:23                           ` Gionatan Danti
2017-08-17 22:51                       ` Wols Lists
2017-08-18 12:26                         ` Gionatan Danti
2017-08-18 12:54                           ` Roger Heflin
2017-08-18 19:42                             ` Gionatan Danti
2017-08-20  7:14                               ` Mikael Abrahamsson
2017-08-20  7:24                                 ` Gionatan Danti
2017-08-20 10:43                                   ` Mikael Abrahamsson
2017-08-20 13:07                                     ` Wols Lists
2017-08-20 15:38                                       ` Adam Goryachev
2017-08-20 15:48                                         ` Mikael Abrahamsson
2017-08-20 16:10                                           ` Wols Lists
2017-08-20 23:11                                             ` Adam Goryachev
2017-08-21 14:03                                               ` Anthony Youngman
2017-08-20 19:11                                           ` Gionatan Danti
2017-08-20 19:03                                         ` Gionatan Danti
2017-08-20 19:01                                       ` Gionatan Danti
2017-08-31 22:55                                     ` Robert L Mathews
2017-09-01  5:39                                       ` Reindl Harald
2017-09-01 23:14                                         ` Robert L Mathews
2017-08-20 23:22                                 ` Chris Murphy
2017-08-21  5:57                                   ` Gionatan Danti
2017-08-21  8:37                                   ` Mikael Abrahamsson
2017-08-21 12:28                                     ` Gionatan Danti
2017-08-21 14:09                                       ` Anthony Youngman
2017-08-21 17:33                                     ` Chris Murphy
2017-08-21 17:52                                       ` Reindl Harald
2017-07-14  1:48         ` Chris Murphy
2017-07-14  7:22           ` Roman Mamedov
